Beginner here. I'm trying to print certain lines from an XML file with awk and/or sed, and I need help.
I have an XML file like this:
<item id="26141427">
<properties>
<name>233D_camB_take02.mov</name>
<path>/Dailies Released/VT096_DAY41_2011_10_27</path>
<description>HI CU AARON PREPPING CAMERA</description>
<status></status>
<approved />
<created_by id="20184437">
<name>Movie</name>
</created_by>
<created_timestamp>2011-10-28T21:04:51Z</created_timestamp>
<modified_by id="17929743">
<name>Some dude</name>
</modified_by>
<modified_timestamp>2011-10-31T14:59:54Z</modified_timestamp>
<width>1280</width>
<height>720</height>
<timebase>24</timebase>
<mime_type>video/quicktime</mime_type>
</properties>
<attributes>
<attribute key="Camera">B</attribute>
<attribute key="Description">HI CU AARON PREPPING CAMERA</attribute>
<attribute key="End">16:40:32:00</attribute>
<attribute key="Name">233D-2B</attribute>
<attribute key="Notes"></attribute>
<attribute key="Scene">233D</attribute>
<attribute key="Shoot_Date">10/27/2011</attribute>
<attribute key="Shoot_Day">41</attribute>
<attribute key="Start">16:37:52:00</attribute>
<attribute key="Take">2</attribute>
<attribute key="Tape">VT096</attribute>
</attributes>
<tags />
<notes />
</item>
What I need is to print the lines:
<item id="26141427">
<name>233D_camB_take02.mov</name>
<attribute key="Name">233D-2B</attribute>
In the end I need this in a document:
item id="26141427"
233D_camB_take02.mov
233D-2B
Followed by a blank line and then the next item. There are multiple items in the document.
Some things to note: there may be multiple <name> </name> tags, but I only need the one containing the string ".mov". That string is always present in every item, and only once per item.
However, as the example above shows, there may or may not be other lines like <name>Movie</name> and <name>Some dude</name>. These need to be ignored. So while the other entries I'm looking for can be found by searching for their tags, it is probably better to find this entry by looking for the ".mov" string.
Also, there may or may not be an <attribute key="Name">some value</attribute> entry. If it is there, I need it. All other <attribute key="something"> tags need to be ignored.
Lastly, because each item may or may not have certain entries, this cannot be done by line number; it needs to be done by searching for patterns.
So, in summary:
<item id="123456"> - Will always be present and will only be present once per item. I need the output: item id="123456"
<name>something.mov</name> - Will always be present, but only once with the ".mov" string. Other <name> entries without ".mov" may or may not be present and should be ignored. I need the output: something.mov
<attribute key="Name">something</attribute> - May or may not be present. If it is there, I need the output: something
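The three rules above can be sketched as a single awk pass. This is only a minimal sketch, not a definitive solution. It assumes every tag sits on its own line as in the sample, that the attribute key is spelled "Name" with a capital N, and that the quotes around the id are kept as in the desired output. The file name sample_items.xml and the second item's .mov name are made up for the demonstration.

```shell
#!/bin/sh
# Hypothetical sample file; the real input would be marcherdailiescopy.xml.
cat > sample_items.xml <<'EOF'
<item id="26141427">
<properties>
<name>233D_camB_take02.mov</name>
<created_by id="20184437">
<name>Movie</name>
</created_by>
</properties>
<attributes>
<attribute key="Camera">B</attribute>
<attribute key="Name">233D-2B</attribute>
</attributes>
</item>
<item id="27385774">
<properties>
<name>100A_camA_take01.mov</name>
</properties>
</item>
EOF

result=$(awk '
# Start of an item: print a blank separator before every item but the first,
# then print the tag without its angle brackets.
/<item id="/ {
    if (seen) print ""
    seen = 1
    line = $0
    sub(/^[^<]*</, "", line)   # drop everything up to and including "<"
    sub(/>.*$/, "", line)      # drop the first ">" and anything after it
    print line
    next
}
# The .mov name, or the optional Name attribute: strip all tags, keep the text.
/<name>.*\.mov<\/name>/ || /<attribute key="Name">/ {
    line = $0
    gsub(/<[^>]*>/, "", line)
    sub(/^[ \t]+/, "", line)   # tolerate indented input
    print line
}
' sample_items.xml)

printf '%s\n' "$result"
```

Because the file is read once from top to bottom, the three values of each item come out together and in document order, and an item without a Name attribute simply prints two lines instead of three.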
What I have so far is this:
sed -n '/<item id="/,/>/p' marcherdailiescopy.xml |
awk '{sub("<properties>",""); print}' |
awk '{sub("<",""); print}' |
awk '{sub(">",""); print}'
My first problem is that the sed command returns the item id, but it also returns the tag brackets and the next <properties> tag, followed by the next item, like this:
<item id="27385774">
<properties>
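A likely explanation for the extra line: in a sed address range whose end is a regex, sed only starts looking for the end on the line after the one that matched the start. The ">" on the item line is therefore never tested, and the range closes on the <properties> line instead. The small demonstration below is a sketch on a made-up three-line file (range_demo.xml is a hypothetical name); it compares the range with a single address and with a substitution that strips the brackets directly.

```shell
#!/bin/sh
# Hypothetical three-line input, just enough to show the range behaviour.
printf '%s\n' '<item id="27385774">' '<properties>' '<name>x.mov</name>' > range_demo.xml

# Range form from the post: prints the item line AND the <properties> line.
range_out=$(sed -n '/<item id="/,/>/p' range_demo.xml)

# Single address: prints only the item line, brackets still attached.
single_out=$(sed -n '/<item id="/p' range_demo.xml)

# Substitution with a capture group: prints the item line without brackets.
stripped_out=$(sed -n 's/^[[:space:]]*<\(item id="[^"]*"\)>.*/\1/p' range_demo.xml)

printf 'range:\n%s\n\nsingle:\n%s\n\nstripped:\n%s\n' \
    "$range_out" "$single_out" "$stripped_out"
```

With the substitution form, the three follow-up awk calls that remove "<properties>", "<" and ">" should no longer be needed.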
So I'm using awk to strip out the extra strings and characters there, but I know there is a more efficient way to do this. I also don't know how to get awk or sed to grab the strings I need in order so it places them together. I can get:
item id
item id
item id
...
...
value.mov
value.mov
value.mov
...
...
...
But I need:
item id
value.mov
name (if it is there)
item id
value.mov
name (if it is there)
...
...
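The grouped order above comes for free when everything is extracted in one pass instead of one pass per tag. Here is a sed-only sketch of that idea, under the same assumptions as the sample: one tag per line, and the attribute key spelled "Name". The file name sed_demo.xml and the second item's .mov name are invented for illustration.

```shell
#!/bin/sh
# Hypothetical sample input with two items; the second has no Name attribute.
cat > sed_demo.xml <<'EOF'
<item id="26141427">
<properties>
<name>233D_camB_take02.mov</name>
<created_by id="20184437">
<name>Movie</name>
</created_by>
</properties>
<attributes>
<attribute key="Name">233D-2B</attribute>
<attribute key="Take">2</attribute>
</attributes>
</item>
<item id="27385774">
<properties>
<name>100A_camA_take01.mov</name>
</properties>
</item>
EOF

# Each -e handles one kind of line and prints only when its pattern matched:
#   1. the item tag, without angle brackets
#   2. the <name> whose text contains ".mov"
#   3. the optional Name attribute
#   4. the closing </item>, turned into a blank separator line
grouped=$(sed -n \
  -e 's/^[[:space:]]*<\(item id="[^"]*"\)>.*/\1/p' \
  -e 's/^[[:space:]]*<name>\(.*\.mov\)<\/name>.*/\1/p' \
  -e 's/^[[:space:]]*<attribute key="Name">\(.*\)<\/attribute>.*/\1/p' \
  -e 's/^[[:space:]]*<\/item>.*//p' \
  sed_demo.xml)

printf '%s\n' "$grouped"
```

This grabs only the wanted lines rather than deleting everything else, which keeps the command short because the unwanted tags never have to be listed.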
I also don't know whether it would be more efficient to delete everything other than what I need or grab only what I need. Any help would be Kool and the Gang!
Thanks,
Dan