New to awk and sed: how could I improve this? Multiple sed and awk commands


This is the script I've constructed

  • It takes a list of files according to the extension supplied as an argument.

  • It then removes everything before the pattern 00000000: in those files.

  • The pattern 00000000: is preceded by the string <pre>, so the script then removes those first five characters.
  • The script then removes the last three lines of the file
  • The script then outputs only the hexdump data of the file.
  • The script runs xxd to convert the hexdump to a file.jpg

    if [[ $# -eq 0 ]] ; then
        echo 'Run script as ./hexconv ext'
        exit 0
    fi

    for file in *.$1
    do
        filename=$(basename $file)
        extension="${filename##*.}"
        filename="${filename%.*}"

        sed -n '/00000000:/,$p' $file | sed '1s/^.....//' | head -n -3 | awk '{print $2" "$3" "$4" "$5" "$6" "$7" "$8" "$9" "$10" "$11" "$12" "$13" "$14" "$15" "$16" "$17}' | xxd -p -r > $filename.jpg
    done

It works as I want it to, but I suspect there is room for improvement; alas, I am a novice in the use of awk and sed.

Excerpt from file

<th>response-head:</th>
<td>HTTP/1.1 200 OK
Date: Sun, 15 Dec 2013 04:27:04 GMT
Server: PWS/8.0.18
X-Px: ms h0-s34.p6-lhr ( h0-s35.p6-lhr), ht-d h0-s35.p6-lhr.cdngp.net
Etag: &quot;4556354-9fbf8-4e40387aadfc0&quot;
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0, max-age=0
Accept-Ranges: bytes
Content-Length: 654328
Content-Type: image/jpeg
Last-Modified: Thu, 15 Aug 2013 21:55:19 GMT
Pragma: no-cache
</td>
  </tr>
</table>
<hr/>
<pre>00000000:  ff  d8  ff  e0  00  10  4a  46  49  46  00  01  01  01  00  48  ......JFIF.....H
00000010:  00  48  00  00  ff  e1  00  18  45  78  69  66  00  00  49  49  .H......Exif..II
00000020:  2a  00  08  00  00  00  00  00  00  00  00  00  00  00  ff  ed  *...............
00000030:  00  48  50  68  74  73  68  70  20  33  2e  30  00  .HPhotoshop 3.0.
00000040:  38  42  49  4d  04  04  00  00  00  00  00  1c  01  5a  00  8BIM..........Z.
00000050:  03  1b  25  47  1c  02  00  00  02  00  02  00  38  42  49  4d  ..%G........8BIM
00000060:  04  25  00  00  00  00  00  10  fc  e1  89  c8  b7  c9  78  .%.............x
00000070:    34  62  34  07  58  77  eb  ff  e1  03  a5  68  74  74  70  /4b4.Xw.....http
00000080:  3a    6e  73  2e  61  64  62  65  2e  63  6d  ://ns.adobe.com/
00000090:  78  61  70  31  2e  30  00  3c  78  70  61  63  6b  xap/1.0/.&lt;?xpack
000000a0:  65  74  20  62  65  67  69  6e  3d  22  ef  bb  bf  22  20  69  et begin="..." i
000000b0:  64  3d  22  57  35  4d  30  4d  70  43  65  68  69  48  7a  72  d="W5M0MpCehiHzr
000000c0:  65  53  7a  4e  54  63  7a  6b  63  39  64  22  3e  20  3c  eSzNTczkc9d"?&gt; &lt;
000000d0:  78  3a  78  6d  70  6d  65  74  61  20  78  6d  6c  6e  73  3a  x:xmpmeta xmlns:
000000e0:  78  3d  22  61  64  62  65  3a  6e  73  3a  6d  65  74  61  x="adobe:ns:meta
000000f0:    22  20  78  3a  78  6d  70  74  6b  3d  22  41  64  62  /" x:xmptk="Adob
00000100:  65  20  58  4d  50  20  43  72  65  20  35  2e  30  2d  63  e XMP Core 5.0-c
00000110:  30  36  31  20  36  34  2e  31  34  30  39  34  39  2c  20  32  061 64.140949, 2
00000120:  30  31  30  31  32  30  37  2d  31  30  3a  35  37  3a  010/12/07-10:57:


There are 2 answers

janos On BEST ANSWER

Although @CodeGnome is right and this might belong to Code Review SE, here you go anyway:

  1. Slightly more efficient to combine the multiple sed commands into one, for example:

    sed -n -e 's/^<pre>//' -e '/00000000:/,$p'
    

    I decided to retract this part, as I'm not all that sure it's any better or clearer. Your version is fine, except that s/^<pre>// is better than s/^.....//.

  2. Use exit 1 when checking the number of arguments to signal an error

  3. What is for file in *. there? Iterate for all files ending with a dot? Typo?

  4. Unless you're 100% sure the filenames will never contain spaces, you should quote them, but there's no need to quote where quoting isn't required, for example:

    filename=$(basename "$file")  # need to quote
    extension=${filename##*.}     # no need
    filename=${filename%.*}       # no need
    sed ... "$file"               # need to quote
    ... | xxd > "$filename".jpg   # need to quote
    
  5. The last awk could be shorter and less error prone as a loop:

    ... | awk '{printf "%s", $2; for (i=3; i<=17; ++i) printf " %s", $i; print ""}'
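
To see what such a loop produces, here is a small sketch that runs it on one line in the format of the excerpt from the question. The function name `hex_fields` is an assumption made up for this illustration, not part of the original script, and the sketch passes each field through `printf "%s"` so that a field is never interpreted as a format string.

```shell
#!/usr/bin/env bash
# hex_fields: hypothetical helper name, used only for this illustration.
# Prints fields 2..17 of each input line (the 16 hex bytes) separated by
# single spaces, dropping the offset column ($1) and the ASCII column ($18).
hex_fields() {
    awk '{
        printf "%s", $2
        for (i = 3; i <= 17; ++i) printf " %s", $i
        print ""
    }'
}

# Sample line copied from the excerpt in the question.
line='00000010:  00  48  00  00  ff  e1  00  18  45  78  69  66  00  00  49  49  .H......Exif..II'
printf '%s\n' "$line" | hex_fields
```

For the sample line this prints the sixteen byte values on one line, which is the shape of input that `xxd -p -r` accepts.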
    

It seems you want to learn. You might be interested in this other answer too: What are the rules to write robust shell scripts?

Mark Reed On

The error message should be sent to stderr, should not hard-code the name of the script in case you rename it later, and should exit with a nonzero value.

    if (( ! $# )); then
      echo >&2 "Run script as '$0' \$extension"
      exit 1
    fi

If you're going to put the then on the same line as the if, then you should put the do on the same line as the for, too, for consistency:

    for file in *.$1; do

Using file for the full name and filename for the basename is a confusing choice of variable names. I would use basename for the variable, to match the operation. And you need to quote the parameter expansion:

    basename=$(basename "$file")

But you don't need to quote the right hand side of an assignment:

    extension=${basename##*.}

The part of a filename without the extension is sometimes called the root (in vi and csh :-modifiers, you get it with :r)... using that name would be less confusing than changing an existing variable and reusing it:

    root=${basename%.*}
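
As a quick check of the three assignments above, the sketch below applies them to a made-up path. The path is an assumption, chosen to contain a space so that it shows why `"$file"` needs the quotes while the right-hand sides of the plain assignments do not.

```shell
#!/usr/bin/env bash
# Hypothetical input path; the space shows why "$file" must be quoted.
file='cache dump/entry 01.html'

basename=$(basename "$file")   # strip leading directories    -> "entry 01.html"
extension=${basename##*.}      # drop longest prefix "*."     -> "html"
root=${basename%.*}            # drop shortest suffix ".*"    -> "entry 01"

printf 'basename=%s\nextension=%s\nroot=%s\n' "$basename" "$extension" "$root"
```

Naming a variable `basename` does not clash with the `basename` command, since the shell keeps variables and commands in separate namespaces.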

As far as the actual pipeline, I would reorder it to put the head before the awk, since the sed and the head are all about what lines to print out and should be grouped together before the awk which modifies those selected lines. I would also use a loop and printf to make the awk a little more wieldy:

    sed -n '/0\{8\}:/,$p' "$file" |
      head -n -3 |
      awk '{ printf "%s", $2; for (f=3;f<=17;++f) { printf " %s", $f }; print "" }' |
      xxd -p -r > "$root.jpg"
    done
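
Putting the pieces of this answer together might look like the sketch below. It assumes bash, GNU head (for the negative line count) and xxd; the function names `extract_hex` and `hexconv` are invented for this example, and splitting the text processing from the xxd step is a choice made here so the first half can be checked on its own, not something the original script does.

```shell
#!/usr/bin/env bash
# Sketch only: assumes bash, GNU head (negative line counts) and xxd.
# extract_hex and hexconv are hypothetical names chosen for this example.

# Print the hex-byte columns (fields 2..17) of every line in "$1", starting
# at the line containing offset 00000000: and dropping the file's last
# three lines. The leading <pre> needs no special handling because it is
# part of field 1, which awk discards.
extract_hex() {
    sed -n '/0\{8\}:/,$p' "$1" |
        head -n -3 |
        awk '{ printf "%s", $2; for (f = 3; f <= 17; ++f) printf " %s", $f; print "" }'
}

# Convert every *.EXT file in the current directory to ROOT.jpg.
hexconv() {
    if (( ! $# )); then
        echo >&2 "Run script as '$0' \$extension"
        return 1
    fi
    local file base root
    for file in *."$1"; do
        [[ -e $file ]] || continue      # glob matched nothing
        base=$(basename "$file")
        root=${base%.*}
        extract_hex "$file" | xxd -p -r > "$root.jpg"
    done
}

# When saved as a script, finish with:  hexconv "$@"
```

Keeping the usage check inside a function that returns 1 means the file can also be sourced and the two functions tried interactively.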