Merge lines which don't match a regex

I have a file which contains logs from the web; a simplified version of it is as follows:

en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
Unix
Linux
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
START
Solaris
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
Aix
SCO

I have tried several regexes with awk/sed to identify the Accept-Language value, which is at the beginning of every record:

/^[a-z]{2}(-[A-Z]{2})?/
/\*|[A-Z]{1,8}(-[A-Z0-9]{1,8})*/i  
/([^-;]*)(?:-([^;]*))?(?:;q=([0-9]\.[0-9]))?/

So far I haven't managed to get either awk or sed to produce the following result:

en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;    Unix    Linux
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;    START    Solaris
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;    Aix    SCO

Any help is appreciated. The file contains over 1 million records, so I'm happy to go down a route that doesn't use sed/awk if it improves performance.

There are 3 answers

James Brown (best answer)
$ awk '/^[a-z]{2}-[A-Z]{2}/ { print b; b=$0; next }  # line starts with xx-XX: print the buffer, then refill it
                            { b=b OFS $0 }           # any other line: append it to the buffer
                        END { print b }' file        # at the end, print whatever is left in the buffer

en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd; Unix Linux
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd; START Solaris
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd; Aix SCO

The output starts with an empty line, because the buffer is still empty when the first matching line is read. If you want tab-delimited output, set the output field separator: awk -v OFS="\t" ....
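If the leading empty line is a problem, one possible variation is sketched below. It is not the answer's original command: it assumes the first line of the file is an Accept-Language line, anchors the pattern to the start of the line, and spells the pattern as [a-z][a-z] because some older awk implementations do not enable {2} intervals by default. The sample data and the temporary file are made up for illustration.

```shell
# Hypothetical sample input, written to a temporary file.
sample=$(mktemp)
printf '%s\n' 'en-GB,en;q=0.8  aaa;' 'en-GB,en;q=0.8  bbb;' 'Unix' 'Linux' \
              'en-GB,en;q=0.8  ccc;' 'Aix' > "$sample"

merged=$(awk -v OFS='\t' '
    /^[a-z][a-z]-[A-Z][A-Z]/ {      # a new record starts here
        if (NR > 1) print b         # flush the previous record, but not before line 1
        b = $0                      # start a fresh buffer
        next
    }
    { b = b OFS $0 }                # continuation line: append to the buffer
    END { if (NR > 0) print b }     # flush the last record unless the file was empty
' "$sample")

printf '%s\n' "$merged"
rm -f "$sample"
```

With this input it prints three lines, the second and third carrying their continuation lines separated by tabs.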

Rob Davis

Just for fun, here's a sed solution:

sed -ne 1bgo \
   -e '/^[a-z][a-z]-[A-Z][A-Z]/ { x;p;s/.*//;x; };:go' \
   -e 'H;x;s/^\n//;s/\n/  /;x;${ x;p; }' < input

It works like this:

  • Read each line, but instead of printing it right away, save it by appending it to the hold space (H), then replace the newline that separates it from whatever was already there (x;s/^\n//;s/\n/  /;x). (If you want tabs in your output, put them here where I've put a couple of spaces.)

  • If you come across a line that matches your Accept-Language pattern, flush the hold space before you append anything to it. Print it and clear it (x;p;s/.*//;x). Then proceed as usual with the appending and whatnot.

  • Treat the first and last lines differently from all others: never flush the hold space after reading just the first line (1bgo skips past that, down to the position labeled :go), and always flush the hold space after reading the last line (${ x;p; })
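The three steps above can also be laid out one command per line, which may be easier to follow. This is only a sketch of the same script with comments added; the five-line sample input and the temporary file are invented for the example.

```shell
# Hypothetical sample input, written to a temporary file.
sample=$(mktemp)
printf '%s\n' 'en-GB a' 'Unix' 'Linux' 'en-US b' 'Aix' > "$sample"

merged=$(sed -n '
# Line 1: nothing to flush yet, so jump straight to the append step.
1b go
# A line starting with xx-XX begins a new record: print the hold space and empty it.
/^[a-z][a-z]-[A-Z][A-Z]/ {
    x
    p
    s/.*//
    x
}
:go
# Append the current line to the hold space, then tidy up the joins.
H
x
s/^\n//
s/\n/  /
x
# Last line: print whatever the hold space still contains.
${
    x
    p
}
' "$sample")

printf '%s\n' "$merged"
rm -f "$sample"
```

For this input the result is two lines: "en-GB a  Unix  Linux" and "en-US b  Aix".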

Lars Fischer

Since the two types of lines can be told apart by the presence of an =, you can use this awk script:

file.awk

$0 ~ /=/ { printf("%s%s", v,$0)
           v="\n"
           next
         } 
         { printf("\t%s", $0) } 
END      { printf("\n") }

You use it like this: awk -f file.awk yourfile

  • v is empty for the first line, later it contains the linebreak
  • for lines with an =, we print $0 preceded by v
  • for the other lines (note the next in the first action), we print $0 without the newline but with a \t as separation
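To try this end to end, something like the following should work. The sample lines are made up, and the files are written to a temporary directory purely for illustration; file.awk and yourfile are the names used above.

```shell
# Hypothetical sample input and script, written to a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'en-GB,en-US;q=0.8,en' 'Unix' 'Linux' 'en-GB,en;q=0.5' 'Aix' > "$tmp/yourfile"

cat > "$tmp/file.awk" <<'EOF'
$0 ~ /=/ { printf("%s%s", v, $0)   # line with "=": starts a new output line
           v = "\n"
           next
         }
         { printf("\t%s", $0) }    # any other line: append, separated by a tab
END      { printf("\n") }
EOF

merged=$(awk -f "$tmp/file.awk" "$tmp/yourfile")
printf '%s\n' "$merged"
rm -rf "$tmp"
```

Note that this relies on every Accept-Language line containing an =. A header without a q= value (a bare en-GB, say) would be treated as a continuation line, so check your data before relying on it.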