How to filter data from a flat file with a multi-line pattern using awk or sed?


This is my first post on this site. I have a problem that is probably not very easy to solve with awk or sed. My file contains data like this:

A
B
C
[Start]D
E
F
[/End]
G
...
[Start]H
I
J
[/End]
...
K

And I need the following result:

A
B
C
[Open]D E F[/Close]
G
...
[Open]H I J[/Close]
...
K

For now I have this awk code, which does not work:

BEGIN {
    step=0
}

/[\/End]/ {
    if(step==3) print "[/Close]"
    step=0
}

step==2 {
    print
    step=3
}

step==1{
    print
    step=2
}

/[Start]/ {
    print "[Begin]"
    step=1
}

step=0{
    print
}

Many thanks for your answers. I hope to stay here a little bit longer. Cheers! P.

There are 3 answers

karakfa (best answer)

This awk will do most of it, but it leaves a space before the [/Close]:

awk '/Start/{ORS=FS} /End/{ORS=RS} sub(/Start/,"Open") sub(/End/,"Close") 1' file

It's easy to trim that in another pass (pipe the previous output to this script):

awk 'sub(/ \[/,"[") 1'
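
For reference, here is a minimal sketch of the two passes chained into one pipeline and run on a shortened version of the question's input. The function name join_two_pass is made up for illustration, the sub() calls are moved into an action block for readability, and the second pass is narrowed to the space before [/Close] on the assumption that no other " [" in the data should be touched.

```shell
# join_two_pass is a hypothetical helper name, not part of the answer above.
join_two_pass() {
  # Pass 1: use a space as ORS between [Start] and [/End], and rename the tags.
  # Pass 2: remove the single space that pass 1 leaves before [/Close].
  awk '/Start/{ORS=FS} /End/{ORS=RS} {sub(/Start/,"Open"); sub(/End/,"Close")} 1' "$@" |
    awk '{sub(/ \[\/Close\]/,"[/Close]")} 1'
}

# Shortened sample input taken from the question.
printf '%s\n' A B C '[Start]D' E F '[/End]' G | join_two_pass
```

Note that /Start/ and /End/ are unanchored here, as in the one-liner above, so a data line that merely contains the word "End" would also switch ORS back.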

Ed Morton

$ cat tst.awk
sub(/^\[Start\]/,"[Open]")  { ors=ORS; ORS=OFS }
sub(/^\[\/End\]/,"[Close]") { ORS=ors }
{ print }

$ awk -f tst.awk file
A
B
C
[Open]D E F [Close]
G
...
[Open]H I J [Close]
...
K

If you care about the extra space before each "[Close]", we can do something different, but it'll be a bit more complicated, e.g.:

$ cat tst.awk
sub(/^\[Start\]/,"[Open]")  { f=1; rec=$0; next }
sub(/^\[\/End\]/,"[Close]") { f=0; $0=rec $0 }
f { rec = rec OFS $0; next }
{ print }

$ awk -f tst.awk file
A
B
C
[Open]D E F[Close]
G
...
[Open]H I J[Close]
...
K
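
The question asks for [/Close] (with the slash) as the closing tag. A sketch of the same buffering approach with only the replacement string changed might look like this. The names join_one_pass and in_block are illustrative, and resetting rec after use is an extra precaution that assumes a stray [/End] without a preceding [Start] should not reuse old data.

```shell
# join_one_pass is a hypothetical wrapper name used only for this example.
join_one_pass() {
  awk '
    # Opening tag: rename it, start buffering, print nothing yet.
    sub(/^\[Start\]/, "[Open]")   { in_block = 1; rec = $0; next }
    # Closing tag: rename it and glue it to the buffer without a space.
    sub(/^\[\/End\]/, "[/Close]") { in_block = 0; $0 = rec $0; rec = "" }
    # Inside a block: append the line to the buffer, separated by OFS.
    in_block                      { rec = rec OFS $0; next }
    { print }
  ' "$@"
}

# Shortened sample input taken from the question.
printf '%s\n' A '[Start]D' E F '[/End]' G | join_one_pass
```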

Wintermute

With sed, you could write (GNU sed syntax, for BSD sed see below):

sed '/\[Start\]/ { s//[Open]/; :a \,\[/End\],! { s/\n/ /; N; ba }; s,,[/Close],; s/\n// }' filename

This is to be read as follows:

/\[Start\]/ {        # If a line contains [Start]
  s//[Open]/         # replace it with [Open] (an empty regex reattempts the most
                     # recently used regex, which was \[Start\])
  :a                 # jump label for looping
  \,\[/End\],! {     # Until we find [/End]
    s/\n/ /          # replace newlines with spaces (this does nothing the first
                     # time around, but since we want to replace the last
                     # newline with an empty string rather than a space, we
                     # have to isolate it somehow; this works for that)
    N                # fetch next line, append it to what we already have
    ba               # go back to a
  }
  s,,[/Close],       # replace the [/End] we just found with [/Close]
  s/\n//             # and replace the last newline with nothing, to get the
                     # spaces right.
}

Note that to make this work with BSD sed, the call has to be amended slightly:

 sed -e '/\[Start\]/ { s//[Open]/; :a' -e '\,\[/End\],! { s/\n/ /; N; ba' -e '}; s,,[/Close],; s/\n// }' filename

This is because BSD sed doesn't terminate label names at semicolons the way GNU sed does. Apart from the -e options that split the code after label names, it is the same code.
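
Another way to sidestep the label problem is to put one command per line in a script file and run it with sed -f, since a newline ends a label name in any POSIX sed. The sketch below assumes that behaviour; the temporary script file and the function name join_with_sed are hypothetical, and the sed commands themselves are the same as above.

```shell
# Write the sed script to a temporary file, one command per line.
script=$(mktemp)
cat > "$script" <<'EOF'
/\[Start\]/ {
  s//[Open]/
  :a
  \,\[/End\],! {
    s/\n/ /
    N
    ba
  }
  s,,[/Close],
  s/\n//
}
EOF

# join_with_sed is a made-up helper name; it reads files or standard input.
join_with_sed() {
  sed -f "$script" "$@"
}

# Shortened sample input taken from the question.
printf '%s\n' A '[Start]D' E F '[/End]' G | join_with_sed
```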

Further note that this will only work as long as the [Start] .. [/End] tags are not nested. If they are, you'll want to ditch sed and awk and use at least Perl (which supports recursion in regexes [1]).

[1] Well, it calls them "regular expressions"; it's a bit of a misnomer because they're not limited to regular languages with all the stuff Perl crams into them. The point is: nested tags aren't a regular language anymore, so you need/want that stuff for it.