grep - RegEx multiple-criteria select

Given a file containing this string:

IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@IT1*1*CS*VN*ABC@SAC*X*500@REF*ZZ*BAR@IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@

The goal is to extract the following:

IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@

With the criteria being:

  1. The IT1 "line" must contain *EA*
  2. The REF line must contain BAR

Some notes for consideration:

  • "@" can be thought of as a line break
  • A "group" of lines contains lines starting with IT1 and ending with REF
  • I am running GNU grep 3.7.

The goal is to select the "group" of lines meeting the criteria.

I tried the following:

grep -oP "IT1[^@]*EA[^@]*@.*REF[^@]*BAR[^@]*@" file.txt

But it captures characters from the beginning of the example.

Also tried to use lookarounds:

grep -oP "(?<=IT1[^@]*EA[^@]*@).*?(?=REF[^@]*BAR[^@]*@)" file.txt

But my version of grep returns:

grep: lookbehind assertion is not fixed length

There are 2 answers

Nick (best answer)

Your issue is that .* will match characters from the first IT1 with EA to the last REF with BAR. You need to ensure the match doesn't go past the next IT1, which you can do by replacing .* with a tempered greedy token (?:(?!@IT1).)*:

IT1[^@]*EA[^@]*@(?:(?!@IT1).)*REF[^@]*BAR[^@]*@

This will only match from an IT1 to its corresponding REF.

Regex demo on regex101
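
A complete command built around this pattern might look like the sketch below. The sample data is copied from the question; `sample.txt` and the `extract_groups` function name are placeholders chosen for illustration, and it assumes a GNU grep built with PCRE support (`-P`).

```shell
# Sample input copied from the question; sample.txt is a placeholder file name.
printf '%s' 'IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@IT1*1*CS*VN*ABC@SAC*X*500@REF*ZZ*BAR@IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@' > sample.txt

# extract_groups is a hypothetical helper name used only in this sketch.
# -o prints only the matched text; -P enables PCRE, which the
# negative lookahead (?!@IT1) requires.
extract_groups() {
  grep -oP 'IT1[^@]*EA[^@]*@(?:(?!@IT1).)*REF[^@]*BAR[^@]*@' "$1"
}

extract_groups sample.txt
# prints: IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@
```

The pattern is single-quoted so the shell leaves `!`, `*`, and the parentheses alone; in an interactive bash session, `!` inside double quotes can trigger history expansion.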

Ed Morton

You don't need PCREs for this; a simple POSIX ERE will do, given that each group is exactly three segments (IT1, one middle segment, REF):

$ grep -oE 'IT1[^@]*\*EA\*[^@]*@[^@]*@REF[^@]*BAR[^@]*@' file
IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@

If I had to operate on that data, though, I wouldn't use grep on it like that: the regexp becomes lengthy, and because the input is effectively one long line, grep would read the whole input file into memory at once, so YMMV with large input files.

Instead I'd use awk to treat it as 3-line records of @-separated fields and then you can trivially do whatever you like with the fields and/or whole records, e.g. using GNU awk for multi-char RS and RT:

$ awk -v RS='([^@]*@){3}' -F'@' '{$0=RT} ($1 ~ /EA/) && ($3 ~ /BAR/)' file
IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@

The above only reads 3 @-separated strings at a time into memory and breaks down the input into these records and fields:

$ awk -v RS='([^@]*@){3}' -F'@' 'RT{$0=RT; print; for (i=1; i<=NF; i++) print "\t" i, $i}' file
IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@
        1 IT1*1*EA*VN*ABC
        2 SAC*X*500
        3 REF*ZZ*OK
        4
IT1*1*CS*VN*ABC@SAC*X*500@REF*ZZ*BAR@
        1 IT1*1*CS*VN*ABC
        2 SAC*X*500
        3 REF*ZZ*BAR
        4
IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@
        1 IT1*1*EA*VN*ABC
        2 SAC*X*500
        3 REF*ZZ*BAR
        4
IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@
        1 IT1*1*EA*VN*ABC
        2 SAC*X*500
        3 REF*ZZ*OK
        4

There's an empty field at the end of each record due to the @ at the end of each record. That can be trivially handled however you like (removed, ignored, kept, whatever).

If you don't have GNU awk you can do the same using any awk with just slightly more code:

$ awk -v RS='@' -F'@' '{$0=prev $0 RS; prev=(NR%3 ? $0 : "")} !prev && ($1 ~ /EA/) && ($3 ~ /BAR/)' file
IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@

$ awk -v RS='@' -F'@' '{$0=prev $0 RS; prev=(NR%3 ? $0 : "")} !prev{print; for (i=1; i<=NF; i++) print "\t" i, $i}' file
IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@
        1 IT1*1*EA*VN*ABC
        2 SAC*X*500
        3 REF*ZZ*OK
        4
IT1*1*CS*VN*ABC@SAC*X*500@REF*ZZ*BAR@
        1 IT1*1*CS*VN*ABC
        2 SAC*X*500
        3 REF*ZZ*BAR
        4
IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@
        1 IT1*1*EA*VN*ABC
        2 SAC*X*500
        3 REF*ZZ*BAR
        4
IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@
        1 IT1*1*EA*VN*ABC
        2 SAC*X*500
        3 REF*ZZ*OK
        4
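
The trailing empty field mentioned earlier can be ignored in more than one way. One option, sketched here on top of the any-awk version, is to stop the field loop at NF - 1. The sample data is copied from the question; `sample.txt` and the `print_fields` function name are placeholders for illustration.

```shell
# Sample input copied from the question; sample.txt is a placeholder file name.
printf '%s' 'IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@IT1*1*CS*VN*ABC@SAC*X*500@REF*ZZ*BAR@IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@' > sample.txt

# print_fields is a hypothetical helper name used only in this sketch.
print_fields() {
  awk -v RS='@' -F'@' '
    # Rebuild each 3-segment record from the single @-terminated segments.
    { $0 = prev $0 RS; prev = (NR % 3 ? $0 : "") }
    # When a record is complete, print it and its fields, stopping at
    # NF - 1 so the empty field after the final @ is skipped.
    !prev { print; for (i = 1; i < NF; i++) print "\t" i, $i }
  ' "$1"
}

print_fields sample.txt
```

Each record is then followed by fields 1 to 3 only, with no empty fourth entry.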