grep - RegEx multiple-criteria select

Question

79 views Asked by root At 16 January 2024 at 02:23

Given a file containing this string:

IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@IT1*1*CS*VN*ABC@SAC*X*500@REF*ZZ*BAR@IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@

The goal is to extract the following:

IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@

With the criteria being:

The IT1 "line" must contain *EA*
The REF line must contain BAR

Some notes for consideration:

"@" can be thought of as a line break
A "group" of lines contains lines starting with IT1 and ending with REF
I am running GNU grep 3.7.
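
To make the grouping in these notes visible, the separators can be turned into real newlines. This is only a minimal sketch: it assumes the sample string is held in a shell variable, and the names data and lines are chosen here purely for illustration.

```shell
# Sample input from the question; "data" is an illustrative variable name.
data='IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@IT1*1*CS*VN*ABC@SAC*X*500@REF*ZZ*BAR@IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@'

# Turn every "@" into a newline so each segment sits on its own line.
lines=$(printf '%s' "$data" | tr '@' '\n')

# Print the 12 segments: four groups of IT1 / SAC / REF.
printf '%s\n' "$lines"
```

Viewed this way, the wanted group is the third one: its IT1 line contains *EA* and its REF line contains BAR.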

The goal is to select the "group" of lines meeting the criteria.

I tried the following:

grep -oP "IT1[^@]*EA[^@]*@.*REF[^@]*BAR[^@]*@" file.txt

But the match starts at the first IT1 group in the example and runs through to the last REF line containing BAR, instead of selecting only the one group.

Also tried to use lookarounds:

grep -oP "(?<=IT1[^@]*EA[^@]*@).*?(?=REF[^@]*BAR[^@]*@)" file.txt

But my version of grep returns:

grep: lookbehind assertion is not fixed length

There are 2 answers

Ed Morton On 19 January 2024 at 22:12

You don't need PCREs for this; a simple POSIX ERE will do:

$ grep -oE 'IT1[^@]*EA[^@]*@[^@]*@REF[^@]*BAR[^@]*@' file
IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@

If I had to operate on that data, though, I wouldn't use grep on it like that: the regexp becomes lengthy, and because the input is a single line, grep would read the whole input file into memory at once, so YMMV with large input files.

Instead, I'd use awk to treat the input as 3-line records of @-separated fields; you can then trivially do whatever you like with the fields and/or whole records. For example, using GNU awk for its multi-char RS and RT:

$ awk -v RS='([^@]*@){3}' -F'@' '{$0=RT} ($1 ~ /EA/) && ($3 ~ /BAR/)' file
IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@

The above only reads 3 @-separated strings at a time into memory and breaks down the input into these records and fields:

$ awk -v RS='([^@]*@){3}' -F'@' 'RT{$0=RT; print; for (i=1; i<=NF; i++) print "\t" i, $i}' file
IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@
        1 IT1*1*EA*VN*ABC
        2 SAC*X*500
        3 REF*ZZ*OK
        4
IT1*1*CS*VN*ABC@SAC*X*500@REF*ZZ*BAR@
        1 IT1*1*CS*VN*ABC
        2 SAC*X*500
        3 REF*ZZ*BAR
        4
IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@
        1 IT1*1*EA*VN*ABC
        2 SAC*X*500
        3 REF*ZZ*BAR
        4
IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@
        1 IT1*1*EA*VN*ABC
        2 SAC*X*500
        3 REF*ZZ*OK
        4

There's an empty field at the end of each record due to the @ at the end of each record. That can be trivially handled however you like (removed, ignored, kept, whatever).

If you don't have GNU awk, you can do the same using any awk with just slightly more code:

$ awk -v RS='@' -F'@' '{$0=prev $0 RS; prev=(NR%3 ? $0 : "")} !prev && ($1 ~ /EA/) && ($3 ~ /BAR/)' file
IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@

$ awk -v RS='@' -F'@' '{$0=prev $0 RS; prev=(NR%3 ? $0 : "")} !prev{print; for (i=1; i<=NF; i++) print "\t" i, $i}' file
IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@
        1 IT1*1*EA*VN*ABC
        2 SAC*X*500
        3 REF*ZZ*OK
        4
IT1*1*CS*VN*ABC@SAC*X*500@REF*ZZ*BAR@
        1 IT1*1*CS*VN*ABC
        2 SAC*X*500
        3 REF*ZZ*BAR
        4
IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@
        1 IT1*1*EA*VN*ABC
        2 SAC*X*500
        3 REF*ZZ*BAR
        4
IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@
        1 IT1*1*EA*VN*ABC
        2 SAC*X*500
        3 REF*ZZ*OK
        4
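
The empty fourth field noted above can also be sidestepped by never building an @-joined record in the first place. The following is a minimal sketch rather than part of the original answer: it assumes, as the commands above do, that every group is exactly three @-terminated segments, and the names data, seg and result are illustrative. It tightens the first test to the literal *EA* from the question's criteria and should work in any POSIX awk.

```shell
# Sample input from the question; "data" is an illustrative variable name.
data='IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@IT1*1*CS*VN*ABC@SAC*X*500@REF*ZZ*BAR@IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@'

result=$(printf '%s\n' "$data" | awk -v RS='@' '
    # Store the current segment in slot 1, 2 or 3 of its group.
    { seg[(NR - 1) % 3 + 1] = $0 }
    # After every third segment, test the IT1 and REF slots and print the group.
    NR % 3 == 0 && seg[1] ~ /\*EA\*/ && seg[3] ~ /BAR/ {
        print seg[1] RS seg[2] RS seg[3] RS
    }
')

printf '%s\n' "$result"
```

Because each segment is kept in its own array slot, there is no trailing empty field to strip, and the criteria read directly as "slot 1 contains *EA*" and "slot 3 contains BAR".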

Nick (Accepted Answer) On 16 January 2024 at 03:22

Your issue is that .* is greedy, so it will match characters from the first IT1 with EA through to the last REF with BAR. You need to ensure the match doesn't go past the next IT1, which you can do by replacing .* with a tempered greedy token (?:(?!@IT1).)*:

IT1[^@]*EA[^@]*@(?:(?!@IT1).)*REF[^@]*BAR[^@]*@

This will only match from an IT1 to its corresponding REF.
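
To compare the two behaviours side by side, here is a minimal sketch that runs the question's original pattern and the tempered pattern through grep -oP. It assumes a GNU grep built with PCRE support (the question already uses -P); the variable names are illustrative.

```shell
# Sample input from the question; "data" is an illustrative variable name.
data='IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@IT1*1*CS*VN*ABC@SAC*X*500@REF*ZZ*BAR@IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@'

# The pattern from the question: ".*" is free to run across later IT1 groups.
pattern_greedy='IT1[^@]*EA[^@]*@.*REF[^@]*BAR[^@]*@'

# The tempered pattern: each consumed character must not begin "@IT1".
pattern_tempered='IT1[^@]*EA[^@]*@(?:(?!@IT1).)*REF[^@]*BAR[^@]*@'

greedy=$(printf '%s\n' "$data" | grep -oP "$pattern_greedy")
tempered=$(printf '%s\n' "$data" | grep -oP "$pattern_tempered")

printf 'greedy:   %s\n' "$greedy"
printf 'tempered: %s\n' "$tempered"
```

On the sample data, the greedy pattern returns the first three groups as one match, while the tempered pattern returns only the group whose IT1 line has EA and whose REF line has BAR.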
