Multiple input files - loop through one and check if string contained in second file - output paragraph


I am trying to filter a text file based on a second file. The first file contains paragraphs like:

$ cat paragraphs.txt
# ::id 1
# ::snt what is an example of a 2-step garage album
(e / exemplify-01
      :arg0 (a / amr-unknown)
      :arg1 (a2 / album
            :mod (g / garage)
            :mod (s / step-01
                  :quant 2)))

# ::id 2
# ::snt what is an example of a abwe album
(e / exemplify-01
      :arg0 (a / amr-unknown)
      :arg1 (a2 / album
            :mod (p / person
                  :name (n / name
                        :op1 "abwe"))))

The second file contains a list of strings like this:

$ cat list.txt
# ::snt what is an example of a abwe album
# ::snt what is an example of a acid techno album

I now want to filter the first file and keep only the paragraphs whose snt line is contained in the second file. For the short example above, the output file would look like this (paragraphs separated by an empty line):

$ cat filtered.txt
# ::id 2
# ::snt what is an example of a abwe album
(e / exemplify-01
      :arg0 (a / amr-unknown)
      :arg1 (a2 / album
            :mod (p / person
                  :name (n / name
                        :op1 "abwe"))))

So I tried to loop through the second file and use awk to print the matching paragraphs, but apparently the check does not work (all paragraphs are printed), and the resulting file contains the paragraphs multiple times. Also, the loop does not terminate... I tried this command:

while read line; do awk -v x=$line -v RS= '/x/' paragraphs.txt ; done < list.txt >> filtered.txt

I also tried this plain awk script:

awk -v RS='\n\n' -v FS='\n' -v ORS='\n\n' 'NR==FNR{a[$1];next}{for(i in a)if(index($0,i)) print}' list.txt paragraphs.txt > filtered.txt

But it only takes the first line of the list.txt file into account.

Therefore, I need your help... :-)


UPDATE 1: from comments made by OP:

  • ~526,000 entries in list.txt
  • ~555,000 records in paragraphs.txt
  • all lines of interest start with # ::snt (list.txt, paragraphs.txt)
  • matching will always be performed against the 2nd line of a paragraph (paragraphs.txt)

UPDATE 2: after trying the solutions on the files described in the first update (4th-run timing):

fastest command:

awk -F'\n' 'NR==FNR{list[$0]; next} $2 in list' list.txt RS= ORS='\n\n' paragraphs.txt
time: 8,71s user 0,35s system 99% cpu 9,114 total

second fastest command:

awk 'NR == FNR { a[$0]; next } /^$/ { if (snt in a) print rec; rec = snt = ""; next } /^# ::snt / { snt = $0 } { rec = rec $0 "\n" }' list.txt paragraphs.txt
time: 14,17s user 0,35s system 99% cpu 14,648 total

third fastest command:

awk 'FNR==NR { if (NF) a[$0]; next } /^$/ { if (keep_para) print para; keep_para=0; para=sep="" } $0 in a { keep_para=1 } { para=para $0 sep; sep=ORS } END { if (keep_para) print para }' list.txt paragraphs.txt
time: 15,33s user 0,35s system 99% cpu 15,745 total

There are 3 answers

11
Ed Morton On BEST ANSWER

Using any awk:

$ awk -F'\n' 'NR==FNR{list[$0]; next} $2 in list' list.txt RS= ORS='\n\n' paragraphs.txt
# ::id 2
# ::snt what is an example of a abwe album
(e / exemplify-01
      :arg0 (a / amr-unknown)
      :arg1 (a2 / album
            :mod (p / person
                  :name (n / name
                        :op1 "abwe"))))

I'm setting RS and ORS for the 2nd file only, since that's the one we want to read and print in paragraph mode. I'm setting FS for all input files, which additionally makes reading the first file a bit more efficient because awk then won't waste time splitting each line into fields.

The main problem with your awk script is that you were setting RS and ORS for all input files instead of only for the second one. Also note that RS='\n\n' requires a version of awk that supports multi-char RS, while RS='' will work in any awk; see https://www.gnu.org/software/gawk/manual/gawk.html#Multiple-Line.
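
To illustrate the point about scoping, here is a minimal sketch of how the position of RS= on the command line limits it to the files that follow. The scratch directory and the toy file contents are made up for this illustration and are not the OP's data.

```shell
# Hypothetical toy files in a scratch directory (not the OP's data).
tmp=$(mktemp -d)
printf 'x\ny\nz\n' > "$tmp/list.txt"              # 3 lines
printf 'a\nb\n\nc\nd\n' > "$tmp/paragraphs.txt"   # 2 paragraphs

# RS= appears after list.txt, so list.txt is still read line by line and
# only paragraphs.txt is read in paragraph mode.
counts=$(awk '{ n[FILENAME]++ }
              END { print n[ARGV[1]], n[ARGV[3]] }' \
         "$tmp/list.txt" RS= "$tmp/paragraphs.txt")
echo "$counts"   # record count for list.txt, then for paragraphs.txt

rm -rf "$tmp"
```

With -v RS= instead, the assignment would take effect before any file is read, and list.txt would be read in paragraph mode as well.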

Regarding the while read line; script in your question, see "Why is using a shell loop to process text considered bad practice?" for the issues with doing that. Regarding '/x/', see the example of testing the contents of a shell variable as a regexp in "How do I use shell variables in an awk script?".
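
As a rough sketch of the '/x/' issue, the following compares a regexp literal, a dynamic regexp, and a plain string test. The two input lines and the variable names (input, pat, literal, dynamic, plain) are hypothetical; index() is shown as one possible alternative when the value should not be interpreted as a regexp.

```shell
# Hypothetical sample: one line containing "abwe", one containing the letter "x".
input=$(printf 'abwe album\ntax record\n')
pat='abwe'

# /x/ is a regexp literal: it matches lines containing the letter "x",
# regardless of the value of the awk variable x.
literal=$(printf '%s\n' "$input" | awk -v x="$pat" '/x/')

# $0 ~ x uses the value of x as a dynamic regexp.
dynamic=$(printf '%s\n' "$input" | awk -v x="$pat" '$0 ~ x')

# index() treats the value of x as a plain string rather than a regexp.
plain=$(printf '%s\n' "$input" | awk -v x="$pat" 'index($0, x)')

printf '%s\n' "$literal" "$dynamic" "$plain"
```

Note that -v still processes backslash escape sequences in the assigned value, so strings containing backslashes would need extra care.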

14
markp-fuso On

Assumptions:

  • paragraphs in the paragraphs.txt file are separated by at least one blank line
  • matches are performed on entire lines
  • contents of lines are not known in advance (additional comments from OP negate this assumption)
  • entries from list.txt could appear anywhere in a paragraph (additional comments from OP negate this assumption)

A couple issues with the current code:

  • for the while/awk loop, try replacing /x/ with $0 ~ x, since /x/ matches the literal letter x rather than the contents of the variable x; also wrap the bash variable reference in double quotes (i.e., -v x=$line should be -v x="$line"). That said, a single awk call is going to be more efficient since it requires only a single pass through each file.

  • for the 2nd awk script, -v RS='\n\n' -v FS='\n' -v ORS='\n\n' applies to both input files, so list.txt is not parsed correctly (it is read as paragraphs rather than as individual lines).
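
Putting the second point into practice, one possible repair of the 2nd awk script from the question keeps its substring test but moves the RS/ORS assignments so they apply only to paragraphs.txt. This is a sketch under assumptions: the sample files live in a scratch directory and the AMR bodies are abbreviated to a single line.

```shell
tmp=$(mktemp -d)
cat > "$tmp/list.txt" <<'EOF'
# ::snt what is an example of a abwe album
# ::snt what is an example of a acid techno album
EOF
# Abbreviated paragraphs (the real AMR bodies span several lines).
cat > "$tmp/paragraphs.txt" <<'EOF'
# ::id 1
# ::snt what is an example of a 2-step garage album
(e / exemplify-01)

# ::id 2
# ::snt what is an example of a abwe album
(e / exemplify-01)
EOF

# list.txt is read with the default RS (one entry per line); RS= and ORS
# are assigned only before paragraphs.txt, which is read in paragraph mode.
filtered=$(awk '
    NR == FNR { a[$0]; next }                  # collect list entries
    { for (i in a)                             # substring test against each entry
          if (index($0, i)) { print; break }   # print the paragraph only once
    }
' "$tmp/list.txt" RS= ORS='\n\n' "$tmp/paragraphs.txt")
printf '%s\n' "$filtered"

rm -rf "$tmp"
```

The inner loop scans every list entry for every paragraph, so on inputs of the size mentioned in the question it would likely be far slower than a direct array lookup on the snt line.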

One awk idea:

awk '
FNR==NR { if (NF) a[$0]; next }             # if non-blank line then use entire line as array index
/^$/    { if (keep_para) print para         # blank line: if some part of current paragraph was found in a[] then print paragraph
          keep_para=0; para=""              # reset variables
          next                              # do not store the blank line itself
        }
$0 in a { keep_para=1 }                     # if current line found in a[] then set flag
        { para=para $0 ORS }                # save current line (newline-terminated) as part of current paragraph
END     { if (keep_para) print para }       # flush last paragraph to stdout
' list.txt paragraphs.txt

NOTE: now that some of the original assumptions no longer hold, this generalized approach will be slower than the other answers, which are tailored to the OP's particular data set.

This generates:

# ::id 2
# ::snt what is an example of a abwe album
(e / exemplify-01
      :arg0 (a / amr-unknown)
      :arg1 (a2 / album
            :mod (p / person
                  :name (n / name
                        :op1 "abwe"))))
0
M. Nejat Aydin On

You may try this:

awk '
    NR == FNR { a[$0]; next }
         /^$/ { if (snt in a) print rec; rec = snt = ""; next }
  /^# ::snt / { snt = $0 }
              { rec = rec $0 "\n" }
' list.txt paragraphs.txt

This assumes that records in paragraphs.txt are separated by empty lines and that the last record also ends with an empty line.
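
If the last record might not be followed by an empty line, one possible variation adds an END rule to flush it. The sketch below assumes abbreviated toy records in a scratch directory; the END rule is an addition for illustration and was not part of the timed command.

```shell
tmp=$(mktemp -d)
printf '%s\n' '# ::snt b' > "$tmp/list.txt"
# Two abbreviated records; the file deliberately lacks a trailing empty line.
printf '%s\n' '# ::id 1' '# ::snt a' '(x)' '' \
              '# ::id 2' '# ::snt b' '(y)' > "$tmp/paragraphs.txt"

out=$(awk '
    NR == FNR   { a[$0]; next }
    /^$/        { if (snt in a) print rec; rec = snt = ""; next }
    /^# ::snt / { snt = $0 }
                { rec = rec $0 "\n" }
    END         { if (snt in a) printf "%s", rec }   # flush a final record that has no trailing empty line
' "$tmp/list.txt" "$tmp/paragraphs.txt")
printf '%s\n' "$out"

rm -rf "$tmp"
```

When the file does end with an empty line, snt has already been reset by the time END runs, so the record is not printed twice.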