I have a huge CSV file of pairs of unique combinations. I need to extract N random lines from this file, with the constraint that the strings in columns 1 and 2 are unique across the selected lines, so that the combined list of unique strings from columns 1 and 2 has exactly 2*N entries.
For example, given this input file:
A,B,0.1747
B,C,0.373
C,D,0.585
E,J,0.8585
E,A,0.5657
F,A,0.5656
A possible desired output for 3 random lines would be:
A,B,0.1747
C,D,0.585
E,J,0.8585
or
B,C,0.373
E,J,0.8585
F,A,0.5656
So far I have tried to reduce the file to rows with unique strings in column 1 and column 2 separately, using:
shuf file.csv | awk -F',' '!a[$1]++' | awk -F',' '!a[$2]++'
to get
B,C,0.373
E,A,0.5657
A,B,0.1747
C,D,0.585
I don't think this is the correct approach, though: in the output above, the strings A, B and C each still appear twice across the two columns. Is there a way to loop through the rows and skip a row if either of its first two strings is already present in a previously kept row?
When you use awk twice there is usually a better solution with only one awk script. You can try:
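The answer's original code block appears to have been lost in extraction; a sketch of what a single awk script could look like, keeping the strings from both columns in one shared array (the sample data and the file name file.csv are taken from the question):

```shell
# Sample input from the question, recreated so the snippet is self-contained
printf '%s\n' 'A,B,0.1747' 'B,C,0.373' 'C,D,0.585' \
              'E,J,0.8585' 'E,A,0.5657' 'F,A,0.5656' > file.csv

# One awk script, one array: print a shuffled line only when neither
# field has been seen before, then mark both fields as seen.
shuf file.csv | awk -F',' '!seen[$1] && !seen[$2] { seen[$1]=1; seen[$2]=1; print }'
```

To limit the result to N lines, pipe the output through head -n N; note that the filter may yield fewer than N lines if the shuffle exhausts the unique strings early.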
A variant uses the in operator and next.

Note: your example has no input lines with identical first and second fields. But if this can happen and you want to skip these lines, you can additionally test whether $1 == $2 and skip such lines with next as well.
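The original scripts were not preserved; a sketch of the in/next variant, together with the extra guard for identical fields (again assuming the question's file.csv):

```shell
# Sample input from the question, recreated so the snippet is self-contained
printf '%s\n' 'A,B,0.1747' 'B,C,0.373' 'C,D,0.585' \
              'E,J,0.8585' 'E,A,0.5657' 'F,A,0.5656' > file.csv

# Skip a line with `next` as soon as either field is already in the array.
# A bare reference like seen[$1] creates the key, so later `in` tests match.
shuf file.csv | awk -F',' '($1 in seen) || ($2 in seen) { next }
                           { seen[$1]; seen[$2]; print }'

# If lines with identical first and second fields can occur and should be
# skipped, test for that condition first:
shuf file.csv | awk -F',' '$1 == $2 || ($1 in seen) || ($2 in seen) { next }
                           { seen[$1]; seen[$2]; print }'
```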