I have a huge CSV file of pairs of unique combinations. I need to extract N random lines from this file, with the constraint that the strings in columns 1 and 2 are unique across the selected lines, so that the combined list of unique strings from columns 1 and 2 has exactly 2*N entries.
For example, given this input file:
A,B,0.1747
B,C,0.373
C,D,0.585
E,J,0.8585
E,A,0.5657
F,A,0.5656
A possible desired output for 3 random lines would be:
A,B,0.1747
C,D,0.585
E,J,0.8585
or
B,C,0.373
E,J,0.8585
F,A,0.5656
So far I have tried to reduce the file to rows with unique strings in column 1 and column 2 separately, using:
shuf file.csv | awk -F',' '!a[$1]++' | awk -F',' '!a[$2]++'
to get
B,C,0.373
E,A,0.5657
A,B,0.1747
C,D,0.585
I don't think this is the correct approach, though: in the output above, the strings A, B and C each still appear twice across the two columns. Is there a way to loop through the rows and skip a row if either of its first two strings is already present in a previously kept row?
When you use awk twice there is usually a better solution with only one awk script. You can try:
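The answer's original code block appears to have been lost in extraction; a sketch of what a single awk script could look like, keeping the strings from both columns in one shared array (the sample data and the file name file.csv are taken from the question):

```shell
# Sample input from the question, recreated so the snippet is self-contained
printf '%s\n' 'A,B,0.1747' 'B,C,0.373' 'C,D,0.585' \
              'E,J,0.8585' 'E,A,0.5657' 'F,A,0.5656' > file.csv

# One awk script, one array: print a shuffled line only when neither
# field has been seen before, then mark both fields as seen.
shuf file.csv | awk -F',' '!seen[$1] && !seen[$2] { seen[$1]=1; seen[$2]=1; print }'
```

To limit the result to N lines, pipe the output through head -n N; note that the filter may yield fewer than N lines if the shuffle exhausts the unique strings early.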
A variant uses the in operator and next.

Note: your example has no input lines with identical first and second fields. But if this can happen and you want to skip these lines, you can additionally test whether $1 == $2 and skip such lines with next as well.
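The original scripts were not preserved; a sketch of the in/next variant, together with the extra guard for identical fields (again assuming the question's file.csv):

```shell
# Sample input from the question, recreated so the snippet is self-contained
printf '%s\n' 'A,B,0.1747' 'B,C,0.373' 'C,D,0.585' \
              'E,J,0.8585' 'E,A,0.5657' 'F,A,0.5656' > file.csv

# Skip a line with `next` as soon as either field is already in the array.
# A bare reference like seen[$1] creates the key, so later `in` tests match.
shuf file.csv | awk -F',' '($1 in seen) || ($2 in seen) { next }
                           { seen[$1]; seen[$2]; print }'

# If lines with identical first and second fields can occur and should be
# skipped, test for that condition first:
shuf file.csv | awk -F',' '$1 == $2 || ($1 in seen) || ($2 in seen) { next }
                           { seen[$1]; seen[$2]; print }'
```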