Count duplicates from several files

I have five files which contain some duplicate strings.

file1:

a

file2:

b

file3:

a
b

file4:

b

file5:

c

So I used awk 'NR==FNR{A[$0];next}$0 in A' file1 file2 file3 file4 file5

It prints only a, even though the string b is repeated 3 times across the other files. (The NR==FNR block loads only file1 into A, so the command reports just the lines of the remaining files that also appear in file1; b never gets a chance to match.)

So how do I get every repeated string (a and b) by analysing/comparing all the files against each other with a one-line command? And how do I get the number of repeats for each string?
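In other words, the expected output would be something like this (a occurs 2 times, b occurs 3 times):

a 2
b 3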

There are 3 answers

Answered by Mustafa DOGRU (accepted)

You can use one of these:

awk '{count[$0]++}END{for (a in count) {if (count[a] > 1 ) {print a}}}' file1 file2 file3 file4 file5

or, to print each duplicate once (at its second occurrence), in input order:

awk 'seen[$0]++ == 1' file1 file2 file3 file4 file5
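For readability, the first one-liner can also be spread over several lines with comments (functionally identical):

awk '
    { count[$0]++ }                # tally every input line across all files
    END {
        for (line in count)        # walk the tally table
            if (count[line] > 1)   # keep only lines seen more than once
                print line
    }
' file1 file2 file3 file4 file5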

You can also filter on exact counts; with the files above, a occurs 2 times and b occurs 3 times:

awk '{count[$0]++} END {for (line in count) if ((count[line] == 2 && line == "a") || (count[line] == 3 && line == "b")) print line, count[line]}' file1 file2 file3 file4 file5

Test:

$ awk '{count[$0]++}END{for (a in count) {if (count[a] > 1 ) {print a}}}' file1 file2 file3 file4 file5
a
b


$ awk 'seen[$0]++ == 1' file1 file2 file3 file4 file5
a
b

$ awk '{count[$0]++} END {for (line in count) if ((count[line] == 2 && line == "a") || (count[line] == 3 && line == "b")) print line, count[line]}' file1 file2 file3 file4 file5
a 2
b 3

Answered by Cyrus

I suggest GNU sort and uniq:

sort file[1-5] | uniq -dc

Output:

2 a
3 b

From man uniq:

-d: only print duplicate lines

-c: prefix lines by the number of occurrences
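If you want the string first and the count second, like the awk answers print it, the result can be post-processed; a small sketch (it assumes the strings contain no whitespace):

sort file[1-5] | uniq -dc | awk '{ print $2, $1 }'

which prints:

a 2
b 3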

Answered by James Brown

In awk:

$ awk '{ a[$1]++ } END { for(i in a) if(a[i]>1) print i,a[i] }' file[1-5]
a 2
b 3

It counts the occurrences of each record (a single character in this case) and prints the ones whose count is greater than one.
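If "repeated" should mean "appears in more than one file" rather than "appears more than once in total", a variation on the same idea works; a sketch that counts each string at most once per file using FILENAME:

awk '!seen[FILENAME, $0]++ { files[$0]++ }   # first sighting of this line in this file
     END { for (line in files)
               if (files[line] > 1)          # present in at least two files
                   print line, files[line] }' file[1-5]

With the sample files the output happens to be identical (a 2, b 3) because no file lists the same string twice; the two approaches differ only when a string repeats inside a single file.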