Remove duplicates ignoring specific columns


I want to remove all duplicate lines from a file, but ignoring the first 2 columns, i.e. not comparing those columns.

This is my example input:

111  06:22  apples, bananas and pears
112  06:28  bananas
113  07:07  apples, bananas and pears
114  07:23  apples and bananas
115  08:01  bananas and pears
116  08:23  pears
117  09:22  apples, bananas and pears
118  12:23  apples and bananas

I want this output:

111  06:22  apples, bananas and pears
112  06:28  bananas
114  07:23  apples and bananas
115  08:01  bananas and pears
116  08:23  pears

I've tried this below, but it only compares the third column and ignores the rest of the line:

awk '!seen[$3]++' sample.txt

There are 2 answers

konsolebox (BEST ANSWER)

Store $0 in a temporary variable, set $1 and $2 to empty strings, then use the newly composed $0 as the key:

awk '{ t = $0; $1 = $2 = "" } !seen[$0]++ { print t }' sample.txt
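
One caveat: assigning to $1 and $2 makes awk rebuild $0, joining all fields with OFS, so the key loses the original spacing (usually harmless here). If you prefer to leave $0 untouched, a variant of the same idea (a minimal sketch, not part of the answer) builds the key from fields 3 through NF in a loop:

awk '{ key = ""; for (i = 3; i <= NF; i++) key = key FS $i } !seen[key]++' sample.txt

Since $0 is never modified, the default print emits each kept line exactly as it appeared in the input.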
Daweo

You might use the substr string function to get the desired part of the line for comparison. Let file.txt contain:

111  06:22  apples, bananas and pears
112  06:28  bananas
113  07:07  apples, bananas and pears
114  07:23  apples and bananas
115  08:01  bananas and pears
116  08:23  pears
117  09:22  apples, bananas and pears
118  12:23  apples and bananas

then

awk '!arr[substr($0,11)]++' file.txt

gives output

111  06:22  apples, bananas and pears
112  06:28  bananas
114  07:23  apples and bananas
115  08:01  bananas and pears
116  08:23  pears

Explanation: keep lines that are unique in the substring of the whole line ($0) starting at the 11th character, i.e. everything after the first two fixed-width columns.

(tested in GNU Awk 5.0.1)
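
Note that substr($0,11) assumes the first two columns always occupy exactly 10 characters. If the column widths can vary, a sketch of an alternative (my own assumption, not from the answer) strips the first two whitespace-separated fields from a copy of the line using the three-argument sub():

awk '{ key = $0; sub(/^[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+/, "", key) } !seen[key]++' file.txt

Here key starts as the whole line and sub() deletes the leading two fields plus their separators, so the comparison works regardless of how the columns are padded, while the original line is still printed unchanged.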