Comparing contents of text files ignoring order and format


I have two text files whose contents I need to compare: one of them is missing 2 items that the other has, but I can't tell which ones because the files are long. I've tried diff and vimdiff with no luck. Both files are formatted like this, with the items in a jumbled order:

item1    item2    item3
item8    item10   item6
item32   item12   item7

How can I pick out which items one of the text files has but the other lacks, while ignoring the format and order?


There are 3 answers

VIRA On

I believe you can use the comm command, but both files must be sorted before you compare them:

comm -23 f1 f2 # lines that are only in f1 (no match in f2)
comm -12 f1 f2 # lines that are in both files
comm -13 f1 f2 # lines that are only in f2 (no match in f1)
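
As a minimal sketch of the sorting step (the sample contents of f1 and f2 below are made up for illustration, and the .sorted file names are an assumption), each file can be sorted into a temporary copy before it is handed to comm. Note that this compares whole lines, so for the layout in the question the items would first have to be split onto separate lines.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
work=$(mktemp -d)

# Made-up sample data: one entry per line, deliberately unsorted.
printf '%s\n' pear apple fig   > "$work/f1"
printf '%s\n' fig banana apple > "$work/f2"

# comm expects sorted input; LC_ALL=C makes sort and comm agree on the order.
LC_ALL=C sort "$work/f1" > "$work/f1.sorted"
LC_ALL=C sort "$work/f2" > "$work/f2.sorted"

LC_ALL=C comm -23 "$work/f1.sorted" "$work/f2.sorted"   # lines only in f1
LC_ALL=C comm -13 "$work/f1.sorted" "$work/f2.sorted"   # lines only in f2
LC_ALL=C comm -12 "$work/f1.sorted" "$work/f2.sorted"   # lines in both files
```

With this sample data the three commands should print pear, then banana, then apple and fig.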
Rahul Verma On

Use comm to compare your files and find what is common to them or distinct in each.

$ cat file1
item1    item2    item3
item8    item10   item6
item32   item12   item5

$ cat file2
item1    item2    item3
item8    item15   item6
item32   item12   item7

comm -23 file1 file2 returns lines which are in file1 but not in file2
comm -13 file1 file2 returns lines which are in file2 but not in file1
comm -12 file1 file2 returns lines which are common to both files

comm requires its input files to be sorted. We'll first convert each run of spaces to a newline (\n) via sed, so that every item sits on its own line, and then sort the result via sort.

$ comm -23 <(sed 's/ \+/\n/g' file1 | sort ) <(sed 's/ \+/\n/g' file2 | sort)
item10
item5

$ comm -13 <(sed 's/ \+/\n/g' file1 | sort ) <(sed 's/ \+/\n/g' file2 | sort)
item15
item7

$ comm -12 <(sed 's/ \+/\n/g' file1 | sort ) <(sed 's/ \+/\n/g' file2 | sort)
item1
item12
item2
item3
item32
item6
item8

-- My answer ends here. ---

For reference, the man page of comm says:

   With no options, comm produces three-column output.  Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.

   -1     suppress column 1 (lines unique to FILE1)

   -2     suppress column 2 (lines unique to FILE2)

   -3     suppress column 3 (lines that appear in both files)

Therefore:

$ comm  <(sed 's/ \+/\n/g' file1 | sort ) <(sed 's/ \+/\n/g' file2 | sort)
                item1
item10
                item12
        item15
                item2
                item3
                item32
item5
                item6
        item7
                item8
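
The sed expression above relies on GNU extensions (\+ in the pattern and \n in the replacement), and it only splits on spaces. As a hedged alternative, here is a sketch that splits on any whitespace with tr instead. The helper name items and the .items file names are hypothetical; the sample data is the file1/file2 pair from this answer. It uses temporary files rather than process substitution, so it should also run in a plain POSIX sh.

```shell
# items FILE: print every whitespace-separated item of FILE on its own line,
# sorted. "items" is a hypothetical helper name, not part of the answer above.
items() {
    # tr turns each whitespace character into a newline and squeezes repeats;
    # sed drops the empty line that leading whitespace would leave behind.
    tr -s '[:space:]' '\n' < "$1" | sed '/^$/d' | LC_ALL=C sort
}

# Recreate the sample files from this answer in a scratch directory.
work=$(mktemp -d)
printf '%s\n' 'item1    item2    item3' \
              'item8    item10   item6' \
              'item32   item12   item5' > "$work/file1"
printf '%s\n' 'item1    item2    item3' \
              'item8    item15   item6' \
              'item32   item12   item7' > "$work/file2"

items "$work/file1" > "$work/file1.items"
items "$work/file2" > "$work/file2.items"

LC_ALL=C comm -23 "$work/file1.items" "$work/file2.items"   # only in file1
LC_ALL=C comm -13 "$work/file1.items" "$work/file2.items"   # only in file2
```

On the sample data this should report item10 and item5 as unique to file1, and item15 and item7 as unique to file2, matching the sed-based output.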
markp-fuso On

Cyrus' example is by far shorter and more to the point, but thought I'd practice some (verbose) awking ...

Sample data:

$ cat file1
         item2    item3
item8    item10   item6
item32   item12   item7

$ cat file2
item1    item2    item3
item8             item6
         item12   item7

Assumptions:

  • while the description said some items could be missing from one file, I'm going to assume there could be items missing from both files
  • not going to worry about sorting (for input or output)
  • without guidance on how to display output I'll just do my own thang, to include displaying the name of the file that the item is missing from

One possible awk solution:

$ cat text.awk
BEGIN { RS="" }

NR==FNR { afile=FILENAME ; for (i=1;i<=NF;i++) a[$i]=1 ; next }
        { bfile=FILENAME ; for (i=1;i<=NF;i++) b[$i]=1        }

END {
    for (x in a)
        { if ( ! b[x] )
             { printf "missing from %s : %s\n",bfile,x }
        }
    for (x in b)
        { if ( ! a[x] )
             { printf "missing from %s : %s\n",afile,x }
        }
}
  • RS="" : set the record separator (RS) to the empty string; this puts awk in paragraph mode (records are separated by blank lines), so a file with no blank lines is read as one long record
  • NR==FNR : if this is the first (of two) files ...
  • afile=FILENAME : save filename for later printing
  • for/a[$i]=1 : use input fields 1-NF as indexes for associative array a, setting array value to 1 (aka 'true')
  • next : read next record, which in this case means read next file
  • second block (no pattern, so in effect NR!=FNR) : if this is the second (of two) files ...
  • same processing except populate bfile and associative array b
  • END ... : process our arrays ...
  • for (x in a) : loop through the indexes of array a and assign to variable x, and if there's no comparable indexed entry in array b (! b[x]) then print a message about the array index (actually name of item from original file) missing from bfile
  • for (x in b) : same as previous loop except checking for items in bfile but not in afile

This awk script in action:

$ awk -f text.awk file1 file2
missing from file2 : item10
missing from file2 : item32
missing from file1 : item1

# switch the order of the input files => same messages, just different order
$ awk -f text.awk file2 file1
missing from file1 : item1
missing from file2 : item10
missing from file2 : item32
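
One detail of text.awk worth knowing: the test ! b[x] references b[x], which creates an empty entry for x in b as a side effect. That is harmless in this script, but awk's in operator tests membership without creating anything. The sketch below is a variant along those lines, not the author's script: missing_items is a hypothetical wrapper name, it reads the files line by line instead of setting RS, and its output is piped through sort because for (x in ...) visits keys in an unspecified order.

```shell
# missing_items FILE1 FILE2: hypothetical wrapper around a variant of text.awk.
missing_items() {
    awk '
        # First file: remember its name and record every field as a key of a.
        NR == FNR { afile = FILENAME; for (i = 1; i <= NF; i++) a[$i] = 1; next }
        # Second file: same bookkeeping, stored in b.
                  { bfile = FILENAME; for (i = 1; i <= NF; i++) b[$i] = 1 }
        END {
            # (x in b) checks membership without creating b[x].
            for (x in a) if (!(x in b)) printf "missing from %s : %s\n", bfile, x
            for (x in b) if (!(x in a)) printf "missing from %s : %s\n", afile, x
        }
    ' "$1" "$2"
}

# Recreate the sample data from this answer in a scratch directory.
work=$(mktemp -d)
printf '%s\n' '         item2    item3' \
              'item8    item10   item6' \
              'item32   item12   item7' > "$work/file1"
printf '%s\n' 'item1    item2    item3' \
              'item8             item6' \
              '         item12   item7' > "$work/file2"

# Run inside the scratch directory so FILENAME is just file1 / file2.
(cd "$work" && missing_items file1 file2 | LC_ALL=C sort)
```

On the sample data this should list item1 as missing from file1, and item10 and item32 as missing from file2. The sketch assumes both files are non-empty; an empty first file would defeat the NR==FNR test.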