Unix: find all lines having timestamps in both time series?


I have time-series data in two files and would like to find all lines whose timestamps occur in both files; the values after the timestamp may differ (match only up to the first tab). In the vimdiff screenshots below you can see that I would like to get rid of the days that occur in only one of the two time series.

I am looking for the simplest unix tool to do this!

(vimdiff screenshots of the two time series.)

The time series files are here and here.

Simple example

Input

Left file                            Right File    
------------------------             ------------------------
10-Apr-00     00:00    0     ||      10-Apr-00     00:00     7
20-Apr-00     00:00    7     ||      21-Apr-00     00:00     3

Output

Left file                           Right File    
------------------------            ------------------------
10-Apr-00     00:00    0    ||      10-Apr-00     00:00     7

There are 3 answers

Answer by John1024 (0 votes)

Let's consider these sample input files:

$ cat file1
10-Apr-00       00:00   0
20-Apr-00       00:00   7
$ cat file2
10-Apr-00       00:00   7
21-Apr-00       00:00   3

To merge together those lines with the same date:

$ awk 'NR==FNR{a[$1]=$0;next;} {if ($1 in a) print a[$1]"\t||\t"$0;}' file1 file2
10-Apr-00       00:00   0       ||      10-Apr-00       00:00   7

Explanation

  • NR==FNR{a[$1]=$0;next;}

    NR is the number of lines read so far and FNR is the number of lines read so far from the current file. So, when NR==FNR, we are still reading the first file. If so, save this whole line, $0, in array a under the key of the first field, $1, which is the date. Then, skip the rest of the commands and jump to the next line.

  • if ($1 in a) print a[$1]"\t||\t"$0

    If we get here, then we are reading the second file, file2. If the first field on this line, $1 is a date that we already saw in file1, in other words, if $1 in a, then print this line out together with the corresponding line from file1. The two lines are separated by tab-||-tab.

Alternative Output

If you just want to select lines from file2 whose dates are also in file1, then the code can be simplified:

$ awk 'NR==FNR{a[$1]++;next;} {if ($1 in a) print;}' file1 file2
10-Apr-00       00:00   7

Or, still simpler:

$ awk 'NR==FNR{a[$1]++;next;} ($1 in a)' file1 file2
10-Apr-00       00:00   7
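
The question's desired output also has the left file reduced to the common dates. That is not part of this answer, but the same idea can simply be run twice with the file order swapped; a minimal sketch (the output file names are just placeholders):

$ awk 'NR==FNR{a[$1]++;next;} ($1 in a)' file1 file2 > file2.filtered
$ awk 'NR==FNR{a[$1]++;next;} ($1 in a)' file2 file1 > file1.filtered

The first command keeps the file2 lines whose dates appear in file1; the second does the reverse.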
Answer by Lars Fischer (3 votes)

There is the relatively unknown unix command join. It can join sorted files on a key column.

To use it in your context, we follow this strategy (left.txt and right.txt are your files):

  1. add line numbers (so the original order can be restored in the last step)

    nl left.txt > left_with_lns.txt
    nl right.txt > right_with_lns.txt
    
  2. sort both files on the date column

    sort left_with_lns.txt -k 2 > sl.txt
    sort right_with_lns.txt -k 2 > sr.txt
    
  3. join the files on the date column (all times are 00:00). join would normally merge all columns of both files for each matching key, but we give it an output template (-o) so that the columns from the first file go to fl.txt and the columns from the second file to fr.txt; only lines with a matching key end up in the results

    join -j 2 -t $'\t' -o 1.1,1.2,1.3,1.4 sl.txt sr.txt > fl.txt
    join -j 2 -t $'\t' -o 2.1,2.2,2.3,2.4 sl.txt sr.txt > fr.txt
    
  4. sort both results numerically on the line-number column and cut that column away to get the final output

    sort -n fl.txt | cut -f 2- > left_filtered.txt
    sort -n fr.txt | cut -f 2- > right_filtered.txt
    

Tools used: cut, join, nl, sort.
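
For convenience, the four steps could be collected into one small script. This is only a sketch, assuming GNU coreutils, tab-separated input files named left.txt and right.txt, and bash (for the $'\t' syntax):

    #!/usr/bin/env bash
    # 1. Number the lines so the original order can be restored at the end.
    nl left.txt  > left_with_lns.txt
    nl right.txt > right_with_lns.txt

    # 2. Sort both files on the date column (field 2, after the line number).
    sort -k 2 left_with_lns.txt  > sl.txt
    sort -k 2 right_with_lns.txt > sr.txt

    # 3. Keep only rows whose date occurs in both files; write the
    #    left-hand columns to fl.txt and the right-hand columns to fr.txt.
    join -j 2 -t $'\t' -o 1.1,1.2,1.3,1.4 sl.txt sr.txt > fl.txt
    join -j 2 -t $'\t' -o 2.1,2.2,2.3,2.4 sl.txt sr.txt > fr.txt

    # 4. Restore the original order and drop the helper line numbers.
    sort -n fl.txt | cut -f 2- > left_filtered.txt
    sort -n fr.txt | cut -f 2- > right_filtered.txt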

Answer by Ruud Helderman (3 votes)

As requested by @Masi, I tried to work out a solution using sed.

My first attempt uses two passes; the first transforms file1 into a sed script that is used in the second pass to filter file2.

sed 's/\([^ \t]*\).*/\/^\1\t\/p;t/' file1 > sed1
sed -nf sed1 file2 > out2
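
With the sample file1 from John1024's answer, the generated sed1 would contain one print command per date, roughly like this (\t denoting a tab):

/^10-Apr-00\t/p;t
/^20-Apr-00\t/p;t

The second command, sed -nf sed1 file2, then prints only the file2 lines whose first field matches one of those dates.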

With big input files, this is s-l-o-w; for every line of file2, sed has to try as many patterns as there are lines in file1. I haven't done any profiling, but I wouldn't be surprised if the time complexity is quadratic.

My second attempt merges and sorts the two files, then scans through all lines in search of pairs. This runs in linear time and consequently is a lot faster. Please note that this solution will ruin the original order of the file; alphabetical sorting doesn't work too well with this date notation. Supplying files with a different date format (y-m-d) would be the easiest way to fix that.

sed 's/^[^ \t]\+/&@1/' file1 > marked1
sed 's/^[^ \t]\+/&@2/' file2 > marked2

sort marked1 marked2 > sorted

sed '$d;N;/^\([^ \t]\+\)@1.*\n\1@2/{s/\(.*\)\n\(.*\)/\2\n\1/;P};D' sorted > filtered
sed 's/^\([^ \t]\+\)@2/\1/' filtered > out2
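
To illustrate with the same sample files: after marking and sorting, the intermediate file sorted would look roughly like this (equal dates grouped together, the marker telling the files apart):

10-Apr-00@1     00:00   0
10-Apr-00@2     00:00   7
20-Apr-00@1     00:00   7
21-Apr-00@2     00:00   3

The two 10-Apr-00 lines form a pair, so the third sed command emits the @2 line, and out2 ends up containing only the 10-Apr-00 line from file2.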

Explanation:

  • In the first command, s/^[^ \t]\+/&@1/ appends @1 to every date. This makes it possible to merge the files, keep equal dates together when sorting, and still be able to tell lines from different files apart.
  • The second command does the same for file2; obviously with its own marker @2.
  • The sort command merges the two files, grouping equal dates together.
  • The third sed command returns all lines from file2 that have a date that also occurs in file1.
  • The fourth sed command removes the @2 marker from the output.

The third sed command in detail:

  • $d suppresses inappropriate printing of the last line
  • N reads and appends another line of input to the line already present in the pattern space
  • /^\([^ \t]\+\)@1.*\n\1@2/ matches two lines originating from different files but with the same date
  • { starts a command group
  • s/\(.*\)\n\(.*\)/\2\n\1/ swaps the two lines in the pattern space
  • P prints the first line in the pattern space
  • } ends the command group
  • D deletes the first line from the pattern space
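
For readability, the same third command could be kept in a script file and run with sed -f; a sketch (the file name pairs.sed is just a placeholder):

# pairs.sed: from the sorted, marked stream, print the file2 line of
# every date that occurs in both files.
$d
N
/^\([^ \t]\+\)@1.*\n\1@2/{
  s/\(.*\)\n\(.*\)/\2\n\1/
  P
}
D

Running sed -f pairs.sed sorted > filtered behaves like the one-liner above.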

The bad news is, even the second approach is slower than the awk approach made by @John1024. Sed was never designed to be a merge tool. Neither was awk, but awk has the advantage of being able to store an entire file in a dictionary, making @John1024's solution blazingly fast. The downside of a dictionary is memory consumption. On huge input files, my solution should have the advantage.