Unix: find all lines having timestamps in both time series?


I have time-series data in two files and would like to find all lines whose timestamps occur in both files; the values after the timestamp may differ (match only up to the first tab). In the vimdiff screenshots below you can see that I would like to get rid of the days that occur in only one of the two time series.

I am looking for the simplest unix tool to do this!

(vimdiff screenshots of the two time series.)

The time series files are here and here.

Simple example

Input

Left file                            Right File    
------------------------             ------------------------
10-Apr-00     00:00    0     ||      10-Apr-00     00:00     7
20-Apr-00     00:00    7     ||      21-Apr-00     00:00     3

Output

Left file                           Right File    
------------------------            ------------------------
10-Apr-00     00:00    0    ||      10-Apr-00     00:00     7

There are 3 answers

Answer by John1024 (0 votes)

Let's consider these sample input files:

$ cat file1
10-Apr-00       00:00   0
20-Apr-00       00:00   7
$ cat file2
10-Apr-00       00:00   7
21-Apr-00       00:00   3

To merge together those lines with the same date:

$ awk 'NR==FNR{a[$1]=$0;next;} {if ($1 in a) print a[$1]"\t||\t"$0;}' file1 file2
10-Apr-00       00:00   0       ||      10-Apr-00       00:00   7

Explanation

  • NR==FNR{a[$1]=$0;next;}

    NR is the number of lines read so far and FNR is the number of lines read so far from the current file. So, when NR==FNR, we are still reading the first file. If so, save this whole line, $0, in array a under the key of the first field, $1, which is the date. Then, skip the rest of the commands and jump to the next line.

  • if ($1 in a) print a[$1]"\t||\t"$0

    If we get here, then we are reading the second file, file2. If the first field on this line, $1 is a date that we already saw in file1, in other words, if $1 in a, then print this line out together with the corresponding line from file1. The two lines are separated by tab-||-tab.

Alternative Output

If you just want to select lines from file2 whose dates are also in file1, then the code can be simplified:

$ awk 'NR==FNR{a[$1]++;next;} {if ($1 in a) print;}' file1 file2
10-Apr-00       00:00   7

Or, still simpler:

$ awk 'NR==FNR{a[$1]++;next;} ($1 in a)' file1 file2
10-Apr-00       00:00   7
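
The question's desired output also has the left file reduced to the common dates. That is not part of this answer, but the same idea can simply be run twice with the file order swapped; a minimal sketch (the output file names are just placeholders):

$ awk 'NR==FNR{a[$1]++;next;} ($1 in a)' file1 file2 > file2.filtered
$ awk 'NR==FNR{a[$1]++;next;} ($1 in a)' file2 file1 > file1.filtered

The first command keeps the file2 lines whose dates appear in file1; the second does the reverse.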
Answer by Lars Fischer (3 votes)

There is the relatively unknown unix command join. It can join sorted files on a key column.

To use it in your context, we follow this strategy (left.txt and right.txt are your files):

  1. add line numbers (so the original order can be restored in the last step)

    nl left.txt > left_with_lns.txt
    nl right.txt > right_with_lns.txt
    
  2. sort both files on the date column

    sort left_with_lns.txt -k 2 > sl.txt
    sort right_with_lns.txt -k 2 > sr.txt
    
  3. join the files on the date column (all times are 00:00). join would normally merge all columns of both files for each matching key, but we give it an output template (-o) so that the columns from the first file go to fl.txt and the columns from the second file to fr.txt; only lines with a matching key end up in the results

    join -j 2 -t $'\t' -o 1.1,1.2,1.3,1.4 sl.txt sr.txt > fl.txt
    join -j 2 -t $'\t' -o 2.1,2.2,2.3,2.4 sl.txt sr.txt > fr.txt
    
  4. sort both results numerically on the line-number column and cut that column away to get the final output

    sort -n fl.txt | cut -f 2- > left_filtered.txt
    sort -n fr.txt | cut -f 2- > right_filtered.txt
    

Tools used: cut, join, nl, sort.
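
For convenience, the four steps could be collected into one small script. This is only a sketch, assuming GNU coreutils, tab-separated input files named left.txt and right.txt, and bash (for the $'\t' syntax):

    #!/usr/bin/env bash
    # 1. Number the lines so the original order can be restored at the end.
    nl left.txt  > left_with_lns.txt
    nl right.txt > right_with_lns.txt

    # 2. Sort both files on the date column (field 2, after the line number).
    sort -k 2 left_with_lns.txt  > sl.txt
    sort -k 2 right_with_lns.txt > sr.txt

    # 3. Keep only rows whose date occurs in both files; write the
    #    left-hand columns to fl.txt and the right-hand columns to fr.txt.
    join -j 2 -t $'\t' -o 1.1,1.2,1.3,1.4 sl.txt sr.txt > fl.txt
    join -j 2 -t $'\t' -o 2.1,2.2,2.3,2.4 sl.txt sr.txt > fr.txt

    # 4. Restore the original order and drop the helper line numbers.
    sort -n fl.txt | cut -f 2- > left_filtered.txt
    sort -n fr.txt | cut -f 2- > right_filtered.txt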

Answer by Ruud Helderman (3 votes)

As requested by @Masi, I tried to work out a solution using sed.

My first attempt uses two passes; the first transforms file1 into a sed script that is used in the second pass to filter file2.

sed 's/\([^ \t]*\).*/\/^\1\t\/p;t/' file1 > sed1
sed -nf sed1 file2 > out2
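
With the sample file1 from John1024's answer, the generated sed1 would contain one print command per date, roughly like this (\t denoting a tab):

/^10-Apr-00\t/p;t
/^20-Apr-00\t/p;t

The second command, sed -nf sed1 file2, then prints only the file2 lines whose first field matches one of those dates.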

With big input files, this is s-l-o-w; for every line of file2, sed has to try as many patterns as there are lines in file1. I haven't done any profiling, but I wouldn't be surprised if the time complexity is quadratic.

My second attempt merges and sorts the two files, then scans through all lines in search of pairs. This runs in linear time and consequently is a lot faster. Please note that this solution will ruin the original order of the file; alphabetical sorting doesn't work too well with this date notation. Supplying files with a different date format (y-m-d) would be the easiest way to fix that.

sed 's/^[^ \t]\+/&@1/' file1 > marked1
sed 's/^[^ \t]\+/&@2/' file2 > marked2

sort marked1 marked2 > sorted

sed '$d;N;/^\([^ \t]\+\)@1.*\n\1@2/{s/\(.*\)\n\(.*\)/\2\n\1/;P};D' sorted > filtered
sed 's/^\([^ \t]\+\)@2/\1/' filtered > out2
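
To illustrate with the same sample files: after marking and sorting, the intermediate file sorted would look roughly like this (equal dates grouped together, the marker telling the files apart):

10-Apr-00@1     00:00   0
10-Apr-00@2     00:00   7
20-Apr-00@1     00:00   7
21-Apr-00@2     00:00   3

The two 10-Apr-00 lines form a pair, so the third sed command emits the @2 line, and out2 ends up containing only the 10-Apr-00 line from file2.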

Explanation:

  • In the first command, s/^[^ \t]\+/&@1/ appends @1 to every date. This makes it possible to merge the files, keep equal dates together when sorting, and still be able to tell lines from different files apart.
  • The second command does the same for file2; obviously with its own marker @2.
  • The sort command merges the two files, grouping equal dates together.
  • The third sed command returns all lines from file2 that have a date that also occurs in file1.
  • The fourth sed command removes the @2 marker from the output.

The third sed command in detail:

  • $d suppresses inappropriate printing of the last line
  • N reads and appends another line of input to the line already present in the pattern space
  • /^\([^ \t]\+\)@1.*\n\1@2/ matches two lines originating from different files but with the same date
  • { starts a command group
  • s/\(.*\)\n\(.*\)/\2\n\1/ swaps the two lines in the pattern space
  • P prints the first line in the pattern space
  • } ends the command group
  • D deletes the first line from the pattern space
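
For readability, the same third command could be kept in a script file and run with sed -f; a sketch (the file name pairs.sed is just a placeholder):

# pairs.sed: from the sorted, marked stream, print the file2 line of
# every date that occurs in both files.
$d
N
/^\([^ \t]\+\)@1.*\n\1@2/{
  s/\(.*\)\n\(.*\)/\2\n\1/
  P
}
D

Running sed -f pairs.sed sorted > filtered behaves like the one-liner above.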

The bad news is, even the second approach is slower than the awk approach made by @John1024. Sed was never designed to be a merge tool. Neither was awk, but awk has the advantage of being able to store an entire file in a dictionary, making @John1024's solution blazingly fast. The downside of a dictionary is memory consumption. On huge input files, my solution should have the advantage.