Efficient way to join the nth column of all files in a directory?


A for loop is far too slow. The files have 500k lines each. I want to join specifically the 4th column of all the files, appending column after column to the right.

The columns in each file are tab-separated.

col1 col2 col3 col4 col5
a 0 0 -1 0.001
b 1 0  2 0.004
c 2 0 3 0

col1 col2 col3 col4 col5
c 2 0 -9 0.004
s 1 0  5 0.002
d 3 0 3 0.4

col1 col2 col3 col4 col5
r 2 1 0 0.4
j 1 1 1 0.2
r 3 1 2 0.1

I want:

file1 file2 file3
-1 -9 0
2 5 1
3 3 2

I tried first converting to .csv:

for file in $(ls); do awk '{$1=$1}1' OFS=',' "${file}" > "${file}.csv"; done

And then doing this:

eval paste -d, $(printf "<(cut -d, -f4 %s) " *.csv)

But I get this error: paste: /dev/fd/19: Too many open files

I have to join 400 files of 500k lines each.


There are 5 answers

tripleee (accepted answer)

Your OS doesn't allow you to paste that many files in one go. You'll have to break them up into smaller batches. Here's a simple way to process one file at a time.

for file in *.csv; do
    if [ -e tempfile ]; then
        paste -d, tempfile <(cut -d, -f4 "$file") >tempfile2
        mv tempfile2 tempfile
    else
        cut -d, -f4 "$file" >tempfile
    fi
done
mv tempfile result.csv

As an aside, don't use ls in scripts. You simply want

for file in *; do
    awk '{$1=$1}1' OFS=',' "$file" > "$file.csv"
done

... but there is no reason to convert each file to CSV separately. You can fold both operations into one:

rm -f tempfile
for file in *; do
    case $file in tempfile | tempfile2 | result.csv) continue;; esac
    if [ -e tempfile ]; then
        paste -d, tempfile <(awk '{print $4}' "$file") >tempfile2
        mv tempfile2 tempfile
    else
        awk '{ print $4 }' "$file" >tempfile
    fi
done
mv tempfile result.csv
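
The "smaller batches" idea can also be done with one paste per batch rather than one file at a time. A sketch, assuming POSIX sh and no whitespace in file names (the sample data, batch size, and file names here are illustrative):

```shell
#!/bin/sh
# Sketch: two-level paste to stay under the open-file limit.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'a,0,0,-1\nb,1,0,2\n' > f1.csv    # illustrative sample inputs
printf 'c,2,0,-9\ns,1,0,5\n' > f2.csv
printf 'r,2,1,0\nj,1,1,1\n'  > f3.csv

batch=2      # illustrative; in practice keep this below `ulimit -n`
n=0
for f in *.csv; do            # one column-4 strip per input file
    cut -d, -f4 "$f" > "strip.$n"
    n=$((n + 1))
done

set --                        # positional params collect the partial results
i=0 part=0
while [ "$i" -lt "$n" ]; do
    files=; j=0
    while [ "$j" -lt "$batch" ] && [ "$i" -lt "$n" ]; do
        files="$files strip.$i"
        i=$((i + 1)); j=$((j + 1))
    done
    paste -d, $files > "part.$part"    # word splitting is intentional here
    set -- "$@" "part.$part"
    part=$((part + 1))
done
paste -d, "$@" > result.csv            # join the (few) partials
cat result.csv
```

Each paste now opens at most $batch input files, so 400 inputs with a batch size of 100 need only five paste invocations in total.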

anubhava

Here is how you can do this in a single awk, which will be far more efficient than a shell loop and all the extra commands inside the loop:

awk -F '\t' '
FNR == 1 {
   fn = FILENAME
   sub(/\.[^.]+$/, "", fn)
   rec[FNR] = (FNR in rec ? rec[FNR] FS : "") fn
   next
}
{
   rec[FNR] = (FNR in rec ? rec[FNR] FS : "") $4
   m = FNR
}
END {
   for (i=1; i<=m; ++i)
      print rec[i]
}' file{1..3}

file1   file2   file3
-1  -9  0
2   5   1
3   3   2

Ed Morton

Using any awk, and assuming that all of your files have the same number of lines, that none are empty, that the separators between fields may be more than just tabs (per a comment you made), that you don't have any empty fields, and that you actually want CSV output:

$ cat tst.awk
BEGIN { OFS="," }
FNR == 1 { val = FILENAME }
FNR  > 1 { val = $4 }
{ vals[FNR] = ( FNR in vals ? vals[FNR] OFS : "" ) val }
END {
    for ( i=1; i<=FNR; i++ ) {
        print vals[i]
    }
}

$ awk -f tst.awk file{1..3}
file1,file2,file3
-1,-9,0
2,5,1
3,3,2

If the "weird blank spaces" you mention can be control characters and you have a POSIX awk then change BEGIN { OFS="," } to BEGIN { FS="[[:space:][:cntrl:]]+"; OFS="," } to set FS appropriately or use the equivalent FS="[^[:graph:]]+", whichever you prefer. If you don't have a POSIX awk then FS="[^a-zA-Z_0-9.-]+" might work for you.
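
A quick illustrative check of that field-splitting idea. The \036 byte below is an arbitrary stand-in for one of those control characters, and this assumes an awk that supports POSIX character classes:

```shell
# Split fields on any run of non-graphic bytes: tabs, spaces,
# and control characters alike. \036 is a made-up control byte.
printf 'a\t0\036 3\n' |
awk 'BEGIN { FS = "[^[:graph:]]+"; OFS = "," } { print $1, $2, $3 }'
# prints: a,0,3
```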

Daweo

But I get this error: paste: /dev/fd/19: Too many open files

I have to join 400 files of 500k lines each.

According to "Fixing the 'Too many open files' Error in Linux" on Baeldung, there are two limits involved in this error: a soft limit and a hard limit. You can check their current values with

ulimit -Sn

and

ulimit -Hn

respectively. If the latter is greater than 400, you can avoid the error by raising the soft limit high enough; in your case I would suggest

ulimit -n 512

This solution is machine-dependent, so I could not test it. Please try

ulimit -n 512 && eval paste -d, $(printf "<(cut -d, -f4 %s) " *.csv)

and report the result.
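
A defensive way to script that check (a sketch; only the soft limit is touched, and 512 mirrors the value suggested above):

```shell
#!/bin/sh
# Sketch: raise the soft open-file limit to 512 if it is lower and
# the hard limit permits; otherwise suggest batching instead.
want=512
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
if [ "$soft" = unlimited ] || [ "$soft" -ge "$want" ]; then
    echo "soft limit $soft is already sufficient"
elif [ "$hard" = unlimited ] || [ "$hard" -ge "$want" ]; then
    ulimit -S -n "$want"
    echo "soft limit raised to $(ulimit -Sn)"
else
    echo "hard limit $hard is below $want; paste in batches instead"
fi
```

Note that the new limit applies only to the current shell and its children, so the paste has to run from the same session.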

dawg

I have created the following 40 test files:

$ head -3 file_*
==> file_01 <==
Col 1   Col 2   Col 3   Col 4   Col 5
0.56    0.90    0.75    0.25    0.95
0.40    0.26    0.99    0.05    0.06

==> file_02 <==
Col 1   Col 2   Col 3   Col 4   Col 5
0.62    0.18    0.01    0.85    0.29
0.82    0.53    0.99    0.78    0.91

==> file_03 <==
Col 1   Col 2   Col 3   Col 4   Col 5
0.20    0.80    0.97    0.17    0.23
0.87    0.03    0.61    0.88    0.03

...

==> file_40 <==
Col 1   Col 2   Col 3   Col 4   Col 5
0.98    0.12    0.02    0.84    0.36
0.57    0.31    0.65    0.92    0.95

Each has 500,000 lines.

I tested the time performance of each entry in this post with Bash's time builtin (not the most accurate, but indicative).

I also added two entries of my own, and edited tripleee's solution so that it produces the same tab-delimited result (and corrected a glob issue that kept it from completing).

A Ruby:

ruby -e '
BEGIN{files=Hash.new {|h,k| h[k] = []} } 
ARGV.each{|fn| fh=File.open(fn)
    fh.each_line.with_index{|line,i| files[fn]<<line.split[3] if i>0}
}
END{
    puts files.keys.join("\t")
    files.values.transpose.each{|row| puts row.join("\t")}
}' file_* >tst_1

This pipe with GNU awk (for the ENDFILE pattern) and GNU datamash

gawk 'BEGIN{FS=OFS="\t"} 
FNR==1 {printf "%s",FILENAME; next}
{printf "%s%s", OFS, $4}
ENDFILE{print ""}' file_* | datamash transpose >tst_5

I edited tripleee's solution so it runs on my computer and produces the same results:

for file in file_*; do
    if [ -e tempfile ]; then
        paste -d$'\t' tempfile <(awk 'BEGIN{FS="\t"} FNR==1{print FILENAME; next}{ print $4 }' "$file") >tempfile2
        mv tempfile2 tempfile
    else
        awk 'BEGIN{FS="\t"} FNR==1{print FILENAME; next}{ print $4 }' "$file" >tempfile
    fi
done
mv tempfile tst_4

Each of those produces the 'correct' output as I understand it:

$ head tst_{1,4,5}
==> tst_1 <==
file_01 file_02 file_03 file_04 file_05 file_06 file_07 file_08 file_09 file_10 file_11 file_12 file_13 file_14 file_15 file_16 file_17 file_18 file_19 file_20 file_21 file_22 file_23 file_24 file_25 file_26 file_27 file_28 file_29 file_30 file_31 file_32 file_33 file_34 file_35 file_36 file_37 file_38 file_39 file_40
0.25    0.85    0.17    0.01    0.89    0.91    0.27    0.27    0.42    0.71    0.59    0.42    0.57    0.13    0.13    0.45    0.31    0.87    0.54    0.55    0.14    0.06    0.06    0.38    0.14    0.11    0.15    0.72    0.07    1.00    1.00    0.28    0.62    0.71    0.09    0.78    0.90    0.90    0.10    0.84

==> tst_4 <==
file_01 file_02 file_03 file_04 file_05 file_06 file_07 file_08 file_09 file_10 file_11 file_12 file_13 file_14 file_15 file_16 file_17 file_18 file_19 file_20 file_21 file_22 file_23 file_24 file_25 file_26 file_27 file_28 file_29 file_30 file_31 file_32 file_33 file_34 file_35 file_36 file_37 file_38 file_39 file_40
0.25    0.85    0.17    0.01    0.89    0.91    0.27    0.27    0.42    0.71    0.59    0.42    0.57    0.13    0.13    0.45    0.31    0.87    0.54    0.55    0.14    0.06    0.06    0.38    0.14    0.11    0.15    0.72    0.07    1.00    1.00    0.28    0.62    0.71    0.09    0.78    0.90    0.90    0.10    0.84

==> tst_5 <==
file_01 file_02 file_03 file_04 file_05 file_06 file_07 file_08 file_09 file_10 file_11 file_12 file_13 file_14 file_15 file_16 file_17 file_18 file_19 file_20 file_21 file_22 file_23 file_24 file_25 file_26 file_27 file_28 file_29 file_30 file_31 file_32 file_33 file_34 file_35 file_36 file_37 file_38 file_39 file_40
0.25    0.85    0.17    0.01    0.89    0.91    0.27    0.27    0.42    0.71    0.59    0.42    0.57    0.13    0.13    0.45    0.31    0.87    0.54    0.55    0.14    0.06    0.06    0.38    0.14    0.11    0.15    0.72    0.07    1.00    1.00    0.28    0.62    0.71    0.09    0.78    0.90    0.90    0.10    0.84

And here are the times of each:

dawg gawk pipe:  real   0m5.697s
dawg Ruby:       real   0m17.668s
anubhava awk:    real   0m24.094s
Ed Morton awk:   real   0m24.345s
tripleee paste:  real   1m21.150s

If performance is your goal, use datamash. In my test it is over 4x faster than the awk solutions and 14x faster than using a Bash loop with paste. The Ruby is somewhat faster than the awks.

If you want to generate the test files, you can use this script:

#!/bin/bash

cd /tmp 

cnt=499999
for x in {01..40}; do
    fn="file_$x"
    echo "$fn"
    gawk -v cnt="$cnt" 'BEGIN{
        srand()
        OFS="\t"; col_cnt=5
        for(col=1; col<=col_cnt; col++)
            printf "%s%s%s", "Col ",col, (col==col_cnt ? ORS : OFS)
        for(row=1;row<=cnt;row++)
            for(col=1; col<=col_cnt; col++)
                printf "%.2f%s", rand(), (col==col_cnt ? ORS : OFS)
    }' >"$fn"
done