Process the latest version of a file ..... as the 2nd file of a pair of files that are to be processed

Using gawk, I want to process two files in a directory. The first file has a fixed name. The name of the second file starts with a constant prefix but ends in a date and time stamp, which changes every time the file is created. I want to use the latest version of the second file.

I have seen an answer to a similar but less complicated question, "how to pass the most recent file from a directory to awk input file?", and the code ls -lr 2nd_file_*| tail -n 1 does show me the latest file. However, I do not know how to pass the file name it finds to gawk as the second file.

Currently I type the date/time stamp into the gawk command by hand, e.g.:

gawk -F'[,\t}]' '{ do something }' file_1 2nd_file_2024_03_21_[18-21-32] > output_file

Does anyone know how I can do this? Thanks.

I haven't tried anything as I haven't a clue how to.

There are 2 answers

markp-fuso (BEST ANSWER)

Setting aside the various issues with parsing 'ls' output, one simple approach is to replace the 2nd file argument (to the awk script) with a command substitution that runs the ls|tail call, e.g.:

awk '{ do something }' file_1 "$( ls -1r 2nd_file_* | tail -n 1 )"

NOTE: OP has stated this particular ls|tail combo provides the desired file name so I'm merely copying it here as an example.
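
If parsing ls output is a concern, the shell's own glob expansion can supply the name instead. This is only a minimal sketch, assuming the timestamps in the names are zero-padded so that sorted order matches chronological order. The scratch directory, the sample files and the variable names (demo_dir, latest, result) are made up for illustration:

```shell
# Build some sample files in a scratch directory (hypothetical demo data).
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf 'line_1\n' > file_1
for day in 21 22 23 24; do
    printf '%s\n' "$day" > "2nd_file_2024_03_$day"
done

# The shell expands a glob in sorted order, so after the loop
# 'latest' holds the last (lexicographically greatest) match.
latest=
for f in 2nd_file_*; do
    [ -e "$f" ] && latest=$f     # -e guards against an unmatched glob
done

if [ -n "$latest" ]; then
    result=$(awk '{ print }' file_1 "$latest")
    printf '%s\n' "$result"
else
    echo 'no 2nd_file_* found' >&2
fi
```

With the sample files above this prints line_1 followed by 24. The loop uses only POSIX sh features, so it does not depend on Bash arrays.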


To see this in action we'll start with some sample files:

$ head *
==> 2nd_file_2024_03_21 <==
21

==> 2nd_file_2024_03_22 <==
22

==> 2nd_file_2024_03_23 <==
23

==> 2nd_file_2024_03_24 <==
24

==> file_1 <==
line_1

To obtain the latest 2nd_file_* we need a tweak to OP's current ls|tail, namely dropping the -r so that the newest name is listed last:

$ ls -1 2nd_file_* | tail -n 1
2nd_file_2024_03_24

Wrapping this in a command substitution and feeding it to a simple awk script that prints each input line to stdout:

$ awk '{ print }' file_1 "$( ls -1 2nd_file_* | tail -n 1 )"
line_1                                                       # line from file_1
24                                                           # line from 2nd_file_2024_03_24
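
Quoting the command substitution is worth considering: unquoted, its result undergoes word splitting and pathname expansion, which could matter for names like the 2nd_file_2024_03_21_[18-21-32] shown in the question. A sketch of the quoted form, assuming the brackets really are part of the file name (the sample files and the variable names second and result are invented for the demo):

```shell
# Sample files in a scratch directory (hypothetical demo data).
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf 'line_1\n' > file_1
printf 'older\n' > '2nd_file_2024_03_21_[18-21-32]'
printf 'newer\n' > '2nd_file_2024_03_22_[09-05-00]'

# Same ls|tail call as above; the result is stored once and then
# passed in double quotes so it stays a single, literal argument.
second=$(ls -1 2nd_file_* | tail -n 1)
result=$(awk '{ print FILENAME ": " $0 }' file_1 "$second")
printf '%s\n' "$result"
```

Storing the name in a variable first also makes it easy to check that a file was actually found before awk is run.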

James Brown

The shell (Bash at least) expands globs in lexicographic (alphabetical) order, and since the dates in your filenames seem to have leading zeroes (hopefully the time parts do as well), 2nd_file* will list the latest file last.
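
The leading zeroes matter because the comparison is character by character, not numeric. A small sketch of the difference, using made-up names (the unpadded 2024_3_9 style is hypothetical and not taken from the question; last_match is a helper invented for the demo):

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Hypothetical names: the same two dates, without and with zero padding.
touch 2nd_file_2024_3_9   2nd_file_2024_3_10
touch 2nd_file_2024_03_09 2nd_file_2024_03_10

# Print the last of the given names (globs expand in sorted order).
last_match() {
    last=
    for f in "$@"; do last=$f; done
    printf '%s\n' "$last"
}

unpadded_last=$(last_match 2nd_file_2024_3_*)
padded_last=$(last_match 2nd_file_2024_03_*)

echo "unpadded: $unpadded_last"   # the 9th sorts after the 10th
echo "padded:   $padded_last"     # the 10th sorts last, as wanted
```

Without padding the glob ends on 2nd_file_2024_3_9, because the character 1 sorts before 9; with padding it ends on 2nd_file_2024_03_10.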

First some test data:

$ for i in file_1 2nd_file_2 2nd_file_1  # test stuff, notice order
> do
>   echo $i > $i                         # one line of content in each file
>   sleep 2                              # some time difference
> done
$ ls -lrt
total 12
-rw-r--r-- 1 james james  7 Mar 28 10:00 file_1
-rw-r--r-- 1 james james 11 Mar 28 10:00 2nd_file_2
-rw-r--r-- 1 james james 11 Mar 28 10:00 2nd_file_1

And some GNU awk:

$ gawk '
{
    if(ARGIND==1 || !((ARGIND+1) in ARGV))  # process 1st and last file
        print FILENAME                      # do something
    else
        nextfile                            # jump to nextfile
}' file_1 2nd_file_*

Output:

file_1
2nd_file_2

From the GNU awk manual, on ARGIND: "The index in ARGV of the current file being processed. Every time gawk opens a new data file for processing, it sets ARGIND to the index in ARGV of the file name. When gawk is processing the input files, 'FILENAME == ARGV[ARGIND]' is always true."
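
The manual's description can be checked directly. A sketch, assuming gawk is installed (ARGIND is a gawk extension); the sample files mirror the test data above, and argind_report is a made-up variable name:

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
for name in file_1 2nd_file_1 2nd_file_2; do
    printf '%s\n' "$name" > "$name"
done

if command -v gawk >/dev/null 2>&1; then
    # On the first record of each file, print ARGIND, FILENAME and
    # whether FILENAME == ARGV[ARGIND] holds (1 means true).
    argind_report=$(gawk 'FNR == 1 { print ARGIND, FILENAME, (FILENAME == ARGV[ARGIND]) }' file_1 2nd_file_*)
    printf '%s\n' "$argind_report"
else
    echo 'gawk not found; ARGIND is gawk-specific' >&2
fi
```

With gawk available this prints "1 file_1 1", "2 2nd_file_1 1" and "3 2nd_file_2 1", one per line, so the last file is the one whose ARGIND + 1 is no longer an index in ARGV.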

Of course, if the files had more than one record in them, the above would print FILENAME once per record for the first and last files (which you could fix by removing the else), so eventually you want:

$ gawk '
!(ARGIND==1 || !((ARGIND+1) in ARGV)) {  # if not first or last file
    nextfile                             # skip it
}
{
    ;                                    # do stuff   
}' file_1 2nd_file_*
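
A variation on the same idea, sketched here under the assumption that only the first and the last argument should be read: blank out the unwanted ARGV entries in a BEGIN block, since awk skips arguments that are empty strings. This avoids ARGIND, so it should also work in awks other than gawk. The sample files and the result variable are invented for the demo:

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
for name in file_1 2nd_file_1 2nd_file_2 2nd_file_3; do
    printf '%s a\n%s b\n' "$name" "$name" > "$name"   # two records per file
done

result=$(awk '
BEGIN {
    # Keep ARGV[1] and ARGV[ARGC-1]; empty arguments are skipped as input.
    for (i = 2; i < ARGC - 1; i++)
        ARGV[i] = ""
}
{ print FILENAME ": " $0 }   # do stuff
' file_1 2nd_file_*)
printf '%s\n' "$result"
```

Here only the records of file_1 and 2nd_file_3 are printed, and the skipped files are never opened at all.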