Counting Records in Linux Files Excluding Some Files


I have to count the number of records in 6 files, each containing 4 million records, and the count should be as fast as possible. However, there is another file with a similar name that should be omitted.

fileSales_1.txt (4 million records)

fileSales_2.txt (4 million records)

fileSales_3.txt (4 million records)

fileSales_4.txt (4 million records)

fileSales_5.txt (4 million records)

fileSales_6.txt (4 million records)

fileSales_unique.txt (24 million records)

I'm counting the records with the following command: awk 'END {print NR}' fileSales_*.txt

However, this also counts the fileSales_unique.txt file, giving a total of 48 million records.

Could you help me with a command that counts only the records in files 1 to 6? The result should be 24 million records, with something like: awk 'END {print NR}' fileSales_(1 to 6).txt


There are 2 answers

Answer by dawg (best answer)

Suppose you have these files (using wc -l to show both file names and line counts):

 4000000 fileSales_1.txt
 4000000 fileSales_2.txt
 4000000 fileSales_3.txt
 4000000 fileSales_4.txt
 4000000 fileSales_5.txt
 4000000 fileSales_6.txt
 24000000 fileSales_unique.txt
 24000000 fileSales_unique_also.txt
 72000000 total

There are many ways to achieve your goal, but two are the primary ones:

  1. Use a glob that only includes the desired files;
  2. Use an exclusion list or pattern that excludes the undesired files.

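Inclusion glob (the first is a Bash brace expansion rather than a true glob, the second matches any single character, and the third matches only the digits 1 to 6):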

  1. wc -l fileSales_{1..6}.txt
  2. wc -l fileSales_?.txt
  3. wc -l fileSales_[1-6].txt

With the files above, any of those gives the same result:

$ wc -l fileSales_[1-6].txt  
 4000000 fileSales_1.txt
 4000000 fileSales_2.txt
 4000000 fileSales_3.txt
 4000000 fileSales_4.txt
 4000000 fileSales_5.txt
 4000000 fileSales_6.txt
 24000000 total

(The same patterns work with awk.)
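To get a single number instead of a per-file listing, the matched files can be concatenated into wc -l. The following is a minimal sketch, assuming a scratch directory created with mktemp and tiny stand-in files (3 lines each rather than 4 million); the variable names are hypothetical and only the file names mirror the question. Brace expansion is left out so the sketch also runs in a plain POSIX shell.

```shell
# Scratch directory with small stand-ins for the real files
# (3 lines each instead of 4 million; sizes are an assumption for the demo).
demo_dir=$(mktemp -d)
for i in 1 2 3 4 5 6; do
    printf 'a\nb\nc\n' > "$demo_dir/fileSales_$i.txt"
done
printf 'a\nb\nc\n' > "$demo_dir/fileSales_unique.txt"

# The original glob also picks up fileSales_unique.txt.
all_total=$(cat "$demo_dir"/fileSales_*.txt | wc -l)

# The bracket glob matches only a single digit from 1 to 6.
wanted_total=$(cat "$demo_dir"/fileSales_[1-6].txt | wc -l)

# awk accepts the same glob; NR in the END block is the record count
# accumulated across all input files.
awk_total=$(awk 'END {print NR}' "$demo_dir"/fileSales_[1-6].txt)

echo "all: $all_total  wanted: $wanted_total  awk: $awk_total"
```

Reading from a pipe makes wc -l print only the count, without file names or a "total" line.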

Or, maintain a skip array in Bash:

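# Files to leave out of the count
skip=( *_unique* )
to_cnt_files=()
for fn in fileSales*.txt; do 
    # If deleting $fn from the skip entries changes them, $fn is on the skip list
    [[ "${skip[@]/$fn/}" != "${skip[@]}" ]] && continue
    to_cnt_files+=( "$fn" )
done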

Then your method works:

awk 'END{print NR}' "${to_cnt_files[@]}"
# 24000000
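The same exclusion idea can be written without Bash arrays. The sketch below is an alternative, not the method above: it uses a case pattern and the positional parameters so that it runs in any POSIX shell. The scratch directory, the 2-line stand-in files, and the variable names are assumptions made for the demo.

```shell
# Scratch directory with small stand-ins (2 lines each; sizes are an assumption).
demo_dir=$(mktemp -d)
for name in 1 2 3 4 5 6 unique unique_also; do
    printf 'x\ny\n' > "$demo_dir/fileSales_$name.txt"
done

# Collect every match except names containing "unique" into "$@".
set --
for fn in "$demo_dir"/fileSales_*.txt; do
    case ${fn##*/} in           # test only the file name, not the directory
        *unique*) continue ;;   # skip the undesired files
    esac
    set -- "$@" "$fn"
done

kept_count=$#
kept_total=$(cat "$@" | wc -l)
echo "files kept: $kept_count, records: $kept_total"
```

Note that set -- overwrites the script's own positional parameters, so in a larger script this loop is better placed inside a function.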

Note that wc -l will likely be much faster than awk in this case, since it only has to count newline characters, whereas awk reads and processes every line as a record.

Answer by Romeo Ninov

As mentioned in the comments, you do not need awk to count the records. You can use wc:

wc -l fileSales_?.txt

This matches all file names that start with fileSales_, followed by exactly one character of any kind, and then .txt. To restrict the match to the digits 1 to 6, use:

wc -l fileSales_[1-6].txt

The same with awk:

awk 'END {print NR}' fileSales_?.txt
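The difference between the two patterns only shows up when other single-character names exist. As a rough illustration, the sketch below assumes a scratch directory with two extra hypothetical files, fileSales_7.txt and fileSales_A.txt, which are not part of the question, and counts how many names each pattern expands to.

```shell
# Scratch directory with empty stand-in files; fileSales_7.txt and
# fileSales_A.txt are hypothetical extras added only for this demo.
demo_dir=$(mktemp -d)
for name in 1 2 3 4 5 6 7 A unique; do
    : > "$demo_dir/fileSales_$name.txt"
done

# '?' matches any single character: 1 to 6, but also 7 and A.
set -- "$demo_dir"/fileSales_?.txt
question_mark_matches=$#

# '[1-6]' matches only the digits 1 through 6.
set -- "$demo_dir"/fileSales_[1-6].txt
bracket_matches=$#

echo "? matched $question_mark_matches files, [1-6] matched $bracket_matches files"
```

In both cases fileSales_unique.txt is excluded, because "unique" is longer than one character.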