Word count command returning incorrect row count for CSV files

389 views Asked by At

I am running the following command to count the number of rows in a CSV file:

wc -l filename.csv

Sometimes the row count is correct and other times it is 1 less than the actual count. All of the files are in the same format with the same header row. My first suspicion is the header row but it doesn't seem like the command would differentiate between the header and other rows. The files are saved with the same encoding utf-8.

Is this an issue with the formatting of the CSV files and/or a nuance of the wc command?

4

There are 4 answers

3
markp-fuso On BEST ANSWER

Assumptions:

  • OP's issue is the occasional file that's missing a \n at the end of the last line
  • the .csv data fields do not include embedded linefeeds (\n); otherwise the count will return the number of lines in the file as opposed to the number of records in the file

Setup some test files:

$ printf "a\nb\nc"   > file1            # no \n on the last line
$ printf "a\nb\nc\n" > file2

$ od -c file1
0000000   a  \n   b  \n   c  
0000005                      ^^-- terimination is missing

$ od -c file2
0000000   a  \n   b  \n   c  \n
0000006                      ^^-- terminate last line

We can see that wc -l 'misses' the last line from file1:

$ wc -l file1
2 file1

$ wc -l file2
3 file2

One awk approach that works for both files:

$ awk 'END {print NR}' file1
3

$ awk 'END {print NR}' file2
3

An alternative approach would have the OP go through and append a \n to the last line of files where the \n is missing.

A web search on bash insure last line has linefeed would be a good place to start looking for solutions ...

2
Gilles Quénot On

With a real CSV parser, included with csvkit:

$ cat file.csv
col1,col2,col3
aa,bb,"c
c"

$ csvstat --count file.csv
1

Can be installed on all OSes with:

pip install csvkit

On Debian and derivatives:

$ apt show csvkit
[...]
Package: csvkit
Architecture: all
Version: 1.0.6-1
Priority: optional
Depends: python3:any, python3-csvkit
Size: 27172
Homepage: https://github.com/wireservice/csvkit
Description-en: command-line tools for working with CSV
 csvkit is a suite of command-line tools for converting to and working with
 CSV, the king of tabular file formats.
 .
 It is inspired by pdftk, gdal and the original csvcut tool by Joe Germuska and
 Aaron Bycoffe.
 .
 If you need to do more complex data analysis than csvkit can handle, use
 agate.
 .
 This package provides the command-line tools.
0
dawg On

Suppose you have:

cat file
Côl 1,Cól 2,Cöl 3
1,2,æ
3,4,ç
5,6,"new 
line"

This CSV file has a header and a field with a new line. If you run wc -l on that file, it counts the number of \n encountered -- even in a field. If there is a terminating \n on the last line or not will potentially throw the count off by one.

You can only use wc -l when 1) You know there are no fields with newlines and 2) There is a proper terminating \n at the end of the last line. Granted -- most CSV files qualify, until they do not.

You can use Ruby to count the actual data parsed in a CSV file:

ruby -r csv -e '
puts CSV.parse($<.read, **{:headers=>false}).count
' file

Prints:

4

If you don't want to count the header as a data line:

ruby -r csv -e '
puts CSV.parse($<.read, **{:headers=>true}).count
' file
# 3

Or use Miller:

mlr --csv count file
count
3

Both Miller and Ruby CSV work regardless of whether there is a \n in a field or a terminating \n on the last line. Even if a field has a \n in it, both Miller and Ruby count that as a single record.

0
RARE Kpop Manifesto On

maybe this would work :

 for __ in $'col1,col2,col3\naa,bb,"c\nc"'  
           $'col1,col2,col3\naa,bb,"c\nc"\n' 
           $'a\nb\nc'
           $'a\nb\nc\n'  
           $'Côl 1,Cól 2,Cöl 3\n1,2,æ\n3,4,ç\n5,6,"new\nline"' ; do

     printf '----\n|%s|\n-----\n :: %s\n' "$__" "$( 

     printf '%s' "$__" | 
 gawk -F',' 'BEGIN {   OFS = "\13 |" 
             }     {  ___ += ($NF ~ /"/) - ($1 ~ /"/)
                     ____ += length($0)
                    _____ +=  match($0, /$/) - 1 
 } END { 
     print __ = "", ("gross NR :: ") (_ = NR), 
                    (  "adj NR :: ")((_-= __ = RT == __) - ___), 
                    (   "chars :: ") (_ + ____),
                    (   "bytes :: ") (_ + _____) }' )"
 done 

----
|col1,col2,col3
aa,bb,"c
c"|
-----
 :: 
     |gross NR :: 3
                    |adj NR :: 1
                                 |chars :: 26
                                              |bytes :: 26
----
|col1,col2,col3
aa,bb,"c
c"
|
-----
 :: 
     |gross NR :: 3
                    |adj NR :: 2
                                 |chars :: 27
                                              |bytes :: 27
----
|a
b
c|
-----
 :: 
     |gross NR :: 3
                    |adj NR :: 2
                                 |chars :: 5
                                             |bytes :: 5
----
|a
b
c
|
-----
 :: 
     |gross NR :: 3
                    |adj NR :: 3
                                 |chars :: 6
                                             |bytes :: 6
----
|Côl 1,Cól 2,Cöl 3
1,2,æ
3,4,ç
5,6,"new
line"|
-----
 :: 
     |gross NR :: 5
                    |adj NR :: 3
                                 |chars :: 44
                                              |bytes :: 49

The chars/bytes count matches that from gnu-wc :::

printf '%s' "$__" | gwc -cm

 26      26
 27      27
  5       5
  6       6
 44      49