I wrote a parser for CSV files. I use the SuperCSV library for Java.
At first everything worked, but now I have run into a problem: I have started receiving strange CSV files. I always open them in Notepad++. At first glance the file looks normal, and the encoding shown in the lower right corner is the standard UTF-8, which is fine.
At the same time, the text contains strange NUL characters (spelled with a single "L").
Because of them, the file cannot be parsed. I started debugging the code and this is what I discovered: first comes the file header with the column names, then two lines without the NUL character. These two lines are parsed normally.
But then the third line contains the NUL character for the first time, and from that moment on everything is parsed incorrectly. The library stops recognizing the end of the line (the \n character) and the delimiter character (symbol |), and tries to parse several lines as one:
// I use this preference:
private static final CsvPreference CSV_PREFERENCE = new CsvPreference.Builder('\u0000', '|', "\n").build();
As a result, we get this error:
2023-10-22T13:18:27,208: ERROR [executor-4] service.ParseServiceImpl - The number of columns to be processed (33) must match the number of CellProcessors (13): check that the number of CellProcessors you have defined matches the expected number of columns being read/written
org.supercsv.exception.SuperCsvException: The number of columns to be processed (33) must match the number of CellProcessors (13): check that the number of CellProcessors you have defined matches the expected number of columns being read/written
at org.supercsv.util.Util.executeCellProcessors(Util.java:78) ~[super-csv-2.1.0.jar:?]
at org.supercsv.io.AbstractCsvReader.executeProcessors(AbstractCsvReader.java:203) ~[super-csv-2.1.0.jar:?]
at org.supercsv.io.CsvBeanReader.read(CsvBeanReader.java:206) ~[super-csv-2.1.0.jar:?]
Can you please tell me what this strange NUL symbol is and why it appears? It is what breaks the parsing.
As a rule of thumb, you should always sanitise input files. Remove any special characters that you don't need, that could be used as an attack vector to compromise security, or that you know you can't process or that are invalid in your context.
When you read this CSV, choose a range of ASCII/UTF-8 characters that you are ready to support, then remove everything else from the file. You should not trust whoever is creating this CSV file.
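One way to do that filtering is to wrap the file's Reader before the CSV library ever sees it. The following is only a minimal sketch: `SanitizingReader`, `isAllowed` and `sanitize` are hypothetical names I made up, and the whitelist (tab, line breaks, and everything from space upward except DEL) is an assumption that you should adjust to your own data.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of SuperCSV: drops characters outside a whitelist.
class SanitizingReader extends FilterReader {

    SanitizingReader(Reader in) {
        super(in);
    }

    // Assumed whitelist: tab, LF, CR, and everything from space upward except DEL.
    // NUL (code 0) and the other control characters are rejected.
    static boolean isAllowed(int c) {
        return c == '\t' || c == '\n' || c == '\r' || (c >= 0x20 && c != 0x7F);
    }

    @Override
    public int read() throws IOException {
        int c;
        do {
            c = super.read();
        } while (c != -1 && !isAllowed(c));
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n == -1) {
                return -1;
            }
            // Compact the allowed characters to the front of the chunk.
            int kept = 0;
            for (int i = 0; i < n; i++) {
                char c = buf[off + i];
                if (isAllowed(c)) {
                    buf[off + kept++] = c;
                }
            }
            if (kept > 0) {
                return kept;
            }
            // The whole chunk was filtered out; read the next one.
        }
    }

    // Convenience wrapper for in-memory text, mainly useful for quick checks.
    static String sanitize(String text) {
        try (Reader in = new SanitizingReader(new StringReader(text))) {
            StringBuilder out = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = in.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sanitize("a|b\u0000c|d")); // prints a|bc|d
    }
}
```

With SuperCSV you would then pass something like `new SanitizingReader(fileReader)` to the `CsvBeanReader` constructor instead of the raw reader. One more thing worth checking, although this is a guess on my part: the preference in the question passes '\u0000' as the first argument of `CsvPreference.Builder`, which is the quote character. If so, a NUL in the data may be read as an opening quote, which would explain why the delimiters and line breaks after it are swallowed.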
If you own the CSV source system, look at how it creates this file; that might give you a hint as to why it is adding the NUL. NUL is an actual character whose code is zero (U+0000), unlike null, which means there is no value at all.
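To make the NUL versus null distinction concrete, here is a small sketch in Java. The class `NulVersusNull` and the helper `countNul` are hypothetical names used only for illustration; a counter like this could also help you find which lines of the file carry the character.

```java
class NulVersusNull {

    // Counts NUL characters (code 0) in a line; a null reference counts as none.
    static int countNul(String line) {
        if (line == null) {
            return 0;
        }
        int count = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == '\u0000') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        char nul = '\u0000';
        String withNul = "abc" + nul + "def";
        String nothing = null;

        System.out.println((int) nul);          // prints 0
        System.out.println(withNul.length());   // prints 7, the NUL occupies a position
        System.out.println(countNul(withNul));  // prints 1
        System.out.println(nothing == null);    // prints true, there is no string at all
    }
}
```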