I need to read a CSV file, but the file contains broken lines. Here's an example:
```
"name","address","link"
"7eleven","city, street, 1",https://somelink/1        <- the good line
Baby-Gym,"city, street, 2\",https://somelink/2        <- the broken line: it contains the \", sequence
```
In this example, the second data line is broken: the value for "address" contains the \", sequence, so the closing quote is treated as escaped. I cannot change the CSV file; I just want to read it and skip the broken lines. However, I am seeing unexpected behavior with the com.opencsv library: when csvReader.readNext() hits the broken line, it consumes many following lines (approximately 5k of them) as part of a single record.
Here is the code I am using to read the CSV file:
```java
try (Reader reader = new BufferedReader(new InputStreamReader(is))) {
    CSVParser parser = new CSVParserBuilder()
            .withSeparator(',')
            .withQuoteChar('"')
            .build();
    try (CSVReader csvReader = new CSVReaderBuilder(reader)
            .withSkipLines(1)
            .withCSVParser(parser)
            .build()) {
        Set<info> infoList = new HashSet<>();
        String[] infoParts;
        while ((infoParts = csvReader.readNext()) != null) {
            // code
        }
    }
}
```
How can I read the file line by line with OpenCSV so that a single broken line costs me only that line, rather than the ~5k lines that follow it? I can't find information anywhere on how to solve this problem.
I tried new CSVReaderBuilder(reader).withMultilineLimit(1), but it just throws exceptions. I looked through the CSVParser and CSVReader documentation but didn't find the necessary settings. Please help me.
There is no fully correct way of doing this, because quoted CSV fields are allowed to contain newlines. If you can require that the only newlines present are actually record separators, then you can pre-split your data into lines before parsing.
This means creating a new CSVParser for each row, which may be expensive -- I haven't benchmarked this approach:
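A minimal sketch of that idea, assuming the file fits the one-record-per-physical-line requirement (the class and method names here are illustrative, not part of OpenCSV). Each physical line is handed to a fresh `CSVParser` via its single-line `parseLine` method; a line with an unterminated quoted field makes `parseLine` throw an `IOException`, which we catch and treat as "skip this broken line":

```java
import com.opencsv.CSVParser;
import com.opencsv.CSVParserBuilder;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class SkipBrokenLines {

    /**
     * Parses each physical line independently and drops lines the parser
     * rejects. A new CSVParser is built per line so that no parser state
     * from a broken line can leak into the next one.
     */
    static List<String[]> readValidRecords(BufferedReader reader) throws IOException {
        List<String[]> records = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            CSVParser parser = new CSVParserBuilder()
                    .withSeparator(',')
                    .withQuoteChar('"')
                    .build();
            try {
                records.add(parser.parseLine(line));
            } catch (IOException e) {
                // Un-terminated quoted field on this line: it is one of the
                // broken lines, so ignore it and move on to the next line.
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Same shape as the sample data from the question: the third
        // physical line contains the \", sequence and never closes its quote.
        String csv = "\"name\",\"address\",\"link\"\n"
                + "\"7eleven\",\"city, street, 1\",https://somelink/1\n"
                + "Baby-Gym,\"city, street, 2\\\",https://somelink/2\n";
        List<String[]> rows = readValidRecords(new BufferedReader(new StringReader(csv)));
        for (String[] row : rows) {
            System.out.println(String.join("|", row));
        }
    }
}
```

Because only one line is fed to `parseLine` at a time, a dangling quote can never swallow the thousands of lines that follow it; the damage is contained to the single broken line.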