Amazon S3 Select Issue : not supporting line break occurring inside fields

2.2k views Asked by At

I am trying to use Amazon S3 Select to read records from a CSV file and if the field contains a line break(\n), then the record is not being parsed as a single record. Also, the line break inside the field has been properly escaped by double quotes as per standard CSV format.

For example, the below CSV file

Id,Name,Age,FamilyName,Place

p1,Albert Einstein,25,"Einstein

Cambridge",Cambridge

p2,Thomas Edison,30,"Edison

Cardiff",Cardiff

is being parsed as

Line 1 : Id,Name,Age,FamilyName,Place

Line 2 : p1,Albert Einstein,25,"Einstein

Line 3 : Cambridge",Cambridge

Line 4 : p2,Thomas Edison,30,"Edison

Line 5 : Cardiff",Cardiff

Ideally it should have been parsed as given below:

Line 1:

Id,Name,Age,FamilyName,Place

Line 2:

p1,Albert Einstein,25,"Einstein

Cambridge",Cambridge

Line 3:

p2,Thomas Edison,30,"Edison

Cardiff",Cardiff

I'm setting AllowQuotedRecordDelimiter to TRUE in the SelectObjectContentRequest as given in their documentation. It's still not working.

Does anyone know if Amazon S3 Select supports line break inside fields as described in the case mentioned above? Or any other parameters I need to change or set to make this work?

1

There are 1 answers

0
Aaron_H On

This is being parsed / printed correctly. The confusion lies in that the literal newline is being printed in the output. You can test this if you run the following expression on the given csv:

SELECT COUNT(*) from s3Object s

Output: 2

Note that if you specify only the third column, you get only the correct value:

SELECT s._3 frin s3Object s

You get only the parts of each line that enclose said field:

"Einstein
Cambridge"
"Edison
Cardiff"

What's happening is the character in the field is the same as the default CSVOutput.RecordDelimiter value (\n) which is causing a clash. If you want to separate each field in a different way, you could add the the following to the CSVOutput part of the OutputSerialization:

"RecordDelimiter": "\r\n"

or use some other type of 1-2 length character sequence in place of \r\n