We are trying to read a 3 GB CSV file in which one column contains multiple newline characters, using spark-csv with the univocity 1.5.0 parser. In some rows the record is split at those embedded newline characters, so a single record ends up spread across several rows with the wrong columns. This only happens with large files.
We are using Spark 1.6.1 and Scala 2.10.
This is the code I am using to read the file:
sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("mode", "FAILFAST")
  .option("escape", "\"")
  .option("quote", "\"")
  .option("parserLib", "univocity")
  .load("abc.csv")
The read fails with the following error:
java.lang.exception: FAILFAST at 01/20/2015 .
Sample file:
"A AAAAAAAA","AA999","AA999","AA999","9999-99-99-99.99.99.999999","AAAAAA99","Aaaaa Aaaaaaaa
99/99/9999 - AAA Aaaaaaa Aa: aaaaaaaaa aa A aaaaa, aaaaaaaa aaa aaaaaaa aaaaaaaaaa
Aaa aaaaa aa AAA aaa aaaaaaaaaaa
99/99/9999 Aaaaa aaaaaa - aa aaaaaaaa aaaaaaaaa aaaaaaaa aaaaa aaa aaaaaa aa aaaaaaaaaa aaaaaa aa aaaaaaa aaaaaaaaa.
99/99/9999 Aaa'a aaaaaa a/ aaa aaaaaaa - AAA aaaaaaaaa aaa'a aaaaaaa
99/99/9999 AAA aaaaaa - aaaaaaa aaaaaaaaa
99/99/9999 AAA aaaaaa. Aaa aaaa Aa. Aaaaaa Aa: aaaaaaaaa aaaaaaaa aaaaaa, A aaaaaaa aaaa aaaaaaaaaa, aaaaa aaaaaaa aaaa aaaaaaaaaa (aaaa aaaaaaaaaaaa aaaaaaa). A&Aa aaaaaa aa aaaaaaaaaa aaa aaaa aaaaaa aaaa aaaaa aa aaaaaaaaa, A aaaaaaaa aaaaa aaa aaaaa aaaaaaaa aaaaa aaaa aaaaa aa aaaaaaaaa. Aaa aaaaaa aaaaaa aaaaaa aaaa aaaaaa.
99/99/9999 - aaaaa aaaaaaaa.
99/99/9999 - AAA
99/99/9999 AAA aaaaaa aaaaa aa Aaa 9999 aaaa aaaaaaaaa aaaaaaaaaa - aa A&Aa. Aaaaaaaaaa aaaaa aaaaaa.
99/99/9999 AAA aaaaa aaaaaa - aa aaaaaaa aa aaaaa aaaaaa aa AAA aa AAA aaa aa aaaaaa aaaaaa aaaa-aaaaaaaaaaa. Aa aaaaaaaa aa aaaaaa A&Aa aaaaa aa aaaaa aaaaaaa.
99/99/9999 - Aaaaaa aaaaaa aaaa. Aaaaaaaa aaaa aaaa 99/99/9999 - 99/99/9999
99/99/9999 - aaaaaa aaaaaaa aa AAAA aa: AAAA aaaaa aaaa aaaaaa aaaa aaaa aaaaa aa aaa aaaaaaaaa.
99/99/9999 Aaaaaa a/ aaa aaaaaaa. Aaaa aaaaaaaa aa aaaaaaaaaaaa aa AA.
99/99/9999 Aaaaaa aaaaaa aaaaaa aaaa.
99/99/9999 Aaaaaaaa aaaaaa aa aaaaaa aaaa
99/99/9999 Aaaaaa a/ aaa aaaaaaa aaa'a aaaaaaaaa aaaaaaaaaaa aaaaaaa
99/99/9999 AAA aaaaaa A&Aa aaaaaa aaa aaaaaaaaaaaaaa aaa aaaaa aaaaaa
99/99/9999 AAA aaaaa aaaaaa - aaaaa aaaaaaaaaaaaaaa aaa aaaaaaaaaaaa aa aaaaaaaaaaa. Aaa aaaaaa aaaaaaaaa aaaaaaaa aaaaaaaaa aaaaaaaa aaa aaaa aaaaaa aa aaaaaa aaaaaa aaaaa aaaa aa aaaaaa aaa aaaaaaaa aaaaaaaaa A&Aa aaa aaaaaaaaa, aaaaaaaaa aaaaa aaaaaaaaa.
99/99/9999 AAA aaaaaa aaaaaaa aaaa aaaaaa aa Aaa 9. A&Aa aaaaaa aa aaaaa aaaaa aaaa aaaaaaaa, aaaaaaaaaa aaaa aaaaaaaa aaa aaaa aaaaa aaaaaaaa aaaaaa.
99/99/9999 AAA - aaaaaaaaaaa aaaaaaaaaa.
AAA aaaaaaaaa aaaaaaaaaa aaaaaaa aaaa aaaaaaaaaaaa aaaaa aa aaaa aaaaaa aa aaaaaaa aa aaaaa aaaaaaaaa aaaaa aa aaaaaaaaaaa aa aaaa.
99/99/9999 AAA aaaaa aaaaaa - Aaaaaaaaaaaa aaaaaa aa 99/99/9999 aaaaaa aaaa aaaa aaaaa aaa aaaaaaaaaa a/ aaaaaaaaa aaaaaaaaa aaaaaaaa. Aaa aaaaaaaaaaaa aaaa aa 99/99/9999 aaa aaa aaaaaaaaaaa aaaaaaaaaaaaa aaaaa 99/99/9999 aaaa aaa aaaaaaa aa aaaaaaaaa aaaaaaaa, aaaaaa AAA aa aaaaa aaaaaaaaa aa aa 99. Aa aaa aaaaaaa aa aaaaaaaaa aaaaaaaa, aaa aaaaaaaaaa aaaaaaaa aaaaa aaaa aaaaaaaaaaa aaaa aaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa.
99/99/9999 AAA aaaa aaaaa - AAA aaaaaaa aaaa A&Aa aaaaaaaaaa aa aaa aaaaaaaaaaaa aaaaa aaaa aaaa aaaaaaa aa Aaaaa 9999.
Aaaaaaaaa aaaaaa aa aaaaa aa aa Aaa 9, 9999 aaa aaaaaaa aaaaaaaa aaaaa aaa aaaaaaaa aaaa Aa. Aaaaaaaa aaa aaaa aaaaaa aa aaaaaaa aaaaaa aa A&Aa aaa aaaaaaaa aaaaaa aaaa aaaa. Aaaa aa aaaaaaa aaa aaaaa aa aaaaaaaaaa aaaa aaa aaaaaaaaaa aa aaaaa aa aaaaaaaaaa aaaaa aa aaaaaaaaaaaa.
99/99/9999 Aaaaaaa aaa'a aaaa AA
99/99/9999 - a/a aaaa aa aaaaaaaaaaaa
99/99/9999 Aaaaaaa aaa'a aaaa aaaaaaaaaaaa
99/99/9999 - aaaa aaaaaa aa aaaaaaaaaaaa aaaaaaaa aaa aaa aaaaaaaaaa 99/99/9999 - aaa aaaa aa aaaaaaaaaaaa aaaaaa aaa aaaaaaaaaaaa aaaa aaaa aaaaa aaaa aaaa aaa Aaa 99, 9999 aaaaa aaa aaa aaaaaaaaaa
99/99/9999 - aaaa aaa'a aaaa aaaaaaaaaaaa aaaaaaaa aa aaaa aaaa aaaaaaa aaaaaaaaaaa 99/99/9999 - aaaa aaaaaa aa aaaaaaaaaaaa aa: a/a aaaa aa aa aaaa. Aaaaaaaaa aaaaaaa aaa aaaaaa aaaa aaa aaaaaaaaaaa aaa aaa aaaaaaa aaa aa aaa aaaaaa aa aaaaa. aaa aaaa aaa aaaa aaaaa aaaaa aaaaaaaa aaa aaaa Aaaa aaa aaaa aa Aaaaaaaaa. Aaaa aaa aa aaaaa a/a aaaaa aaaaa. Aaa aaaaaa aa aaaa aaaaa aaaaa.
99/99/9999 - Aaaaa AAA aaaaaa aaaaaaaa. Aaaaaaaaa aaaa aaaa aaaaa aaaaaaa aaaaaaaa aaa Aaaaa Aaaaaaaaaa Aaaaaaaa, aaaaaaa, aaaaa aa a aaa aa aaaa aaaa aaaaaaa aa aaaaaaaa aa aaaaaaa, aaaa aaaaa, aaa aaaaaa, aaaa aa aaaaaaaa, aaaa aa aaaaaaaaaa, aaaaaaa aaaaa aaaaaa. Aaaaa aaa aaaaa aaaa aaaaaaa aaaaaaaa aa aaa aaaaaaaaaa aaaaaaaaaaa aa aaaaaaaa aaaaaaaaa aaaaaaa (aaaaa aa aaaaaaaaaa aa Aaaa 9999). Aaaa aa aaaaa aa aaaa aa aaaaaa aa aaaa. Aaa aaaaaaaa aaa aaaaaaaaaa aa a aaaaaaaaaa aa aaaaaaaa aaaaaaaa, aaaaaa aaaaa aa aaa aaaaaa aaaaaaaaaaa aaaaa aaa aaaaaaaa aa aaa aaaaaaaa aaaaa aa Aaa 9999 aa aaa aaaaaaa aa aaaaaaa aa aaaaaaa aaaaaaaa. Aa Aaa 9999, Aa. Aaaaaaaa aaaaa aaaaaaaaaa aaa aaaaaaaa aaaaaaaa, aaa aa aaaa aaa aaaaaaa aa aaaa aaa aa aaa aaaaaaaa. Aa aa aaaaa aa Aa. A aaaa aaaaaaaaaa aaaaaaaa aaaaaaaaa aaaa aaaa. Aaa A/Aa aaa aaaaa aaaaa Aaa 9999 aaaaa aaaa aaaaaaaa aaaa aa aaaaaaaaaa, aaaa aa aaaaaaaaaaaaa aaa aaaaaaaaa, aaaaaaa, aaaaaaaaa, aaaaaaaaa aaaa, aaaaaaaaaaaaa. Aaaaaaaaa: Aaaaa aaa aaaaaaaa aa aaaaaaa aa aaaa aaaaa, aaaaaaa aaaa aaa aa-aaaaaa aaaaaaa aaaaaaaaa aaa aaaa aa aaaa aaaaaaaa aa aaaaa aaaaaaaaa aaaaaaa aa aaaa-aaaaaaaaaa aaaaaaaaaa, aaa aaaaaaaaa aaaaaaa aaaa. "
Spark's CSV relation is based on its TextBasedFileFormat and only looks at the input on a line-by-line basis, so it does not support multi-line records. If you need to support multi-line records, you can look at using wholeTextFiles instead and parsing the input manually (but ideally this should be done as a pre-processing data cleanup job).
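As a rough illustration of the pre-processing route, here is a minimal sketch in plain Scala (no Spark dependency). It streams the input character by character and replaces line breaks that fall inside double-quoted fields, so that every record ends up on a single line. The object name FlattenQuotedNewlines, its methods, and the choice of a space as the replacement are assumptions made for this example, not part of spark-csv or univocity. It also assumes the quote character is " and that embedded quotes are escaped by doubling them, which matches the options used in the question.

```scala
import java.io.{BufferedReader, BufferedWriter, Reader, StringReader, StringWriter, Writer}

// Hypothetical helper for illustration only; not part of spark-csv or univocity.
object FlattenQuotedNewlines {

  /** Copies `in` to `out`. Inside a double-quoted field, '\n' is replaced by
    * `replacement` and '\r' is dropped; everything else is copied unchanged.
    * A doubled quote ("") toggles the flag twice, so the state is preserved.
    */
  def flatten(in: Reader, out: Writer, replacement: String = " "): Unit = {
    val reader = new BufferedReader(in)
    val writer = new BufferedWriter(out)
    var insideQuotes = false
    var c = reader.read()
    while (c != -1) {
      val ch = c.toChar
      if (ch == '"') {
        insideQuotes = !insideQuotes
        writer.write(c)
      } else if (insideQuotes && ch == '\n') {
        writer.write(replacement) // embedded line break: keep the record on one line
      } else if (insideQuotes && ch == '\r') {
        // drop the carriage return of an embedded "\r\n"
      } else {
        writer.write(c)
      }
      c = reader.read()
    }
    writer.flush()
  }

  /** Convenience wrapper for small inputs, e.g. for trying the logic out. */
  def flattenString(text: String): String = {
    val out = new StringWriter()
    flatten(new StringReader(text), out)
    out.toString
  }
}
```

For a real file, flatten could be called with a FileReader and a FileWriter, and the cleaned output could then be loaded with the same sqlContext.read call as in the question. One caveat on the wholeTextFiles route: it returns each file as a single in-memory string, so a 3 GB file is unlikely to fit and would probably have to be split at record boundaries first, which is another reason to prefer a streaming cleanup step like this one.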