Spark CSV data source unable to write leading or trailing control characters

val value: String = "\u0001" + "V1" + "\u0002"
val df = Seq(value).toDF("f1")
df.show

Now df has the proper value for field f1. But when writing with Spark's built-in CSV format using the code below, the ^A and ^B characters do not show up in the output.

df.write.format("csv").option("delimiter", "\t").option("codec", "bzip2").save("temp_out")

The output in temp_out does not contain any ^A or ^B characters for field f1.
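For reference, Spark's CSV writer also exposes write-side ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options (both default to true), so the trimming they do might be related; below is a sketch of the write with both disabled, which I have not confirmed fixes this (output path is illustrative):

df.write
  .format("csv")
  .option("delimiter", "\t")
  .option("codec", "bzip2")
  .option("ignoreLeadingWhiteSpace", "false")  // write-side option, defaults to true
  .option("ignoreTrailingWhiteSpace", "false") // write-side option, defaults to true
  .save("temp_out_no_trim")                    // illustrative output path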

Looking forward to some suggestions.

1 Answer

Answer from ELinda:

If Spark's save operation is dropping certain characters, you'll notice that when you open the CSV file(s), those bytes are missing. First, take a look at the bytes in value:

value.getBytes()    // Array[Byte] = Array(1, 86, 49, 2)
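To check what actually landed on disk, the CSV output can be read back with the plain text source and compared byte-for-byte; this is a sketch that assumes the bzip2-compressed part files are decompressed transparently by the Hadoop codec support:

spark.read.text("temp_out").take(1)(0).getString(0).getBytes()
// Array(1, 86, 49, 2) if the control characters survived; Array(86, 49) if they were trimmed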

saveAsTextFile has been around for a while, and is a bit more straightforward. If you can't get the CSV option to work, this is a good workaround.

df.rdd.map(_.mkString("\t")).saveAsTextFile("temp_out")
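Since the original write used bzip2, the same compression can be kept with saveAsTextFile by passing a Hadoop codec class; a small sketch:

import org.apache.hadoop.io.compress.BZip2Codec

// each Row is rendered with mkString, which does no CSV quoting or escaping,
// so this is fine for simple fields like the single column here
df.rdd.map(_.mkString("\t")).saveAsTextFile("temp_out", classOf[BZip2Codec])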

You'll probably still be able to read the file using the csv method from the reader, without any dropped characters, as below (but you'll want to confirm with your specific setup):

spark.read.option("delimiter", "\t").csv("temp_out/").take(1)(0).getString(0).getBytes()
// result is Array[Byte] = Array(1, 86, 49, 2)