Spark CSV datasource unable to write leading or trailing control character

Question

126 views Asked by Amiya Mishra At 06 September 2020 at 13:03

val value:String = "\u0001"+ "V1" + "\u0002"
val df  = Seq((value)).toDF("f1")
df.show

Now df holds the proper value for field f1. But when I write it with Spark's built-in CSV format using the code below, the ^A and ^B characters do not appear in the output.

df.write.format("csv").option("delimiter", "\t").option("codec", "bzip2").save("temp_out")

Here the temp_out output does not show any ^A or ^B character for field f1.

Looking forward to some suggestions.

There is 1 answer

**ELinda** · Answer 1 · 2020-09-07T00:49:41+00:00

If Spark's save operation is dropping certain characters, you'll notice that when you open the CSV file(s), those bytes are missing. First, take a look at the bytes in value:

value.getBytes()    // Array[Byte] = Array(1, 86, 49, 2)
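Raw byte arrays are hard to eyeball once the strings get longer, so a small helper that prints control characters in caret notation (the same ^A, ^B form used in the question) can make the check easier. This is only a sketch in plain Scala. The object and method names are made up for illustration and are not part of Spark.

```scala
// Hypothetical helper, not a Spark API: makes ASCII control characters visible.
object ShowControlChars {

  // Render control characters in caret notation: 0x01 -> ^A, 0x02 -> ^B, 0x7f -> ^?
  def visible(s: String): String =
    s.map { c =>
      if (c < 0x20) "^" + (c + 0x40).toChar
      else if (c == 0x7f) "^?"
      else c.toString
    }.mkString

  def main(args: Array[String]): Unit = {
    val value = "\u0001" + "V1" + "\u0002"
    println(visible(value)) // prints ^AV1^B
  }
}
```

Applying visible to a string read back from the written file shows at a glance whether the leading and trailing characters survived.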

saveAsTextFile has been around for a while, and is a bit more straightforward. If you can't get the CSV option to work, this is a good workaround.

df.rdd.map(_.mkString("\t")).saveAsTextFile("temp_out")

You'll probably still be able to read the file using the csv method from the reader, without any dropped characters, as below (but you'll want to confirm with your specific setup):

spark.read.option("delimiter", "\t").csv("temp_out/").take(1)(0).getString(0).getBytes()
// result is Array[Byte] = Array(1, 86, 49, 2)
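If you would rather stay with the CSV writer, one thing worth trying is turning off whitespace trimming. The CSV data source documents the ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options as defaulting to true for writing. The sketch below assumes that the underlying writer counts control characters such as ^A and ^B as whitespace when trimming, which would explain the missing bytes. That assumption is not confirmed in this thread, so verify it on your Spark version. The object name, method name, and local master setting are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch, not a confirmed fix: write with trimming disabled, then read back.
object WriteControlChars {

  // Writes a single value wrapped in control characters and returns the bytes read back.
  def roundTrip(spark: SparkSession, path: String): Array[Byte] = {
    import spark.implicits._

    val value = "\u0001" + "V1" + "\u0002"
    val df = Seq(value).toDF("f1")

    df.write
      .format("csv")
      .option("delimiter", "\t")
      .option("codec", "bzip2")
      // Assumption: the default trimming on write is what strips ^A and ^B.
      .option("ignoreLeadingWhiteSpace", "false")
      .option("ignoreTrailingWhiteSpace", "false")
      .mode("overwrite")
      .save(path)

    spark.read
      .option("delimiter", "\t")
      .csv(path)
      .take(1)(0)
      .getString(0)
      .getBytes("UTF-8")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this check.
    val spark = SparkSession.builder().master("local[1]").appName("ctrl-chars").getOrCreate()
    println(roundTrip(spark, "temp_out").mkString(", "))
    spark.stop()
  }
}
```

If the printed bytes start with 1 and end with 2, the options were enough. If not, the saveAsTextFile route above remains the fallback.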