Reading a Parquet file as-is using Spark


I want to read Parquet files as-is using Spark and process each file's content one by one. I was trying to achieve this with the following approach:

    spark.read
      .option("wholetext", "true")
      .option("compression", "none")
      .textFile("/test/part-00001-d3a107e9-ead6-45f0-bccf-fadcecae45bb-c000.zstd.parquet")

I also tried several similar approaches, but Spark seems to modify the file content somehow, presumably because of some option that is missing or implicitly applied when reading.

My final goal is to load those files into ClickHouse using the okhttp client in Scala. The file I am trying to load is not corrupted: ClickHouse processes it successfully when Spark is not involved. However, when I go through Spark with the approach above, ClickHouse responds with:

    std::exception. Code: 1001, type: parquet::ParquetException, e.what() = Couldn't deserialize thrift: TProtocolException: Invalid data

When I try to print out the content of whatever I read from the file, I see this:

    Europe/Moscoworg.apache.spark.version3.4.1)org.apache.spark.sql.parquet.row.metadata�{"type":"struct","fields":[{"name":"field1","type":"integer","nullable":true,"metadata":{}},{"name":"field2","type":{"type":"array","elementType":"integer","containsNull":true},"nullable":true,"metadata":{}},{"name":"field3","type":{"type":"struct","fields":[{"name":"x","type":"integer","nullable":true,"metadata":{}},{"name":"y","type":"integer","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]}org.apache.spark.legacyDateTimeJparquet-mr version 1.12.3 (build f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)L^PAR1

This doesn't look like it contains any of the actual data; it appears to be only the file's metadata.

My question is how to read the Parquet file as-is, as raw binary content, using Spark.


1 Answer

Answered by mamonu:

What if you try using spark.sparkContext.binaryFiles? textFile decodes the file's bytes as UTF-8 text, which corrupts arbitrary binary data, while binaryFiles hands you the raw bytes untouched:

    import org.apache.spark.input.PortableDataStream
    import org.apache.spark.rdd.RDD

    val files: RDD[(String, PortableDataStream)] =
      spark.sparkContext.binaryFiles("/path/to/parquet/files/")
    files.foreach { case (path, stream) =>
      // stream.toArray() reads the whole file as raw, unmodified bytes
      val bytes: Array[Byte] = stream.toArray()
      // Process or write the binary content to Clickhouse
    }
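
Once you have the raw bytes, you can post them straight to ClickHouse over its HTTP interface, which takes the data in the request body and the INSERT statement in the query string. Here is a minimal sketch with okhttp, assuming okhttp 3.x on the classpath; the table name my_table and the server URL are placeholders:

    import okhttp3.{MediaType, OkHttpClient, Request, RequestBody}

    val client = new OkHttpClient()

    def insertParquet(bytes: Array[Byte]): Unit = {
      // ClickHouse HTTP interface: the INSERT goes in the query string,
      // the raw Parquet bytes go in the POST body
      val body = RequestBody.create(
        MediaType.parse("application/octet-stream"), bytes)
      val request = new Request.Builder()
        .url("http://localhost:8123/?query=INSERT%20INTO%20my_table%20FORMAT%20Parquet") // placeholder host and table
        .post(body)
        .build()
      val response = client.newCall(request).execute()
      try {
        if (!response.isSuccessful())
          sys.error(s"ClickHouse insert failed: ${response.code()}")
      } finally response.close()
    }

    // collect the bytes to the driver and post them one file at a time;
    // fine for a modest number of files that fit in driver memory
    files.map { case (_, stream) => stream.toArray() }.collect().foreach(insertParquet)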

Another way would be to use Java's NIO API:

    import java.nio.file.{Files, Paths}

    // reads the whole file into memory as raw, unmodified bytes
    val bytes = Files.readAllBytes(Paths.get("/path/to/parquet/file"))
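
Note that Files.readAllBytes runs on the driver and needs a driver-local path, so it suits a single file you can reach directly; binaryFiles goes through Hadoop's filesystem layer, so it also works with HDFS or S3 paths and spreads the reads across executors. Either way the bytes arrive unmodified, which is what the ClickHouse Parquet parser needs.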