I want to read parquet files as-is using Spark and process each file's content one by one. I was trying to achieve this with the following approach:
spark.read
.option("wholetext", "true")
.option("compression", "none")
.textFile("/test/part-00001-d3a107e9-ead6-45f0-bccf-fadcecae45bb-c000.zstd.parquet")
I have also tried several similar approaches, but Spark seems to be modifying the file content somehow, probably because some option is missing when reading it.
My final goal is to load those files into ClickHouse using the okhttp client in Scala. The file I am trying to load is not corrupted: ClickHouse processes it successfully when Spark is not involved.
However, when I go through Spark with the approach above, ClickHouse responds with: std::exception. Code: 1001, type: parquet::ParquetException, e.what() = Couldn't deserialize thrift: TProtocolException: Invalid data
When I print out the content of whatever I read from the file, I see this:
Europe/Moscoworg.apache.spark.version3.4.1)org.apache.spark.sql.parquet.row.metadata�{"type":"struct","fields":[{"name":"field1","type":"integer","nullable":true,"metadata":{}},{"name":"field2","type":{"type":"array","elementType":"integer","containsNull":true},"nullable":true,"metadata":{}},{"name":"field3","type":{"type":"struct","fields":[{"name":"x","type":"integer","nullable":true,"metadata":{}},{"name":"y","type":"integer","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]}org.apache.spark.legacyDateTimeJparquet-mr version 1.12.3 (build f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)L^PAR1
which doesn't look like it contains any actual data; it appears to hold only the footer metadata.
My question is: how can I read a parquet file as-is, as raw binary content, using Spark?
What if you try using spark.sparkContext.binaryFiles (BinaryFiles code)? It hands you each file's bytes untouched, which you can then post to ClickHouse, as in the sketch below.
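A minimal, untested sketch of that idea; the ClickHouse URL, table name, and glob path are placeholders you would replace, and the RequestBody.create argument order assumes okhttp 3.x (in 4.x it is reversed):

spark.sparkContext.binaryFiles returns an RDD of (path, PortableDataStream) pairs, and PortableDataStream.toArray() gives the raw bytes of each file without any parsing by Spark.

```scala
import okhttp3.{MediaType, OkHttpClient, Request, RequestBody}

// Placeholder URL and table name for illustration only.
val clickhouseUrl =
  "http://localhost:8123/?query=INSERT%20INTO%20my_table%20FORMAT%20Parquet"

val client = new OkHttpClient()

spark.sparkContext
  .binaryFiles("/test/*.zstd.parquet")         // RDD[(String, PortableDataStream)]
  .collect()                                   // assumes a small number of files; otherwise use foreachPartition
  .foreach { case (path, stream) =>
    val bytes: Array[Byte] = stream.toArray()  // raw file content, exactly as stored

    val body = RequestBody.create(MediaType.parse("application/octet-stream"), bytes)
    val request = new Request.Builder().url(clickhouseUrl).post(body).build()

    val response = client.newCall(request).execute()
    println(s"$path -> HTTP ${response.code()}")
    response.close()
  }
```

Collecting to the driver keeps the okhttp call simple; if the files are large or numerous, you would instead run the HTTP calls inside foreachPartition on the executors.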
Another way would be to use Java's NIO API:
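For example, a small sketch that reads the file from the question directly, bypassing Spark entirely:

```scala
import java.nio.file.{Files, Paths}

// Reads the parquet file byte-for-byte as it sits on disk; nothing is parsed or modified.
val bytes: Array[Byte] = Files.readAllBytes(
  Paths.get("/test/part-00001-d3a107e9-ead6-45f0-bccf-fadcecae45bb-c000.zstd.parquet")
)

// `bytes` can then be sent to ClickHouse unchanged, e.g. via okhttp as above.
```

Note that java.nio only sees filesystems the JVM can reach directly (local disk, mounted volumes); for HDFS or S3 paths you would stay with binaryFiles or the Hadoop FileSystem API.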