How to convert parquet file to Avro file?


I am new to Hadoop and big data technologies. I would like to convert a Parquet file to an Avro file and read that data. I searched a few forums, and they suggested using AvroParquetReader.

import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetReader;

AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(file);
GenericRecord nextRecord = reader.read();

But I am not sure which dependency provides AvroParquetReader; I am not able to import it at all.
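For context, I believe the class lives in the parquet-avro artifact, so I tried adding something like this to my pom.xml (coordinates and version guessed on my part):

```xml
<!-- guessed coordinates: parquet-avro is the artifact that contains AvroParquetReader -->
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.8.1</version> <!-- illustrative version -->
</dependency>
```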

I can read this file using spark-shell and maybe convert it to some JSON, and then that JSON can be converted to Avro. But I am looking for a simpler solution.

1 Answer

Answered by Denny Lee:

If you are able to use Spark DataFrames, you can read the Parquet files natively in Apache Spark, e.g. (in Python pseudo-code):

df = spark.read.parquet(...) 

To save the files, you can use the spark-avro Spark Package. To write the DataFrame out as Avro, it would look something like:

df.write.format("com.databricks.spark.avro").save("...")

Don't forget that you will need to include the right version of the spark-avro Spark Package for the version of your Spark cluster (e.g. 3.1.0-s2.11 corresponds to spark-avro package 3.1 built with Scala 2.11, which matches the default Spark 2.0 cluster). For more information on how to use the package, please refer to https://spark-packages.org/package/databricks/spark-avro.
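Putting the read and write together, a minimal end-to-end sketch might look like the following. The paths, app name, and package version are illustrative, and it assumes the spark-avro package was supplied when launching Spark (e.g. via --packages):

```python
# Sketch: convert a Parquet file to Avro using Spark DataFrames.
# Assumes Spark was started with the spark-avro package on the classpath, e.g.:
#   spark-submit --packages com.databricks:spark-avro_2.11:3.2.0 convert.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-avro").getOrCreate()

# Read the Parquet data natively into a DataFrame
df = spark.read.parquet("/data/input.parquet")  # hypothetical input path

# Write the same DataFrame back out in Avro format via spark-avro
df.write.format("com.databricks.spark.avro").save("/data/output_avro")  # hypothetical output path

spark.stop()
```

Because the conversion goes through a DataFrame, no intermediate JSON step is needed.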

Some handy references include:

  1. Spark SQL Programming Guide
  2. spark-avro Spark Package.