I am new to Hadoop and big data technologies. I would like to convert a Parquet file to an Avro file and read that data. I searched a few forums, and they suggested using AvroParquetReader.
import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetReader;
AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(file);
GenericRecord nextRecord = reader.read();
But I am not sure how to include AvroParquetReader in my project; I am not able to import it at all.
I can read this file using spark-shell and maybe convert it to JSON, and then that JSON could be converted to Avro. But I am looking for a simpler solution.
If you are able to use Spark DataFrames, you can read the Parquet files natively in Apache Spark, e.g. (in Python pseudo-code):
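# read the Parquet file(s) into a DataFrame; "spark" is the session
# provided by spark-shell / pyspark, and the path is a placeholder
df = spark.read.parquet("...")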
To save the files, you can use the spark-avro Spark Package. To write the DataFrame out as Avro, it would be something like:
df.write.format("com.databricks.spark.avro").save("...")
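Since you also want to read the converted data back, the same package can load Avro files into a DataFrame. A minimal sketch, again with a placeholder path:
# read the Avro output back via the spark-avro data source
df = spark.read.format("com.databricks.spark.avro").load("...")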
Don't forget that you will need to include the right version of the spark-avro Spark Package for your version of your Spark cluster (e.g. 3.1.0-s2.11 corresponds to spark-avro package 3.1 using Scala 2.11, which matches the default Spark 2.0 cluster). For more information on how to use the package, please refer to https://spark-packages.org/package/databricks/spark-avro.
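For example (a sketch, assuming a Spark 2.0 / Scala 2.11 cluster, so that the 3.1.0 artifact is the right one for you), the package can be pulled in at launch time with the --packages flag:
# launch pyspark (or spark-shell / spark-submit) with spark-avro on the classpath
pyspark --packages com.databricks:spark-avro_2.11:3.1.0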