Spark with Avro, Kryo and Parquet


I'm struggling to understand what exactly Avro, Kryo and Parquet do in the context of Spark. They are all related to serialization, but I've seen them used together, so they can't all be doing the same thing.

Parquet describes itself as a columnar storage format and I kind of get that, but when I'm saving a Parquet file can Avro or Kryo have anything to do with it? Or are they only relevant during the Spark job, i.e. for sending objects over the network during a shuffle or spilling to disk? How do Avro and Kryo differ, and what happens when you use them together?


There are 2 answers

Dean Wampler

This very good blog post explains the details of everything except Kryo:

http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/

Kryo is used for fast serialization of data that does not need permanent storage, such as shuffle data and cached data, whether it is held in memory or spilled to disk as temporary files.
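
For reference, here is a minimal sketch of how Kryo is typically switched on in Spark; the MyRecord class is purely illustrative:

    import org.apache.spark.SparkConf

    // Hypothetical record type, used only to show class registration.
    case class MyRecord(id: Long, name: String)

    // Switch Spark's internal serializer (shuffle blocks, cached partitions,
    // spills to local disk) from Java serialization to Kryo.
    val conf = new SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes avoids writing the full class name with every record.
      .registerKryoClasses(Array(classOf[MyRecord]))

Nothing serialized this way becomes a file format you read back later; it only affects how Spark moves and spills data while a job is running.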

kostya

Parquet works very well when you only need to read a few columns when querying your data. However, if your schema has lots of columns (30+) and your queries/jobs need to read all of them, then record-based formats (like AVRO) will be faster.
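
To make the column-pruning point concrete, here is a small sketch of reading just two columns from a Parquet file with Spark SQL (the path and column names are made up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parquet-pruning").getOrCreate()

    // Only "user_id" and "amount" are actually read from disk; Parquet's
    // columnar layout lets Spark skip every other column in the file.
    val df = spark.read
      .parquet("/data/events.parquet")   // made-up path
      .select("user_id", "amount")       // made-up column names

    df.show()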

Another limitation of Parquet is that it is essentially a write-once format, so you usually need to collect data in some staging area and write it out to a Parquet file periodically (once a day, for example).

This is where you might want to use AVRO. For example, you can collect AVRO-encoded records in a Kafka topic or in local files and have a batch job that converts all of them to a Parquet file at the end of the day. This is fairly easy to implement thanks to the parquet-avro library, which provides tools to convert between the AVRO and Parquet formats automatically.
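
A rough sketch of that conversion step using parquet-avro's AvroParquetWriter (the schema, field names and output path here are invented for illustration):

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.avro.AvroParquetWriter

    // Invented schema standing in for whatever your AVRO-encoded records use.
    val schema = new Schema.Parser().parse(
      """{"type":"record","name":"Event","fields":[
        |  {"name":"id","type":"long"},
        |  {"name":"payload","type":"string"}]}""".stripMargin)

    // One Parquet file written from AVRO records, e.g. at the end of the day.
    val writer = AvroParquetWriter
      .builder[GenericRecord](new Path("/warehouse/events.parquet"))
      .withSchema(schema)
      .build()

    val record: GenericRecord = new GenericData.Record(schema)
    record.put("id", 1L)
    record.put("payload", "hello")

    writer.write(record)   // repeat for every staged record
    writer.close()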

And of course you can use AVRO outside of Spark/Big Data. It is a fairly good serialization format, similar to Google Protobuf or Apache Thrift.
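
For example, a small standalone sketch of encoding and decoding a record with plain Avro, no Spark involved (schema and field names are made up):

    import java.io.ByteArrayOutputStream

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
    import org.apache.avro.io.{DecoderFactory, EncoderFactory}

    // Made-up single-field schema, just to show the round trip.
    val schema = new Schema.Parser().parse(
      """{"type":"record","name":"User","fields":[{"name":"name","type":"string"}]}""")

    val user: GenericRecord = new GenericData.Record(schema)
    user.put("name", "alice")

    // Serialize to a compact binary byte array.
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](schema).write(user, encoder)
    encoder.flush()
    val bytes = out.toByteArray

    // Deserialize it back with the same schema.
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    val decoded = new GenericDatumReader[GenericRecord](schema).read(null, decoder)
    println(decoded.get("name"))   // alice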