I see that Avro messages have the schema embedded, followed by the data in binary format. If multiple messages are sent and a new Avro file is created for every message, isn't the schema embedding an overhead? Does that mean it is always important for the producer to batch up messages before writing, so that multiple messages written into one Avro file carry just one schema? On a different note, is there an option to eliminate the schema embedding when serializing with the Generic/SpecificDatum writers?
Schema in Avro message
2.9k views · Asked by Roshan Fernando
There are 2 answers
Answer by sksamuel:
You are correct: there is an overhead if you write a single record with its schema. This may seem wasteful, but in some scenarios the ability to reconstruct a record from the data alone, using the embedded schema, is more important than the size of the payload.
Also take into account that even with the schema included, the data is encoded in a binary format, so it is usually smaller than JSON anyway.
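To see why a schema-driven binary encoding is compact even before compression, here is a minimal sketch. It uses Python's `struct` as a stand-in for the binary layer (Avro's actual encoding uses variable-length zig-zag integers, not fixed-width `struct` fields); the record and field names are hypothetical.

```python
import json
import struct

# A hypothetical record with an id and a temperature reading.
record = {"id": 123456, "temperature": 21.5}

# JSON repeats the field names and writes numbers as text in every message.
json_bytes = json.dumps(record).encode("utf-8")

# A schema-driven binary encoding (sketched with struct, not Avro's real
# varint format) writes only the values: a 4-byte int and an 8-byte double.
# Field names and types live in the schema, not in each message.
binary_bytes = struct.pack("<id", record["id"], record["temperature"])

print(len(json_bytes), len(binary_bytes))  # the binary form is far smaller
```

The gap grows with longer field names and more records, which is exactly why embedding the schema once per file (rather than per record) pays off.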
And finally, frameworks like Kafka can plug into a Schema Registry: rather than storing the schema with each record, they store a pointer to the schema.
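The pointer idea can be sketched in a few lines. This toy in-memory registry mirrors the shape of Confluent's wire format for Kafka (a magic byte followed by a 4-byte schema id before the payload); the function names and the registry itself are invented for illustration, not a real client API.

```python
import struct

# Toy in-memory "schema registry": schemas are stored once, and each
# message carries only a 5-byte header (magic byte + 4-byte schema id)
# instead of the full schema text.
registry = {}

def register(schema: dict) -> int:
    """Store a schema once and hand back a small integer id."""
    schema_id = len(registry) + 1
    registry[schema_id] = schema
    return schema_id

def encode(schema_id: int, payload: bytes) -> bytes:
    # 1 magic byte (0) + 4-byte big-endian schema id, then the binary payload.
    return struct.pack(">bI", 0, schema_id) + payload

def decode(message: bytes):
    _, schema_id = struct.unpack(">bI", message[:5])
    return registry[schema_id], message[5:]

schema_id = register({"type": "record", "name": "Reading",
                      "fields": [{"name": "t", "type": "double"}]})
msg = encode(schema_id, struct.pack(">d", 21.5))
schema, payload = decode(msg)
# Every consumer can recover the schema from the registry by id,
# so no message ever needs to carry the schema itself.
```

A real registry is a shared service with versioning and compatibility checks, but the per-message cost is the same order: a few header bytes rather than the whole schema.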
I am reading the following points from the Avro specification:
- You are not supposed to use a data serialization system if you want to write one new file for each new message; that defeats the purpose of serialization. In that case, you would want to separate the metadata from the data.
- There is no option to omit the schema while writing an Avro data file. That would violate the Avro specification.
IMO, there should be a balance when batching multiple messages into a single Avro file. Avro files should ideally be split to improve I/O efficiency; in the case of HDFS, the block size would be the ideal Avro file size.
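The amortization argument behind batching can be made concrete with a toy container format. This is not Avro's real object-container file (which also has a magic number, a metadata map, a codec, and sync markers between blocks); the schema and sizes here are assumptions for illustration only.

```python
import json
import struct

# A hypothetical one-field schema, serialized once as the file header.
schema = json.dumps({"type": "record", "name": "Reading",
                     "fields": [{"name": "t", "type": "double"}]}).encode()

def container_size(num_records: int) -> int:
    # Toy container: the schema written once, then fixed-size binary
    # records (one 8-byte double each). Avro's real container format
    # adds magic bytes, metadata, and sync markers, omitted here.
    return len(schema) + num_records * struct.calcsize(">d")

one_per_file = 1000 * container_size(1)  # 1000 files: schema repeated 1000x
batched      = container_size(1000)      # one file: schema written once

print(one_per_file, batched)  # batching pays the schema cost only once
```

With one record per file, the schema header dominates every file; batched into one file, each extra record costs only its 8 binary bytes. In practice you balance this against file sizes that suit your storage layer, e.g. the HDFS block size.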