I see that Avro messages have the schema embedded, followed by the data in binary format. If multiple messages are sent and a new Avro file is created for every message, isn't the schema embedding an overhead? Does that mean it is always important for the producer to batch up messages before writing, so that multiple messages written into one Avro file carry just one schema? On a different note, is there an option to eliminate the schema embedding when serializing with the Generic/SpecificDatum writers?
Schema in Avro message
2.9k views · Asked by Roshan Fernando
There are 2 answers
Answer by sksamuel:
You are correct: there is an overhead if you write a single record with its schema. This may seem wasteful, but in some scenarios the ability to reconstruct a record from the data alone, using the embedded schema, is more important than the size of the payload.
Also take into account that even with the schema included, the data is encoded in a binary format, so it is usually smaller than JSON anyway.
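To see why a schema-driven binary encoding is compact even before compression, here is a minimal sketch. It uses Python's `struct` as a stand-in for the binary layer (Avro's actual encoding uses variable-length zig-zag integers, not fixed-width `struct` fields); the record and field names are hypothetical.

```python
import json
import struct

# A hypothetical record with an id and a temperature reading.
record = {"id": 123456, "temperature": 21.5}

# JSON repeats the field names and writes numbers as text in every message.
json_bytes = json.dumps(record).encode("utf-8")

# A schema-driven binary encoding (sketched with struct, not Avro's real
# varint format) writes only the values: a 4-byte int and an 8-byte double.
# Field names and types live in the schema, not in each message.
binary_bytes = struct.pack("<id", record["id"], record["temperature"])

print(len(json_bytes), len(binary_bytes))  # the binary form is far smaller
```

The gap grows with longer field names and more records, which is exactly why embedding the schema once per file (rather than per record) pays off.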
And finally, frameworks like Kafka can plug into a Schema Registry: rather than storing the schema with each record, they store a pointer to the schema.
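The pointer idea can be sketched in a few lines. This toy in-memory registry mirrors the shape of Confluent's wire format for Kafka (a magic byte followed by a 4-byte schema id before the payload); the function names and the registry itself are invented for illustration, not a real client API.

```python
import struct

# Toy in-memory "schema registry": schemas are stored once, and each
# message carries only a 5-byte header (magic byte + 4-byte schema id)
# instead of the full schema text.
registry = {}

def register(schema: dict) -> int:
    """Store a schema once and hand back a small integer id."""
    schema_id = len(registry) + 1
    registry[schema_id] = schema
    return schema_id

def encode(schema_id: int, payload: bytes) -> bytes:
    # 1 magic byte (0) + 4-byte big-endian schema id, then the binary payload.
    return struct.pack(">bI", 0, schema_id) + payload

def decode(message: bytes):
    _, schema_id = struct.unpack(">bI", message[:5])
    return registry[schema_id], message[5:]

schema_id = register({"type": "record", "name": "Reading",
                      "fields": [{"name": "t", "type": "double"}]})
msg = encode(schema_id, struct.pack(">d", 21.5))
schema, payload = decode(msg)
# Every consumer can recover the schema from the registry by id,
# so no message ever needs to carry the schema itself.
```

A real registry is a shared service with versioning and compatibility checks, but the per-message cost is the same order: a few header bytes rather than the whole schema.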
I am reading the following points from the Avro specification:
- You are not supposed to use a data serialization system if you want to write one new file for each new message; that defeats the purpose of serialization. In that case, you would want to separate the metadata from the data.
- There is no option to omit the schema while writing an Avro data file. That would violate the Avro specification.
IMO, there should be a balance when batching multiple messages into a single Avro file. Avro files should ideally be split to improve I/O efficiency; in the case of HDFS, the block size would be the ideal Avro file size.
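The amortization argument behind batching can be made concrete with a toy container format. This is not Avro's real object-container file (which also has a magic number, a metadata map, a codec, and sync markers between blocks); the schema and sizes here are assumptions for illustration only.

```python
import json
import struct

# A hypothetical one-field schema, serialized once as the file header.
schema = json.dumps({"type": "record", "name": "Reading",
                     "fields": [{"name": "t", "type": "double"}]}).encode()

def container_size(num_records: int) -> int:
    # Toy container: the schema written once, then fixed-size binary
    # records (one 8-byte double each). Avro's real container format
    # adds magic bytes, metadata, and sync markers, omitted here.
    return len(schema) + num_records * struct.calcsize(">d")

one_per_file = 1000 * container_size(1)  # 1000 files: schema repeated 1000x
batched      = container_size(1000)      # one file: schema written once

print(one_per_file, batched)  # batching pays the schema cost only once
```

With one record per file, the schema header dominates every file; batched into one file, each extra record costs only its 8 binary bytes. In practice you balance this against file sizes that suit your storage layer, e.g. the HDFS block size.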