I am writing a Spark DataFrame in Avro format to HDFS. I would like to split the large Avro output files so that each one fits within the Hadoop block size, while at the same time not being too small. Are there any DataFrame or Hadoop options for that? How can I split the files being written into smaller ones?
Here is the way I write the data to HDFS:
dataDF.write
  .format("avro")
  .option("avroSchema", parseAvroSchemaFromFile("/avro-data-schema.json").toString)
  .save(dataDir)