I'm working on a data lake solution for an IoT framework that does 44 kHz data acquisition for a few dozen sensors (~990,000 measurements/second).
I would like suggestions on building an efficient data ingestion solution using Java 11+, Apache Arrow and Apache Parquet.
For data ingestion I am currently using the AvroParquetWriter implementation from https://github.com/apache/parquet-mr, and I would like to partition the dataset by two fields: timestamp and sensor name.
I can't find any examples of creating partitioned datasets with this API.
I can switch away from AvroParquetWriter if necessary. Also, the solution does not need to support distributed or clustered processing; simply separating the partitions into different directories on the local filesystem is enough.
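To illustrate the directory layout I have in mind, here is a minimal sketch (names and the helper are made up, not part of any API) of Hive-style `key=value` partition paths, which query engines like DataFusion can interpret as partition columns:

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDate;

// Sketch: compute the target file path for one (date, sensor) partition.
// The idea would be to keep one Parquet writer open per such directory
// and route each incoming measurement to the matching writer.
public class PartitionPathBuilder {

    // Hypothetical helper: returns e.g.
    // base/date=2024-01-01/sensor=temp-01/part-0.parquet
    public static Path partitionFile(Path base, LocalDate date, String sensor, int part) {
        return base.resolve("date=" + date)        // first partition level (timestamp, truncated to day)
                   .resolve("sensor=" + sensor)    // second partition level (sensor name)
                   .resolve("part-" + part + ".parquet");
    }

    public static void main(String[] args) {
        Path p = partitionFile(Paths.get("/data/lake"),
                               LocalDate.of(2024, 1, 1), "temp-01", 0);
        System.out.println(p);
        // → /data/lake/date=2024-01-01/sensor=temp-01/part-0.parquet
    }
}
```

What I am unsure about is whether I should maintain this routing myself on top of AvroParquetWriter, or whether some existing writer API already handles it.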
By the way, I currently use DataFusion to query the datasets written by AvroParquetWriter. Data ingestion performance is satisfactory; my interest in partitioning is to improve query performance.
Regards