How to get an efficient data ingestion solution using Java, Apache Arrow and Apache Parquet


I'm working on a data lake solution for an IoT framework that acquires data at 44 kHz from a few dozen sensors (~990,000 measurements/second).

I would like suggestions on how to build an efficient data ingestion solution using Java 11+, Apache Arrow and Apache Parquet.

For data ingestion I currently use the AvroParquetWriter implementation from https://github.com/apache/parquet-mr, and I would like to partition the dataset by two fields: timestamp and sensor name.
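For reference, my current (unpartitioned) writer setup looks roughly like the sketch below; the schema and output path are simplified placeholders:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class IngestExample {
    // Simplified measurement schema: sensor name, epoch-millis timestamp, value.
    static final Schema SCHEMA = SchemaBuilder.record("Measure").fields()
            .requiredString("sensor")
            .requiredLong("timestamp")
            .requiredDouble("value")
            .endRecord();

    public static void main(String[] args) throws Exception {
        try (ParquetWriter<GenericRecord> writer =
                AvroParquetWriter.<GenericRecord>builder(new Path("file:///data/lake/measures.parquet"))
                        .withSchema(SCHEMA)
                        .withCompressionCodec(CompressionCodecName.SNAPPY)
                        .build()) {
            GenericRecord record = new GenericData.Record(SCHEMA);
            record.put("sensor", "temp-01");
            record.put("timestamp", System.currentTimeMillis());
            record.put("value", 21.5);
            writer.write(record);
        }
    }
}
```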

I'm not finding examples of creating partitioned datasets in this API.

I'm open to switching away from AvroParquetWriter. Furthermore, the solution does not need to support distributed clustered processing; just separating the partitions into different directories on the local filesystem is enough, as in the sketch below.
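Absent a built-in partitioned writer, what I have in mind is something like the following: keep one AvroParquetWriter per open partition and route each record into a Hive-style directory (`sensor=.../ts_hour=...`). The directory layout, the hourly bucketing and the close policy are my own assumptions, not anything provided by the parquet-mr API:

```java
import java.io.IOException;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class PartitionedIngest implements AutoCloseable {
    // Hypothetical layout: /data/lake/sensor=<name>/ts_hour=<yyyy-MM-dd-HH>/part-0.parquet
    private static final String ROOT = "file:///data/lake";
    private static final DateTimeFormatter HOUR_FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd-HH").withZone(ZoneOffset.UTC);

    private final Schema schema;
    private final Map<String, ParquetWriter<GenericRecord>> writers = new HashMap<>();

    public PartitionedIngest(Schema schema) {
        this.schema = schema;
    }

    public void write(GenericRecord record) throws IOException {
        // Derive the partition key from the two partition fields.
        String sensor = record.get("sensor").toString();
        long tsMillis = (Long) record.get("timestamp");
        String hour = HOUR_FMT.format(Instant.ofEpochMilli(tsMillis));
        String partition = "sensor=" + sensor + "/ts_hour=" + hour;

        ParquetWriter<GenericRecord> writer = writers.get(partition);
        if (writer == null) {
            // First record for this partition: open a writer in its own directory.
            Path path = new Path(ROOT + "/" + partition + "/part-0.parquet");
            writer = AvroParquetWriter.<GenericRecord>builder(path)
                    .withSchema(schema)
                    .withCompressionCodec(CompressionCodecName.SNAPPY)
                    .build();
            writers.put(partition, writer);
        }
        writer.write(record);
    }

    @Override
    public void close() throws IOException {
        for (ParquetWriter<GenericRecord> w : writers.values()) {
            w.close();
        }
        writers.clear();
    }
}
```

One obvious concern with this sketch is the number of writers left open at once with dozens of sensors, so some rotation/close policy would be needed; that is part of what I'm asking about.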

By the way, I currently use DataFusion to query the datasets written by AvroParquetWriter. Ingestion performance is satisfactory; my interest in partitioning the data is to improve query performance.

Regards
