I'm working on a data lake solution for an IoT framework that does 44 kHz data acquisition for a few dozen sensors (~990,000 measurements/second).
I would like suggestions on building an efficient data ingestion solution using Java 11+, Apache Arrow and Apache Parquet.
For data ingestion I am currently using the AvroParquetWriter implementation from https://github.com/apache/parquet-mr, and I would like to partition the dataset by two fields: timestamp and sensor name.
I can't find any examples of creating partitioned datasets with this API.
I can switch away from AvroParquetWriter if necessary. Also, the solution does not need to support distributed or clustered processing; simply separating the partitions into different directories on the local filesystem is enough.
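To illustrate the directory layout I have in mind, here is a minimal sketch (names and the helper are made up, not part of any API) of Hive-style `key=value` partition paths, which query engines like DataFusion can interpret as partition columns:

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDate;

// Sketch: compute the target file path for one (date, sensor) partition.
// The idea would be to keep one Parquet writer open per such directory
// and route each incoming measurement to the matching writer.
public class PartitionPathBuilder {

    // Hypothetical helper: returns e.g.
    // base/date=2024-01-01/sensor=temp-01/part-0.parquet
    public static Path partitionFile(Path base, LocalDate date, String sensor, int part) {
        return base.resolve("date=" + date)        // first partition level (timestamp, truncated to day)
                   .resolve("sensor=" + sensor)    // second partition level (sensor name)
                   .resolve("part-" + part + ".parquet");
    }

    public static void main(String[] args) {
        Path p = partitionFile(Paths.get("/data/lake"),
                               LocalDate.of(2024, 1, 1), "temp-01", 0);
        System.out.println(p);
        // → /data/lake/date=2024-01-01/sensor=temp-01/part-0.parquet
    }
}
```

What I am unsure about is whether I should maintain this routing myself on top of AvroParquetWriter, or whether some existing writer API already handles it.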
By the way, I currently use DataFusion to query the datasets written by AvroParquetWriter. Data ingestion performance is satisfactory; my interest in partitioning is to improve query performance.
Regards