I have a Parquet dataset partitioned by year, and I am trying to iterate over the year partitions and run an aggregation on each one, instead of loading the entire dataset into memory at once.
Each fragment returned by the get_fragments() method typically represents a single file (or a portion of one) within a partition, rather than the entire partition itself. So, to aggregate across an entire partition, I tried the following:
import pyarrow as pa
import pyarrow.dataset as ds

# Create a FileSystemDataset partitioned on the "year" directory level
dataset = ds.dataset(
    source_path,
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("year", pa.int16())])),
)

# Define the partition to aggregate across
partition_key = 2015

# Get the fragments for the specified partition.
# partition_expression is a property holding an Expression such as (year == 2015),
# so I pull the year value out of it rather than comparing the expression to an int.
# (ds.get_partition_keys is public in recent pyarrow releases; older versions only
# had the private ds._get_partition_keys.)
fragments = [
    fragment
    for fragment in dataset.get_fragments()
    if ds.get_partition_keys(fragment.partition_expression).get("year") == partition_key
]

# Iterate through the fragments to collect data
collected_data = []
for fragment in fragments:
    collected_data.append(fragment.to_table())

# Concatenate collected data into a single table and aggregate it
if collected_data:
    aggregated_data = pa.concat_tables(collected_data)
    # "your_aggregation_column" / "your_value_column" are placeholders
    result = aggregated_data.group_by("your_aggregation_column").aggregate(
        [("your_value_column", "sum")]
    )
    # Continue with further analysis or processing on result
However, FileSystemDataset does not seem to expose a way to list all of the partition keys up front, which this approach would need in order to cover every year. Is there a better way of doing this?
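For context, the closest workaround I can think of is to derive the set of years from the fragments themselves and then aggregate one year at a time with a partition filter, roughly like this (reusing dataset and the placeholder column names from above):

# Derive the distinct partition values by inspecting every fragment
years = sorted(
    {ds.get_partition_keys(frag.partition_expression)["year"]
     for frag in dataset.get_fragments()}
)

results = {}
for year in years:
    # The filter is pushed down to the partition directories, so only one
    # year's files should be read into memory at a time
    table = dataset.to_table(filter=ds.field("year") == year)
    results[year] = table.group_by("your_aggregation_column").aggregate(
        [("your_value_column", "sum")]
    )

But this still requires a full pass over get_fragments() just to discover the keys, which feels like it is working against the API.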