How to iterate through all partitions in a PyArrow FileSystemDataset for aggregations

I have a parquet dataset partitioned by year, and I am trying to iterate through each year partition to do aggregations on each one, instead of loading the entire dataset into memory at once.

Each fragment yielded by the get_fragments() method typically represents a single file, or a portion of a single file, within a partition, rather than the entire partition itself. So, to aggregate across an entire partition, I tried the following:

import pyarrow as pa
import pyarrow.dataset as ds

# Create a FileSystemDataset partitioned on "year"
dataset = ds.dataset(
    source_path,
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("year", pa.int16())])),
)

# Define the partition to aggregate across
partition_key = 2015

# Get the fragments for the specified partition. partition_expression is a
# property (an Expression), not a callable, so filter on the partition
# field instead of comparing the expression to an int.
fragments = list(dataset.get_fragments(filter=ds.field("year") == partition_key))

# Initialize a list to collect data
collected_data = []

# Iterate through the fragments to collect data
for fragment in fragments:
    data = fragment.to_table()
    collected_data.append(data)

# Concatenate collected data into a single table
if collected_data:
    aggregated_data = pa.concat_tables(collected_data)

    # Perform the aggregation operation on aggregated_data
    result = aggregated_data.group_by("your_aggregation_column").aggregate(
        [("your_value_column", "sum")]  # ("column", "function") pairs; names are placeholders
    )

    # Continue with your further analysis or processing on the result

However, FileSystemDataset does not seem to expose a way to list all of its partition keys, which this approach would need. Is there a better way of doing this?
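
For reference, the closest workaround I have found is to discover the distinct partition values myself by scanning only the partition column, and then filtering the dataset once per value. This is just a sketch: it reuses source_path from above, and your_aggregation_column / your_value_column are placeholder column names.

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

dataset = ds.dataset(
    source_path,
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("year", pa.int16())])),
)

# Discover the distinct partition values by scanning only the "year" column.
# The values come from the directory names, but this still materializes one
# int16 per row.
years = pc.unique(dataset.to_table(columns=["year"])["year"]).to_pylist()

# Aggregate one partition at a time so only one year is in memory at once.
for year in years:
    partition_table = dataset.to_table(filter=ds.field("year") == year)
    result = partition_table.group_by("your_aggregation_column").aggregate(
        [("your_value_column", "sum")]  # placeholder column names
    )
    # ... use `result` for this year ...

Each dataset.to_table(filter=...) call only reads the files whose partition expression matches, so at most one year's data is materialized at a time, but the extra scan just to discover the years feels redundant.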
