Partition by column value in rust using Arrow/Datafusion/Polars (like python panda's groupby)?

1.5k views Asked by At

I am looking for an equivalent of the convenient python panda syntax:

#df is a pandas dataframe

for fruit, sub_df in df.groupby('fruits'):
    # Do some stuff with sub_df and fruit

It is basically a groupby, where each group can be accessed as a single dataframe alongside its label (the common value in the grouping column).

I had a look to data fusion but I can't reproduce this behavior without having to first select all the unique values and second execute one select par value which result to re-parsing the whole file multiple times. I had a look to the Polars crate which seamed promising but wan't able to reach my goal either.

How would you do this in similar/better performance as the python code? I am open to any syntax / library / approche that would allow me efficiently to partition the parquet file by values of a fixed column.


Here is a rust sample code using polar as an example of what kind of input I am dealing with:

    let s0 = Series::new("fruits", ["Apple", "Apple", "Pear", "Pear", "Pear", "Pear"].as_ref());
    let s1 = Series::new("maturity", ["A", "B", "A", "C", "A", "D"].as_ref());
    let s1 = Series::new("N", [1, 2, 2, 4, 2, 8].as_ref());
    // create a new DataFrame
    let df = DataFrame::new(vec![s0, s1, s2]).unwrap();

    // I would like to loop on all fruits values, each time with a dataframe containing only the records with this fruit.
2

There are 2 answers

2
loremdipso On

You can do something like:

    let columns = df.columns();
    if let Ok(grouped) = df.groupby("fruits") {
        let sub_df = grouped.select(columns).agg_list()?;
        dbg!(sub_df);
    }

And that'll technically leave you with a dataframe. The problem then is that the dataframe's columns are arrays with all the values for each fruit, which might not be what you want.

+---------+--------------------------------------+-----------------------------+-----------------+
| fruits  | fruits_agg_list                      | maturity_agg_list           | N_agg_list      |
| ---     | ---                                  | ---                         | ---             |
| str     | list [str]                           | list [str]                  | list [i32]      |
+=========+======================================+=============================+=================+
| "Apple" | "[\"Apple\", \"Apple\"]"             | "[\"A\", \"B\"]"            | "[1, 2]"        |
+---------+--------------------------------------+-----------------------------+-----------------+
| "Pear"  | "[\"Pear\", \"Pear\", ... \"Pear\"]" | "[\"A\", \"C\", ... \"D\"]" | "[2, 4, ... 8]" |
+---------+--------------------------------------+-----------------------------+-----------------+
4
ritchie46 On

If you activate the partition_by feature, polars exposes

DataFrame::partition_by and DataFrame::partition_by_stable.

use polars::prelude::*;

fn main() -> Result<()> {
    let partitioned = df! {
        "fruits" => &["Apple", "Apple", "Pear", "Pear", "Pear", "Pear"],
        "maturity" => &["A", "B", "A", "C", "A", "D"],
        "N" => &[1, 2, 2, 4, 2, 8]

    }?.partition_by(["fruits"])?;

    dbg!(partitioned);
    Ok(())
}

Running this prints:

[src/main.rs:17] partitioned = [
    shape: (4, 3)
    ┌────────┬──────────┬─────┐
    │ fruits ┆ maturity ┆ N   │
    │ ---    ┆ ---      ┆ --- │
    │ str    ┆ str      ┆ i32 │
    ╞════════╪══════════╪═════╡
    │ Pear   ┆ A        ┆ 2   │
    ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
    │ Pear   ┆ C        ┆ 4   │
    ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
    │ Pear   ┆ A        ┆ 2   │
    ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
    │ Pear   ┆ D        ┆ 8   │
    └────────┴──────────┴─────┘,
    shape: (2, 3)
    ┌────────┬──────────┬─────┐
    │ fruits ┆ maturity ┆ N   │
    │ ---    ┆ ---      ┆ --- │
    │ str    ┆ str      ┆ i32 │
    ╞════════╪══════════╪═════╡
    │ Apple  ┆ A        ┆ 1   │
    ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
    │ Apple  ┆ B        ┆ 2   │
    └────────┴──────────┴─────┘,
]