I am looking for a Rust equivalent of this convenient Python pandas syntax:
# df is a pandas DataFrame
for fruit, sub_df in df.groupby('fruits'):
    # Do some stuff with sub_df and fruit
It is basically a groupby, where each group can be accessed as a single dataframe alongside its label (the common value in the grouping column).
I had a look at DataFusion, but I can't reproduce this behavior without first selecting all the unique values and then executing one select per value, which results in re-parsing the whole file multiple times. I also had a look at the Polars crate, which seemed promising, but I wasn't able to reach my goal with it either.
How would you do this with similar or better performance than the Python code? I am open to any syntax / library / approach that would allow me to efficiently partition the Parquet file by the values of a fixed column.
Here is some Rust sample code using Polars, as an example of the kind of input I am dealing with:
let s0 = Series::new("fruits", ["Apple", "Apple", "Pear", "Pear", "Pear", "Pear"].as_ref());
let s1 = Series::new("maturity", ["A", "B", "A", "C", "A", "D"].as_ref());
let s2 = Series::new("N", [1, 2, 2, 4, 2, 8].as_ref());
// create a new DataFrame
let df = DataFrame::new(vec![s0, s1, s2]).unwrap();
// I would like to loop over all fruit values, each time with a DataFrame containing only the records for that fruit.
You can do something like:
And that'll technically leave you with a dataframe. The problem then is that the dataframe's columns are arrays with all the values for each fruit, which might not be what you want.
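If what you actually want is one sub-table per fruit, the underlying idea is a single pass that routes each record into a bucket keyed by the grouping column. Below is a minimal, dependency-free sketch of that idea using only the standard library. It is not Polars or DataFusion API: the `Row` struct and the `partition_by_fruit` and `sample_rows` helpers are hypothetical names made up for illustration, and a real solution would read the rows from the Parquet file instead of hard-coding them.

```rust
use std::collections::BTreeMap;

/// One record of the example table (hypothetical struct mirroring the columns above).
#[derive(Debug, Clone, PartialEq)]
struct Row {
    fruits: String,
    maturity: String,
    n: i64,
}

/// Split the rows into groups keyed by the `fruits` value.
/// Every row is visited exactly once, so the input is never re-scanned per key.
fn partition_by_fruit(rows: Vec<Row>) -> BTreeMap<String, Vec<Row>> {
    let mut groups: BTreeMap<String, Vec<Row>> = BTreeMap::new();
    for row in rows {
        groups.entry(row.fruits.clone()).or_default().push(row);
    }
    groups
}

/// Build the same sample data as in the question (hard-coded for illustration).
fn sample_rows() -> Vec<Row> {
    let fruits = ["Apple", "Apple", "Pear", "Pear", "Pear", "Pear"];
    let maturity = ["A", "B", "A", "C", "A", "D"];
    let n = [1, 2, 2, 4, 2, 8];
    fruits
        .iter()
        .zip(maturity.iter())
        .zip(n.iter())
        .map(|((f, m), n)| Row {
            fruits: f.to_string(),
            maturity: m.to_string(),
            n: *n,
        })
        .collect()
}

fn main() {
    // Mirrors `for fruit, sub_df in df.groupby('fruits')` from the question.
    for (fruit, sub_rows) in partition_by_fruit(sample_rows()) {
        println!("{fruit}: {} rows", sub_rows.len());
    }
}
```

With the sample data this prints `Apple: 2 rows` followed by `Pear: 4 rows`; a `BTreeMap` is used so the groups come out in sorted key order. As far as I know, Polars also exposes a `partition_by` method on `DataFrame` that returns one `DataFrame` per distinct key, but its signature and feature flag have varied between releases, so check the documentation for the version you are using.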