Best practice to use pyo3-polars with `group_by`


I'm currently in the process of experimenting with pyo3-polars to optimize data aggregation. In a more abstract sense, what I have in mind is the following structure:

df.group_by(c.col1).agg(c.col2.foo())

Here, foo represents a nontrivial function that produces an f64 result. Concretely, foo can be illustrated by the following pseudocode:

agg = 0.0
for i in 0..len(col2):
    for j in i+1..len(col2):
        agg = func(agg, col2[i], col2[j])
return agg
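To make the shape of the computation concrete, here is the pairwise loop as plain Rust over a slice. The kernel (sum of absolute pairwise differences) is a hypothetical stand-in for the real func, which the question does not specify:

```rust
// Hypothetical pairwise aggregation over a plain slice. The kernel
// (sum of absolute differences) is only a placeholder for the real `func`.
fn pairwise_agg(col2: &[f64]) -> f64 {
    let mut agg = 0.0;
    for i in 0..col2.len() {
        for j in (i + 1)..col2.len() {
            // Combine the running value with every unordered pair (i, j).
            agg += (col2[i] - col2[j]).abs();
        }
    }
    agg
}
```

Note the O(n²) pair count: for each group of size n, func is invoked n·(n−1)/2 times, which is why pushing this into Rust is attractive in the first place.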

The current implementation I have is designed to return a Series:

#[polars_expr(output_type=Int64)]
fn int_agg(inputs: &[Series]) -> PolarsResult<Series> {
    let ca = inputs[0].i64()?;

    // Fake approximation of the actual O(n^2) task: a plain sum.
    // Note: value_unchecked skips the validity bitmap, so this assumes
    // the column contains no nulls.
    let mut tot: i64 = 0;
    for i in 0..ca.len() {
        unsafe {
            tot += ca.value_unchecked(i);
        }
    }

    let out: Int64Chunked = std::iter::once(Some(tot + 2)).collect_ca(ca.name());
    Ok(out.into_series())
}
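As an aside, the `unsafe` indexing loop above silently reads garbage for null entries. A null-safe equivalent iterates `Option<i64>` values, which is the shape `ChunkedArray::into_iter` yields; the helper below sketches that pattern over a plain slice of options:

```rust
// Null-safe equivalent of the summation loop: skip None entries instead of
// reading them unchecked. Mirrors iterating a ChunkedArray's Option values.
fn sum_nonnull(vals: &[Option<i64>]) -> i64 {
    vals.iter().flatten().sum()
}
```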

However, I have significant reservations about whether this is the best approach for this particular task. All the examples in the repository return PolarsResult<Series>, which forces me to create a Series containing a single element. Later, in the Python session, I have to call foo().first() to extract that element because of this artificial constraint. Is there a better alternative that directly returns a single element?

Moreover, I am potentially interested in a more general solution where the aggregation function accepts multiple columns, such as df.group_by("col1").agg(foo("col2", "col3")).
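For the multi-column case, the plugin machinery would presumably hand each column as an entry of inputs: &[Series]; the kernel itself then just walks two sequences in lockstep. A sketch of such a two-column pairwise kernel over plain slices (the product-of-differences kernel is a hypothetical placeholder):

```rust
// Hypothetical two-column pairwise kernel. In a real plugin, `x` and `y`
// would come from inputs[0] and inputs[1]; the kernel here (product of
// pairwise differences) is only a placeholder for the real `foo`.
fn pairwise_agg2(x: &[f64], y: &[f64]) -> f64 {
    let n = x.len().min(y.len());
    let mut agg = 0.0;
    for i in 0..n {
        for j in (i + 1)..n {
            agg += (x[i] - x[j]) * (y[i] - y[j]);
        }
    }
    agg
}
```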
