I'm experimenting with pyo3-polars to optimize a data aggregation. Abstractly, the structure I have in mind is:

```python
df.group_by(c.col1).agg(c.col2.foo())
```
Here, `foo` represents a nontrivial function that produces an `f64` result. More concretely, `foo` can be illustrated by the following pseudocode:
```
agg = 0.0
for i in 0..len(col2):
    for j in i+1..len(col2):
        agg = func(agg, col2[i], col2[j])
return agg
```
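For illustration, here is a plain-Rust sketch of that pairwise loop outside of polars, with a stand-in for `func` (here: accumulating the absolute difference of each pair; the real function is more involved):

```rust
// Pairwise O(n^2) aggregation over a slice, mirroring the pseudocode above.
// The absolute-difference accumulation is a stand-in for the real `func`.
fn pairwise_agg(col2: &[f64]) -> f64 {
    let mut agg = 0.0;
    for i in 0..col2.len() {
        for j in (i + 1)..col2.len() {
            // stand-in for: agg = func(agg, col2[i], col2[j])
            agg += (col2[i] - col2[j]).abs();
        }
    }
    agg
}

fn main() {
    // pairs: |1-3| + |1-6| + |3-6| = 2 + 5 + 3 = 10
    println!("{}", pairwise_agg(&[1.0, 3.0, 6.0]));
}
```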
My current implementation returns a `Series`:
```rust
#[polars_expr(output_type=Int64)]
fn int_agg(inputs: &[Series]) -> PolarsResult<Series> {
    let ca = inputs[0].i64()?;
    let mut v = Vec::new();
    let mut tot: i64 = 0;
    for i in 0..ca.len() {
        // Fake approximation of the actual task
        unsafe {
            tot += ca.value_unchecked(i);
        }
    }
    v.push(tot);
    let out: Int64Chunked = v.into_iter().map(|i| Some(i + 2)).collect_ca(ca.name());
    Ok(out.into_series())
}
```
However, I have significant reservations about whether this is the best approach for this particular task. All the examples in the repository return `PolarsResult<Series>`, which forces me to create a `Series` containing a single element. Later, in the Python session, I have to call `foo().first()` to extract that element because of this artificial constraint. Is there a better alternative that directly returns a single element?
Moreover, I am also interested in a more general solution where the aggregation function accepts multiple columns, such as `df.group_by("col1").agg(foo("col2", "col3"))`.