I am involved in a data-mining project and have run into some problems while doing feature engineering. One of my goals is to aggregate data by the primary key and produce new columns. So I wrote this:
df = df.group_by("case_id").agg(date_expr(df, df_base))

def date_expr(df, df_base):
    # Join df and df_base on 'case_id' column
    df = df.join(df_base[['case_id', 'date_decision']], on="case_id", how="left")
    # Convert every date column (name ending in "D") to days relative to date_decision
    for col in df.columns:
        if col[-1] in ("D",):
            df = df.with_columns(pl.col(col) - pl.col("date_decision"))
            df = df.with_columns(pl.col(col).dt.total_days())
    cols = [col for col in df.columns if col[-1] in ("D",)]
    # Generate expressions for max, min, mean, mode, and std of date differences
    expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]
    expr_min = [pl.min(col).alias(f"min_{col}") for col in cols]
    expr_mean = [pl.mean(col).alias(f"mean_{col}") for col in cols]
    expr_mode = [pl.mode(col).alias(f"mode_{col}") for col in cols]
    expr_std = [pl.std(col).alias(f"std_{col}") for col in cols]
    return expr_max + expr_min + expr_mean + expr_mode + expr_std
However, this raises an error: AttributeError: module 'polars' has no attribute 'mode'.
I looked through the Polars documentation on GitHub and found that there is no DataFrame.mode(), only Series.mode(), which I think might be the cause of the error. I also asked ChatGPT, but it could not help, since the faulty code came from it in the first place.
Besides, this is only an example dealing with a float type. What about a string type? Can I apply your method there as well?
I look forward to your kind help!
In your example it fails because there is no syntactic sugar for Expr.mode() as there is for the aggregate functions (for example, pl.max() is syntactic sugar for Expr.max()). mode() is actually not an aggregation function but a computation one, which means it just calculates the most frequently occurring value(s) within the column, so it can return more than one value.
So, given a DataFrame, you can calculate mode() by calling it on a column expression, pl.col(...).mode(), rather than on the pl namespace.
Given that, we can still calculate the results you need. I'd simplify your function a bit by using selectors and Expr.prefix(). Note that I apply Expr.first() to the result of mode() to pick one of the values, as there might be several values with the same frequency. You can use list expressions to specify which one you'd like to get.