Row-wise aggregation of Pandas dataframe


What is the most Pythonic way to write a function that performs a row-wise aggregation (sum, min, max, mean, etc.) over a specified set of columns (column names given as a list) of a pandas DataFrame while skipping NaN values?

import pandas as pd
import numpy as np

df = pd.DataFrame({"col1": [1, np.nan, 1],
                   "col2": [2, 2, np.nan]})

def aggregate_rows(df, column_list, func):
    # Check if the specified columns exist in the DataFrame
    missing_columns = [col for col in column_list if col not in df.columns]
    if missing_columns:
        raise ValueError(f"Columns not found in DataFrame: {missing_columns}")

    # Check if func is callable
    if not callable(func):
        raise ValueError("The provided function is not callable.")

    # Apply func to each row of the specified columns, ignoring NaN values
    agg_series = df[column_list].apply(lambda row: func(row.dropna()), axis=1)

    return agg_series

df["sum"] = aggregate_rows(df, ["col1", "col2"], sum)
df["max"] = aggregate_rows(df, ["col1", "col2"], max)
df["mean"] = aggregate_rows(df, ["col1", "col2"], lambda x: x.mean())
print(df)

This gives the expected result:

   col1  col2  sum  max  mean
0   1.0   2.0  3.0  2.0   1.5
1   NaN   2.0  2.0  2.0   2.0
2   1.0   NaN  1.0  1.0   1.0

However, a DataFrame that contains a row with only NaN values,

df = pd.DataFrame({"col1": [1, np.nan, 1, np.nan],
                   "col2": [2, 2, np.nan, np.nan]})

results in:

ValueError: max() arg is an empty sequence

What is the best way to fix this?

There are 3 answers

Andrej Kesely On

You can use numpy.sum, numpy.max and numpy.mean instead of Python's built-ins; they do not raise on the empty row:

df["sum"] = aggregate_rows(df, ["col1", "col2"], np.sum)
df["max"] = aggregate_rows(df, ["col1", "col2"], np.max)
df["mean"] = aggregate_rows(df, ["col1", "col2"], np.mean)

print(df)

Prints:

   col1  col2  sum  max  mean
0   1.0   2.0  3.0  2.0   1.5
1   NaN   2.0  2.0  2.0   2.0
2   1.0   NaN  1.0  1.0   1.0
3   NaN   NaN  0.0  NaN   NaN
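
As a possible alternative to passing functions through apply, the same row-wise numbers can be computed with the DataFrame reduction methods, which skip NaN by default. This is only a sketch on the sample data from the question; the variable names (cols, row_sum and so on) are made up for illustration. The min_count=1 argument is an optional extra that makes the sum of an all-NaN row NaN instead of 0.0:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, np.nan, 1, np.nan],
                   "col2": [2, 2, np.nan, np.nan]})
cols = ["col1", "col2"]  # illustrative name for the column list

# skipna=True is the default for these reductions
row_sum = df[cols].sum(axis=1)                    # all-NaN row gives 0.0
row_sum_nan = df[cols].sum(axis=1, min_count=1)   # all-NaN row gives NaN
row_max = df[cols].max(axis=1)                    # all-NaN row gives NaN
row_mean = df[cols].mean(axis=1)                  # all-NaN row gives NaN

print(pd.DataFrame({"sum": row_sum, "sum_min_count": row_sum_nan,
                    "max": row_max, "mean": row_mean}))
```

Whether 0.0 or NaN is the right sum for an empty row depends on what the result is used for, so both variants are shown.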
ouroboros1 On

You can use df.agg over axis=1, and add the result to your original df via df.join:

out = df.join(df.agg(['sum', 'max', 'mean'], axis=1))

out

   col1  col2  sum  max  mean
0   1.0   2.0  3.0  2.0   1.5
1   NaN   2.0  2.0  2.0   2.0
2   1.0   NaN  1.0  1.0   1.0
3   NaN   NaN  0.0  NaN   NaN
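
Note that df.agg as written aggregates over every column of df. If only some columns should be included, as the question asks, one option is to select them first. The sketch below assumes a hypothetical extra column named "other" that must stay out of the aggregation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, np.nan, 1, np.nan],
                   "col2": [2, 2, np.nan, np.nan],
                   "other": [10, 20, 30, 40]})  # hypothetical extra column
cols = ["col1", "col2"]

# Aggregate only the selected columns row-wise, then join back onto df
out = df.join(df[cols].agg(["sum", "max", "mean"], axis=1))
print(out)
```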
mozway On

If you want to ignore rows that contain only NaNs, drop them before aggregating with dropna(how='all'):

cols = ['col1', 'col2']
agg = ['sum', 'max', 'mean']

df[agg] = df[cols].dropna(how='all').agg(agg, axis=1)

A more robust variant, which uses boolean indexing and also works if the index contains duplicates:

cols = ['col1', 'col2']
agg = ['sum', 'max', 'mean']

m = df[cols].notna().any(axis=1)

df.loc[m, agg] = df.loc[m, cols].agg(agg, axis=1)

NB. You can also use custom names for the columns in which to assign the output: df.loc[m, ['A', 'B', 'C']] = ....

Output:

   col1  col2  sum  max  mean
0   1.0   2.0  3.0  2.0   1.5
1   NaN   2.0  2.0  2.0   2.0
2   1.0   NaN  1.0  1.0   1.0
3   NaN   NaN  NaN  NaN   NaN
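
If the original aggregate_rows helper should be kept, including its support for Python built-ins such as sum and max, the same "all-NaN row gives NaN" behaviour can be sketched by guarding the empty case inside the function. This is one possible fix, not the only one; the inner helper name safe_func is made up, and raising TypeError for a non-callable is a small change from the question's ValueError:

```python
import numpy as np
import pandas as pd

def aggregate_rows(df, column_list, func):
    # Check that the specified columns exist in the DataFrame
    missing_columns = [col for col in column_list if col not in df.columns]
    if missing_columns:
        raise ValueError(f"Columns not found in DataFrame: {missing_columns}")
    if not callable(func):
        raise TypeError("The provided function is not callable.")

    def safe_func(row):
        # Drop NaNs; if nothing is left, return NaN instead of calling func
        values = row.dropna()
        return func(values) if len(values) else np.nan

    return df[column_list].apply(safe_func, axis=1)

df = pd.DataFrame({"col1": [1, np.nan, 1, np.nan],
                   "col2": [2, 2, np.nan, np.nan]})

row_sum = aggregate_rows(df, ["col1", "col2"], sum)
row_max = aggregate_rows(df, ["col1", "col2"], max)
row_mean = aggregate_rows(df, ["col1", "col2"], lambda x: x.mean())
print(pd.DataFrame({"sum": row_sum, "max": row_max, "mean": row_mean}))
```

Because apply with axis=1 calls a Python function once per row, this is likely to be slower on large frames than the agg-based versions above.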