Returning multiple values from pandas apply on a DataFrame

50.9k views Asked by At

I'm using a Pandas DataFrame to do a row-wise t-test as per this example:

import numpy as np
import pandas as pd
  
df = pd.DataFrame(np.log2(np.randn(1000, 4), columns=["a", "b", "c", "d"]).dropna()

Now, suppose I have "a" and "b" as one group, and "c" and "d" at the other, I'm performing the t-test row-wise. This is fairly trivial with pandas, using apply with axis=1. However, I can either return a DataFrame of the same shape if my function doesn't aggregate, or a Series if it aggregates.

Normally I would just output the p-value (so, aggregation) but I would like to generate an additional value based on other calculations (in other words, return two values). I can of course do two runs, aggregating the p-values first, then doing the other work, but I was wondering if there is a more efficient way to do so as the data is reasonably large.

As an example of the calculation, a hypothetical function would be:

from scipy.stats import ttest_ind

def t_test_and_mean(series, first, second):
    first_group = series[first]
    second_group = series[second]
    _, pvalue = ttest_ind(first_group, second_group)

    mean_ratio = second_group.mean() / first_group.mean()
    
    return (pvalue, mean_ratio)

Then invoked with

df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], axis=1)

Of course in this case it returns a single Series with the two tuples as value.

Instead, my expected output would be a DataFrame with two columns, one for the first result, and one for the second. Is this possible or I have to do two runs for the two calculations, then merge them together?

2

There are 2 answers

6
Garrett On

Returning a Series, rather than tuple, should produce a new multi-column DataFrame. For example,

return pandas.Series({'pvalue': pvalue, 'mean_ratio': mean_ratio})
0
cottontail On

"Better" solutions for apply(axis=1)

apply has result_type= parameter that can expand a result into a dataframe. For OP's case, that would look like the following (note that the original function doesn't need to be touched):

df[['pvalue', 'mean_ratio']] = df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], result_type='expand', axis=1)

Casting each row into a pandas Series is painfully slow (for a frame with 10k rows, it takes 20 seconds). A faster solution is to convert the values returned from an apply call into a list and cast into a DataFrame once (or assign back to the dataframe). Or use a Python loop for an even faster solution (how that can be written is shown at the end of this post).

For the case in the OP, that would look like the following (again, the original function shouldn't be altered).

df[['pvalue', 'mean_ratio']] = df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], axis=1).values.tolist()

# or create a new frame
new_df = pd.DataFrame(df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], axis=1).values.tolist(), index=df.index, columns=['pvalue', 'mean_ratio'])

Expanding groupby.apply

The same can done for functions called via groupby.apply as well. Simply convert the result into a list and cast into a new dataframe. For example, if we call the function in the OP in a groupby call, the result could be fixed up as follows:

# sample data
df = pd.DataFrame(np.random.randn(1000, 4), columns=["a", "b", "c", "d"])
df['grouper'] = list(range(10))*100

# perform groupby
x = df.groupby('grouper').apply(t_test_and_mean, first="a", second="c")
# convert to the groupby result into a list and cast into a dataframe
# in order to not lose any information, assign index and axis name appropriately
agg_df = pd.DataFrame(x.tolist(), index=x.index, columns=['pvalue', 'mean_ratio']).rename_axis(x.index.name)


Other solutions for apply(axis=1)

Another solution (slower than converting to list) is to chain an .apply(pd.Series) call:

df[['pvalue', 'mean_ratio']] = df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], axis=1).apply(pd.Series)

Since .apply(axis=1) is syntactic sugar for a Python loop, the biggest speed up would be to convert the frame into a list, re-write the function into one that works with Python lists and just use a list comprehension (this would speed up the process about 6 times). For the example in the OP, that would look like:

def t_test_and_mean_on_lists(first_group, second_group):
    _, pvalue = ttest_ind(first_group, second_group)
    mean_ratio = np.mean(second_group) / np.mean(first_group)
    return (pvalue, mean_ratio)

df[['pvalue', 'mean_ratio']] = [t_test_and_mean_on_lists(ab, cd) for ab, cd in zip(df[['a','b']].values.tolist(), df[['c','d']].values.tolist())]

Decorating this function with numba.njit would make it even faster but that's outside the scope of this question.