I'm using a Pandas DataFrame to do a row-wise t-test as per this example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.log2(np.random.randn(1000, 4)), columns=["a", "b", "c", "d"]).dropna()
Now suppose "a" and "b" form one group and "c" and "d" the other, and I want to perform the t-test row-wise. This is fairly trivial with pandas, using apply with axis=1. However, apply returns either a DataFrame of the same shape if my function doesn't aggregate, or a Series if it does.
Normally I would just output the p-value (so, aggregation) but I would like to generate an additional value based on other calculations (in other words, return two values). I can of course do two runs, aggregating the p-values first, then doing the other work, but I was wondering if there is a more efficient way to do so as the data is reasonably large.
As an example of the calculation, a hypothetical function would be:
from scipy.stats import ttest_ind
def t_test_and_mean(series, first, second):
    first_group = series[first]
    second_group = series[second]
    _, pvalue = ttest_ind(first_group, second_group)
    mean_ratio = second_group.mean() / first_group.mean()
    return (pvalue, mean_ratio)
Then invoked with
df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], axis=1)
Of course, in this case it returns a single Series whose values are (pvalue, mean_ratio) tuples.
Instead, my expected output would be a DataFrame with two columns, one for the first result and one for the second. Is this possible, or do I have to do two runs for the two calculations and then merge them together?
Returning a Series, rather than a tuple, should produce a new multi-column DataFrame. For example,
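the question's function could be adapted roughly as in the sketch below. The column labels "pvalue" and "mean_ratio" are my own choice, and the sample data is made up (strictly positive uniform values from a seeded generator, so no NaNs appear); neither comes from the original post.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def t_test_and_mean(series, first, second):
    # Split the row into the two groups of columns.
    first_group = series[first]
    second_group = series[second]
    _, pvalue = ttest_ind(first_group, second_group)
    mean_ratio = second_group.mean() / first_group.mean()
    # Returning a labelled Series (not a tuple) lets apply build one
    # output column per label. The labels here are illustrative.
    return pd.Series({"pvalue": pvalue, "mean_ratio": mean_ratio})


# Illustrative data only: positive values, fixed seed.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(1.0, 10.0, size=(100, 4)),
                  columns=["a", "b", "c", "d"])

result = df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], axis=1)
print(result.head())
```

Here result should be a DataFrame with one row per input row and the two columns pvalue and mean_ratio, computed in a single pass. If you would rather keep the tuple return, passing result_type="expand" to apply (pandas 0.23 or later) should also split the tuple into columns, which are then labelled 0 and 1.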