Faster way of computing the mean with pandas groupby + apply and condensing groups


I want to group by two columns and, if a group contains more than one row, return only the first row of the group with the value of c replaced by the group mean. If a group has only one row, I want to return it unchanged. My code looks like this:

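def condense(df):
    # Groups with more than one row: keep the first row, with c set to the group mean
    if df.shape[0] > 1:
        record = df.iloc[[0]].copy()  # copy so the assignment does not write to a slice of df
        record["c"] = df["c"].mean()
        return record
    # Single-row groups are returned unchanged
    return df

final = df.groupby(["a", "b"]).apply(condense).drop(['a', 'b'], axis=1).reset_index()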

And the df looks something like this:

a      b     c   d
"f"   "e"    2   True
"f"   "e"    3   False
"c"   "a"    1   True

The data frame is quite large (73800 groups), and the whole groupby + apply takes about a minute. This is far too long. Is there a way to make it run faster?

Answer by jezrael (best answer)

I think the mean of a single value is that value itself, so single-row groups need no special handling. You can therefore simplify the solution with GroupBy.agg, using mean for column c and first for all other columns:

d = dict.fromkeys(df.columns.difference(['a','b']), 'first')
d['c'] = 'mean'
print (d)
{'c': 'mean', 'd': 'first'}

df = df.groupby(["a", "b"], as_index=False).agg(d)
print (df)
   a  b    c     d
0  c  a  1.0  True
1  f  e  2.5  True
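
For anyone who wants to try this end to end, here is a minimal self-contained sketch that rebuilds the sample frame from the question and wraps the same agg call in a helper. The helper name condense_fast and its keys / mean_col parameters are made up for illustration and are not part of pandas or of the original answer.

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    "a": ["f", "f", "c"],
    "b": ["e", "e", "a"],
    "c": [2, 3, 1],
    "d": [True, False, True],
})


def condense_fast(frame, keys=("a", "b"), mean_col="c"):
    """Hypothetical helper: one row per group, mean of mean_col, first of the rest."""
    keys = list(keys)
    # Every non-key column defaults to 'first' ...
    spec = dict.fromkeys(frame.columns.difference(keys), "first")
    # ... except the column that should be averaged
    spec[mean_col] = "mean"
    out = frame.groupby(keys, as_index=False).agg(spec)
    # Restore the original column order
    return out[list(frame.columns)]


final = condense_fast(df)
print(final)
```

One difference from the original condense function that may matter: the 'first' aggregation returns the first non-null value per column, whereas df.iloc[[0]] takes the first row as is, so the two can disagree when the first row of a group contains missing values.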