Improve performance (vectorize?) of pandas.groupby.aggregate

I'm trying to improve the performance of a pandas.groupby.aggregate operation that uses a custom aggregating function. I noticed that - correct me if I'm wrong - pandas calls the aggregating function on each group in sequence (I suspect a simple for-loop).

Since pandas is heavily based on numpy, is there a way to speed up the calculation using numpy's vectorization features?

My code

In my code I need to aggregate wind data by averaging samples together. While averaging wind speeds is trivial, averaging wind directions requires more ad-hoc code (e.g. the average of 1 deg and 359 deg is 0 deg, not 180 deg).

What my aggregating function does is:

  1. remove NaNs
  2. return NaN if no other value is present
  3. check if a special flag indicating variable wind direction is present. If it is, return the flag
  4. average the wind directions with a vector-averaging algorithm

The function is:

def meandir(x):
    '''
    Parameters
    ----------
    x : pandas.Series
        pandas series to be averaged

    Returns
    -------
    float
        averaged wind direction
    '''

    # Removes the NaN from the recording
    x = x.dropna()

    # If the record is empty, return NaN
    if len(x)==0:
        return np.nan

    # If the record contains variable samples (990) return variable (990)
    elif np.any(x == 990):
        return 990

    # Otherwise sum the vectors and return the angle
    else:
        angle = np.rad2deg(
                           np.arctan2(
                                   np.sum(np.sin(np.deg2rad(x))),
                                   np.sum(np.cos(np.deg2rad(x)))
                                     )
                          )

        #Wrap angles from (-pi,pi) to (0,360)
        return (angle + 360) % 360

You can test it with:

from timeit import repeat
import pandas as pd
import numpy as np

N_samples = int(1e4)
N_nan = N_var = int(0.02 * N_samples)

# Generate random data
data = np.random.rand(N_samples,2) * [30, 360]
data[np.random.choice(N_samples, N_nan), 1] = np.nan
data[np.random.choice(N_samples, N_var), 1] = 990

# Create dataset
df = pd.DataFrame(data, columns=['WindSpeed', 'WindDir'])
df.index = pd.date_range(start='2000-01-01 00:00', periods=N_samples, freq='10min')

# Run groupby + aggregate
grouped = df.groupby(pd.Grouper(freq='H'))   # Hourly bins: data from 15:00 to 15:59 are grouped under the 15:00 label
aggfuns1 = {'WindSpeed': np.mean, 'WindDir':meandir}
aggfuns2 = {'WindSpeed': np.mean, 'WindDir':np.mean}

res = repeat(stmt='grouped.agg(aggfuns1)', globals=globals(), number=1, repeat=10)
print(f'With custom aggregating function {min(res)*1000:.2f} ms')

res = repeat(stmt='grouped.agg(aggfuns2)', globals=globals(), number=1, repeat=10)
print(f'Without custom aggregating function {min(res)*1000:.2f} ms')

which on my PC for N_samples=1e4 outputs:

With custom aggregating function 1500.79 ms
Without custom aggregating function 2.08 ms

i.e. the custom aggregating function is roughly 750 times slower. With N_samples=1e6 it outputs:

With custom aggregating function 142967.17 ms
Without custom aggregating function 21.92 ms

i.e. the custom aggregating function is now roughly 6500 times slower!

Is there a way to speed up this line of code?

1 Answer

Pierre D (accepted answer):

The key is to vectorize as much as possible on the whole df, and let groupby use only built-in methods.

Here is a way to do that. The trick is to convert the angles to complex numbers (unit vectors), which numpy will happily sum (groupby will sum them too, but it refuses to mean() them). So we convert the angles to complex, sum per group, then convert back to angles. This is the same circular (vector) mean of angles that your code computes, described on Wikipedia's "Mean of circular quantities" page.
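For instance, with this trick the mean of 1 deg and 359 deg comes out near 0 deg rather than 180 deg, matching the example in the question (a minimal illustration, not part of the answer's code):

angles = pd.Series([1.0, 359.0])
z = np.exp(np.deg2rad(angles) * 1j).sum()   # sum of unit vectors in the complex plane
np.angle(z, deg=True) % 360                 # ~0 deg (up to floating-point noise), not 180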

The handling of the special value (990) can be vectorized too: comparing s.groupby(...).count() with s.replace(val, nan).groupby(...).count() finds all the groups that contain at least one such value.

Anyway, here goes:

def to_complex(s):
    return np.exp(np.deg2rad(s) * 1j)

def to_angle(s):
    return np.angle(s, deg=True) % 360

def mask_val(s, grouper, val=990):
    return s.groupby(grouper).count() != s.replace(val, np.nan).groupby(grouper).count()

def myagg(df, grouper, val=990, winddir='WindDir'):
    s = df[winddir]
    mask = mask_val(s, grouper, val)
    gb = to_complex(s).groupby(grouper)
    s = gb.sum()
    cnt = gb.count()
    s = to_angle(s) * (cnt / cnt)  # cnt/cnt is NaN (0/0) for all-NaN groups, 1 otherwise
    s[mask] = val
    
    # other columns
    agg = df.groupby(grouper).mean()
    agg[winddir] = s

    return agg

Application:

For convenience, I put your example generation into a function gen_example(N_samples).
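gen_example is not listed in the answer; a minimal version that simply wraps the question's data-generation code could look like this:

def gen_example(N_samples):
    '''Generate the random wind data from the question as a DataFrame.'''
    N_samples = int(N_samples)
    N_nan = N_var = int(0.02 * N_samples)
    data = np.random.rand(N_samples, 2) * [30, 360]
    data[np.random.choice(N_samples, N_nan), 1] = np.nan   # ~2% missing directions
    data[np.random.choice(N_samples, N_var), 1] = 990      # ~2% 'variable' (990) flags
    df = pd.DataFrame(data, columns=['WindSpeed', 'WindDir'])
    df.index = pd.date_range(start='2000-01-01 00:00', periods=N_samples, freq='10min')
    return df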

df = gen_example(50)
myagg(df, pd.Grouper(freq='H'))

Out[ ]:
                     WindSpeed     WindDir
2000-01-01 00:00:00  12.991717  354.120464
2000-01-01 01:00:00  15.743056   60.813629
2000-01-01 02:00:00  14.593927  245.487383
2000-01-01 03:00:00  17.836368  131.493675
2000-01-01 04:00:00  18.987296   27.150359
2000-01-01 05:00:00  16.415725  194.923399
2000-01-01 06:00:00  20.881816  990.000000
2000-01-01 07:00:00  15.033480   44.626018
2000-01-01 08:00:00  16.276834   29.252459

Speed:

df = gen_example(10_000)
%timeit myagg(df, pd.Grouper(freq='H'))

Out[ ]:
6.76 ms ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

df = gen_example(1e6)
%timeit myagg(df, pd.Grouper(freq='H'))

Out[ ]:
189 ms ± 425 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Testing:

idx = [0] * 4
grouper = pd.Grouper(level=0)

myagg(pd.DataFrame({'WindDir': [170, 170, 178, 182]}, index=idx), grouper)
      WindDir
0  174.998473

myagg(pd.DataFrame({'WindDir': [330, 359, 1, 40]}, index=idx), grouper)
    WindDir
0  2.251499

myagg(pd.DataFrame({'WindDir': [330, 359, 1, np.nan]}, index=idx), grouper)
      WindDir
0  350.102878

myagg(pd.DataFrame({'WindDir': [np.nan, np.nan, np.nan, np.nan]}, index=idx), grouper)
   WindDir
0      NaN

myagg(pd.DataFrame({'WindDir': [330, 990, 1, np.nan]}, index=idx), grouper)
   WindDir
0    990.0