I'm trying to improve the performance of a pandas.groupby.aggregate operation that uses a custom aggregating function. I noticed that - correct me if I'm wrong - pandas calls the aggregating function on each group in sequence (I suspect it is a simple for-loop).

Since pandas is heavily based on numpy, is there a way to speed up the calculation using numpy's vectorization features?
My code
In my code I need to aggregate wind data by averaging samples together. While averaging wind speeds is trivial, averaging wind directions requires more ad-hoc code (e.g. the average of 1 deg and 359 deg is 0 deg, not 180 deg).
What my aggregating function does is:
- remove NaNs
- return NaN if no other value is present
- check if a special flag indicating variable wind direction is present. If it is, return the flag
- average the wind directions with a vector-averaging algorithm
The function is:
def meandir(x):
    '''
    Parameters
    ----------
    x : pandas.Series
        pandas Series to be averaged

    Returns
    -------
    float
        averaged wind direction
    '''
    # Remove the NaNs from the recording
    x = x.dropna()

    # If the record is empty, return NaN
    if len(x) == 0:
        return np.nan

    # If the record contains variable samples (990), return variable (990)
    elif np.any(x == 990):
        return 990

    # Otherwise sum the unit vectors and return the angle
    else:
        angle = np.rad2deg(
            np.arctan2(
                np.sum(np.sin(np.deg2rad(x))),
                np.sum(np.cos(np.deg2rad(x)))
            )
        )
        # Wrap the angle from (-180, 180] to [0, 360)
        return (angle + 360) % 360
You can test it with:
from timeit import repeat
import pandas as pd
import numpy as np
N_samples = int(1e4)
N_nan = N_var = int(0.02 * N_samples)
# Generate random data
data = np.random.rand(N_samples,2) * [30, 360]
data[np.random.choice(N_samples, N_nan), 1] = np.nan
data[np.random.choice(N_samples, N_var), 1] = 990
# Create dataset
df = pd.DataFrame(data, columns=['WindSpeed', 'WindDir'])
df.index = pd.date_range(start='2000-01-01 00:00', periods=N_samples, freq='10min')
# Run groupby + aggregate
grouped = df.groupby(pd.Grouper(freq='H'))  # Samples from 15:00 to 15:59 are grouped under the 15:00 label
aggfuns1 = {'WindSpeed': np.mean, 'WindDir': meandir}
aggfuns2 = {'WindSpeed': np.mean, 'WindDir': np.mean}
res = repeat(stmt='grouped.agg(aggfuns1)', globals=globals(), number=1, repeat=10)
print(f'With custom aggregating function {min(res)*1000:.2f} ms')
res = repeat(stmt='grouped.agg(aggfuns2)', globals=globals(), number=1, repeat=10)
print(f'Without custom aggregating function {min(res)*1000:.2f} ms')
which on my PC for N_samples=1e4 outputs:
With custom aggregating function 1500.79 ms
Without custom aggregating function 2.08 ms
with the custom aggregating function being 750 times slower
and with N_samples=1e6 outputs:
With custom aggregating function 142967.17 ms
Without custom aggregating function 21.92 ms
with the custom aggregating function being 6500 times slower!
Is there a way to speed up this line of code?
The key is to try to vectorize everything you can on the whole df, and let groupby use only builtin methods.

Here is a way to do that. The trick is to convert the angles to complex numbers, which numpy will happily sum (and groupby too, but groupby will refuse to mean()). So, we convert the angles to complex, sum, then convert back to angles. The same "funny mean" of angles is used as in your code and described on the Wikipedia page you reference.
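As a quick illustration of the trick (a minimal standalone sketch, not yet using pandas): summing the unit vectors and taking the angle of the result gives the circular mean that a plain arithmetic mean gets wrong.

```python
import numpy as np

angles = np.array([350.0, 30.0])           # degrees; a plain mean would give 190
vectors = np.exp(1j * np.deg2rad(angles))  # unit vectors in the complex plane
mean_dir = np.angle(vectors.sum(), deg=True) % 360
print(mean_dir)                            # ~10.0, the correct mean direction
```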
About the handling of the special value (990), it can be vectorized too: comparing s.groupby(...).count() with s.replace(val, nan).groupby(...).count() finds all the groups where there is at least one of those.

Anyway, here goes:
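Something along these lines (a sketch of the idea above; the helper names to_complex, to_angle, mask_val and myagg are illustrative, and the frame is assumed to look like the one in the question, with WindSpeed and WindDir columns):

```python
import numpy as np
import pandas as pd

def to_complex(s):
    # Angles in degrees -> unit vectors in the complex plane
    return np.exp(1j * np.deg2rad(s))

def to_angle(s):
    # Complex numbers -> angles in degrees, wrapped to [0, 360)
    return pd.Series(np.angle(s, deg=True) % 360, index=s.index)

def mask_val(s, grouper, val=990):
    # A group contains `val` iff replacing `val` by NaN lowers its count
    return s.groupby(grouper).count() != s.replace(val, np.nan).groupby(grouper).count()

def myagg(df, grouper, val=990, winddir='WindDir'):
    s = df[winddir]
    variable = mask_val(s, grouper, val)       # groups flagged as variable wind
    valid = s.notna() & (s != val)
    # NaNs and flag values contribute a zero vector, so a plain groupby-sum works
    vectors = to_complex(s).where(valid, 0)
    vsum = vectors.groupby(grouper).sum()
    cnt = valid.groupby(grouper).sum()
    angle = to_angle(vsum).where(cnt > 0)      # NaN for groups with no usable sample
    angle[variable] = val
    # Aggregate the remaining columns with the builtin (fast) mean
    out = df.drop(columns=winddir).groupby(grouper).mean()
    out[winddir] = angle
    return out
```

This way groupby only ever runs its builtin sum, count and mean, so the per-group Python call to the custom function disappears.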
Application:
For convenience, I put your example generation into a function gen_example(N_samples).
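For example, gen_example could simply wrap the data generation code from the question (a sketch), and myagg is then applied with the same hourly grouper:

```python
def gen_example(N_samples):
    # Same random data generation as in the question, wrapped in a function
    N_nan = N_var = int(0.02 * N_samples)
    data = np.random.rand(N_samples, 2) * [30, 360]
    data[np.random.choice(N_samples, N_nan), 1] = np.nan
    data[np.random.choice(N_samples, N_var), 1] = 990
    df = pd.DataFrame(data, columns=['WindSpeed', 'WindDir'])
    df.index = pd.date_range(start='2000-01-01 00:00', periods=N_samples, freq='10min')
    return df

df = gen_example(50)
myagg(df, pd.Grouper(freq='H'))
```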
Speed:
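The repeat harness from the question can be reused to time the vectorized version against the original custom aggregation (a sketch, reusing meandir, myagg and gen_example from above; run it to get numbers for your own machine):

```python
from timeit import repeat

df = gen_example(10_000)
grouper = pd.Grouper(freq='H')

res = repeat(stmt="df.groupby(grouper).agg({'WindSpeed': np.mean, 'WindDir': meandir})",
             globals=globals(), number=1, repeat=10)
print(f'Custom aggregating function {min(res)*1000:.2f} ms')

res = repeat(stmt='myagg(df, grouper)', globals=globals(), number=1, repeat=10)
print(f'Vectorized aggregation {min(res)*1000:.2f} ms')
```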
Testing:
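To check that the vectorized version agrees with meandir, one possible sketch compares the two results, treating directions on the circle and requiring NaN groups to match:

```python
df = gen_example(10_000)
grouper = pd.Grouper(freq='H')

expected = df.groupby(grouper).agg({'WindSpeed': np.mean, 'WindDir': meandir})
result = myagg(df, grouper)

# Wind speeds must match to float precision
assert np.allclose(result['WindSpeed'], expected['WindSpeed'])

# Empty groups must be NaN in both results
assert result['WindDir'].isna().equals(expected['WindDir'].isna())

# Compare directions on the circle, so e.g. 359.999999 and 0.0 count as equal
diff = (result['WindDir'] - expected['WindDir']).abs().dropna() % 360
assert ((diff < 1e-6) | (diff > 360 - 1e-6)).all()
```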