What is a vectorized way to detect feature drift in python/pandas columns?


I'm working with very large pandas dataframes that hold time series with significant feature drift. The drift is often sudden (e.g., the features become 1.5-2.0x larger than they were a few periods earlier).

I found several solutions for detecting 'concept drift'. One convenient option is river. However, it processes observations one at a time, so the solution is not vectorized.

Vectorized approaches are much faster. The simplest example is to use the pandas built-ins to take moving averages and check whether those change or jump, along the lines of df.groupby().mean().rolling().

What are vectorized ways to handle the above task?


There is 1 answer

mudskipper On

One vectorized way to detect differences between successive rows is df[col].diff(). See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.diff.html
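As a minimal sketch of what that looks like (the series and the 40% cutoff are made up for illustration, not taken from the question), diff() gives the absolute step and pct_change() gives the relative step, which may fit better when the drift is multiplicative:

```python
import pandas as pd

# Hypothetical series with a sudden doubling at position 4.
s = pd.Series([10.0, 10.0, 10.0, 10.0, 20.0, 20.0, 20.0])

step = s.diff()        # absolute change from the previous row (NaN for the first row)
rel = s.pct_change()   # relative change from the previous row (NaN for the first row)

# Flag rows that grew by more than 40% in one step (assumed cutoff).
jumped = rel > 0.4
print(jumped[jumped].index.tolist())
```

Both calls operate on the whole column at once, so no Python-level loop is needed.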

If you need to look at this inside known windows, you could perhaps combine this with a rolling average and threshold:

df[col].diff().rolling(window=5).mean() > threshold
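
Here is a rough, self-contained sketch of that expression on synthetic data. The column name "x", the window of 5, the threshold of 1.0, and the simulated level shift are all assumptions chosen for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: level around 10 for 50 rows, then around 18 for 50 rows.
values = np.concatenate([rng.normal(10, 0.2, 50), rng.normal(18, 0.2, 50)])
df = pd.DataFrame({"x": values})

window = 5       # assumed window length
threshold = 1.0  # assumed threshold, in units of average change per row

# Rolling mean of the row-to-row differences, compared against the threshold.
df["drift"] = df["x"].diff().rolling(window=window).mean() > threshold

flagged = df.index[df["drift"]].tolist()
print(flagged)
```

Note that the rolling mean of diff() over a window of n rows equals (x[t] - x[t-n]) / n, so the threshold has to be scaled to the window length. The comparison as written only catches upward jumps; wrapping the rolling mean in .abs() would catch drops as well.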