Transforming outliers in Pandas DataFrame using .apply, .applymap, .groupby


I'm attempting to transform a pandas DataFrame object into a new object that contains a classification of the points based on some simple thresholds (sketched as a single per-cell rule after this list):

  • Value transformed to 0 if the point is NaN
  • Value transformed to 1 if the point is negative or 0
  • Value transformed to 2 if it falls outside certain criteria based on the entire column
  • Value is 3 otherwise
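
Put as a single per-cell rule, the mapping I'm after looks roughly like the sketch below (the classify name is just a placeholder; the column-mean threshold, and the choice to let the outlier rule win over the sign rule, both come from my code further down):

import numpy as np

def classify(value, column_mean):
    # Placeholder sketch of the intended per-cell mapping.
    if np.isnan(value):
        return 0    # missing -> 0
    if abs(value) > column_mean:
        return 2    # outlier relative to the whole column -> 2
    if value <= 0:
        return 1    # negative or zero -> 1
    return 3        # everything else -> 3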

Here is a very simple self-contained example:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [np.nan, 1000000, 3, 4, 5, 0, -7, 9, 10], 'b': [2, 3, -4, 5, 6, 1000000, 7, 9, np.nan]})

print(df)

[image: output of print(df) showing the raw values]

The transformation process I've created so far:

#Loop through and find points whose absolute value exceeds the column mean -- in this simple example, these are the 'outliers'
outliers = pd.DataFrame()
for datapoint in df.columns:
    tempser = pd.DataFrame(df[datapoint][np.abs(df[datapoint]) > (df[datapoint].mean())])
    outliers = pd.merge(outliers, tempser, right_index=True, left_index=True, how='outer')

outliers[outliers.isnull() == False] = 2


#Classify everything else as "3"
df[df > 0] = 3

#Classify negative and zero points as a "1"
df[df <= 0] = 1

#Update with the outliers
df.update(outliers)

#Everything else is a "0"
df.fillna(value=0, inplace=True)

Resulting in:

[image: the transformed DataFrame of 0/1/2/3 classifications]

I have tried to use .applymap() and/or .groupby() to speed up the process, with no luck. I found some guidance in this answer; however, I'm still unsure how .groupby() is useful when you're not grouping within a pandas column.

1 Answer

JohnE (accepted answer):

Here's a replacement for the outliers part. It's about 5x faster for your sample data on my computer.

>>> pd.DataFrame( np.where( np.abs(df) > df.mean(), 2, df ), columns=df.columns )

    a   b
0 NaN   2
1   2   3
2   3  -4
3   4   5
4   5   6
5   0   2
6  -7   7
7   9   9
8  10 NaN
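
If you want to fold the remaining rules (0 for NaN, 1 for non-positive, 3 otherwise) into the same vectorized style, here's a sketch with np.select. It keeps the column mean as the outlier threshold and lets the outlier rule override the sign rule, which is what your update() step effectively does; it should reproduce your final frame (as integers rather than floats), though I haven't timed it.

import numpy as np
import pandas as pd

# df is your original frame from the question.
# Conditions are checked in order; the first match wins per cell.
conditions = [
    df.isnull(),              # NaN                        -> 0
    np.abs(df) > df.mean(),   # 'outlier' vs. column mean  -> 2
    df <= 0,                  # negative or zero           -> 1
]
choices = [0, 2, 1]

classified = pd.DataFrame(np.select(conditions, choices, default=3),
                          index=df.index, columns=df.columns)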

You could also do it with apply, which is much simpler, but it will be slower than the np.where approach (roughly the same speed as what you are currently doing). That's a good example of why you should avoid apply when you care about speed.

>>> df[ df.apply( lambda x: abs(x) > x.mean() ) ] = 2

You could also do this, which is faster than apply but slower than np.where:

>>> mask = np.abs(df) > df.mean()
>>> df[mask] = 2

Of course, these things don't always scale linearly, so test them on your real data and see how that compares.
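
One way to do that is a small timeit harness like this sketch (the random 100,000-row frame is just made-up test data; swap in your real DataFrame):

import timeit
import numpy as np
import pandas as pd

# Made-up larger test frame; replace with your real data.
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.normal(size=(100_000, 2)), columns=['a', 'b'])

def with_where(df):
    # np.where approach
    return pd.DataFrame(np.where(np.abs(df) > df.mean(), 2, df), columns=df.columns)

def with_mask(df):
    # boolean-mask assignment
    out = df.copy()
    out[np.abs(out) > out.mean()] = 2
    return out

def with_apply(df):
    # apply-based mask
    out = df.copy()
    out[out.apply(lambda x: abs(x) > x.mean())] = 2
    return out

for fn in (with_where, with_mask, with_apply):
    secs = timeit.timeit(lambda: fn(big), number=20) / 20
    print(f'{fn.__name__}: {secs:.4f} s per run')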