Identifying statistical outliers with pandas: groupby and individual columns


I'm trying to identify statistical outliers so that I can send them to a spreadsheet. I need to group the rows by a key, compute the stdev of specific columns within each group, and use any row whose value exceeds that stdev to populate the spreadsheet.

import pandas

df = pandas.DataFrame({'Sex': ['M','M','M','F','F','F','F'], 'Age': [33,42,19,64,12,30,32], 'Height': ['163','167','184','164','162','158','160'],})

Using a dataset like this, I would like to group by sex and then find entries that exceed the stdev of either age or height. Most examples I've seen address the stdev of the entire dataset rather than a breakdown by column. There will be additional columns such as state, so I don't need the stdev of every column, just particular ones out of the set.

I'm looking for the output to contain only the rows that are identified as statistical outliers in either of the columns. For instance:

3  F  64  164
2  M  19  184

This assumes that 64 years old exceeds the stdev threshold for women's age and that 184 cm tall exceeds the stdev threshold for men's height.

Alexander (best answer)

First, convert Height from strings to numeric values.

df['Height'] = df['Height'].astype(float)
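If some height strings might not parse as numbers (an assumption; the sample data in the question is clean), astype(float) would raise an error. A minimal sketch of a more forgiving conversion uses pandas.to_numeric with errors='coerce'. The sample series below is hypothetical and only illustrates the behaviour.

```python
import pandas as pd

# Hypothetical sample: the last entry is not a valid number.
heights = pd.Series(['163', '167', 'n/a'])

# errors='coerce' replaces unparseable entries with NaN instead of raising.
numeric_heights = pd.to_numeric(heights, errors='coerce')
```

NaN values are then ignored by mean() and std(), and a comparison against NaN is False, so such rows would not be flagged as outliers.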

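You then need to group on Sex and use transform to compute a z-score for Age and Height within each group. Comparing the absolute z-score with a threshold gives a boolean indicator marking whether each Age or Height value is a statistical outlier within its group.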

stds = 1.0  # Number of standard deviations that defines an 'outlier'.
z = df[['Sex', 'Age', 'Height']].groupby('Sex').transform(
    lambda group: (group - group.mean()).div(group.std()))
outliers = z.abs() > stds
>>> outliers
     Age Height
0  False  False
1  False  False
2   True   True
3   True   True
4   True  False
5  False   True
6  False  False

Now filter for rows that contain any outliers:

>>> df[outliers.any(axis=1)]
   Age  Height Sex
2   19     184   M
3   64     164   F
4   12     162   F
5   30     158   F
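Since the question mentions extra columns such as state that should not take part in the calculation, the steps above can be wrapped in a small helper that takes the columns to check explicitly. This is only a sketch: the function name find_group_outliers, its parameters, and the State values are hypothetical and do not come from the question or the answer.

```python
import pandas as pd

def find_group_outliers(df, group_col, value_cols, n_std=1.0):
    """Return rows where any of value_cols lies more than n_std
    sample standard deviations from the mean of its group."""
    grouped = df.groupby(group_col)[value_cols]
    # z-score of each value relative to its own group (std uses ddof=1).
    z = (df[value_cols] - grouped.transform('mean')) / grouped.transform('std')
    return df[(z.abs() > n_std).any(axis=1)]

df = pd.DataFrame({
    'Sex': ['M', 'M', 'M', 'F', 'F', 'F', 'F'],
    'State': ['NY', 'CA', 'TX', 'NY', 'CA', 'TX', 'NY'],  # hypothetical extra column
    'Age': [33, 42, 19, 64, 12, 30, 32],
    'Height': [163.0, 167.0, 184.0, 164.0, 162.0, 158.0, 160.0],
})

# Only Age and Height are checked; State is carried along untouched.
result = find_group_outliers(df, 'Sex', ['Age', 'Height'], n_std=1.0)
```

The filtered frame keeps every original column, so it can be saved for a spreadsheet with result.to_csv(...) or, if an Excel engine is installed, result.to_excel(...).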

If you only care about the upside of the distribution (i.e. values > mean + stds SDs), then just drop the .abs() and compare the z-scores directly, i.e. outliers = z > stds
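A minimal, self-contained sketch of that one-sided variant, using the sample data from the question with Height already numeric. The names high_outliers and high_rows are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Sex': ['M', 'M', 'M', 'F', 'F', 'F', 'F'],
    'Age': [33, 42, 19, 64, 12, 30, 32],
    'Height': [163.0, 167.0, 184.0, 164.0, 162.0, 158.0, 160.0],
})

stds = 1.0  # Number of standard deviations that defines an 'outlier'.

# Per-group z-scores for the selected columns.
z = df.groupby('Sex')[['Age', 'Height']].transform(
    lambda col: (col - col.mean()) / col.std())

# No .abs(): only values above the group mean can be flagged.
high_outliers = z > stds
high_rows = df[high_outliers.any(axis=1)]
```

With this data, the rows for ages 12 and 30 are no longer selected, because their Age and Height values fall below the group means rather than above them.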