I'm trying to understand how to identify statistical outliers which I will be sending to a spreadsheet. I will need to group the rows by the index and then find the stdev for specific columns and anything that exceeds the stdev would be used to populate a spreadsheet.
df = pandas.DataFrame({'Sex': ['M','M','M','F','F','F','F'], 'Age': [33,42,19,64,12,30,32], 'Height': ['163','167','184','164','162','158','160'],})
Using a dataset like this I would like to group by sex, and then find entries that exceed either the stdev of age or height. Most examples I've seen are addressing the stdev of the entire dataset as opposed to broken down by columns. There will be additional columns such as state, so I don't need the stdev of every column just particular ones out of the set.
I'm looking for the output to contain only the rows that are identified as statistical outliers in either of the columns. For instance:
0 M 64 164
1 M 19 184
This assumes that 64 years old exceeds the stdev threshold for men's age and 184 cm tall exceeds the stdev threshold for men's height.
First, convert your height from strings to numeric values.
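A minimal sketch of that conversion, using the sample frame from the question. The choice of float (rather than int) is my assumption; either works for computing a mean and standard deviation.

```python
import pandas as pd

# Sample data from the question; Height was entered as strings.
df = pd.DataFrame({
    'Sex': ['M', 'M', 'M', 'F', 'F', 'F', 'F'],
    'Age': [33, 42, 19, 64, 12, 30, 32],
    'Height': ['163', '167', '184', '164', '162', '158', '160'],
})

# Convert to a numeric dtype so mean() and std() can be computed.
df['Height'] = df['Height'].astype(float)
```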
You then need to group on Sex and use transform to create a boolean indicator marking whether Age or Height is a statistical outlier within the group. Then filter for the rows that contain any outliers.
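Here is one way this could look end to end. It is a sketch, not the only way to do it: the names stds, cols, z, is_outlier and outlier_rows are mine, and the threshold of 1.0 standard deviation is an assumption chosen because, with only three or four rows per group, no value in this sample gets as far as 2 standard deviations from its group mean. I compute the per-group z-scores inside transform and do the comparison outside it.

```python
import pandas as pd

df = pd.DataFrame({
    'Sex': ['M', 'M', 'M', 'F', 'F', 'F', 'F'],
    'Age': [33, 42, 19, 64, 12, 30, 32],
    'Height': ['163', '167', '184', '164', '162', '158', '160'],
})
df['Height'] = df['Height'].astype(float)

stds = 1.0                # assumed threshold: SDs from the group mean
cols = ['Age', 'Height']  # only the columns you want to test

# Per-group z-scores: (value - group mean) / group standard deviation.
z = df.groupby('Sex')[cols].transform(
    lambda group: (group - group.mean()).div(group.std())
)

# Boolean indicator: True where a value is more than `stds` SDs away
# from its group mean, in either direction.
is_outlier = z.abs() > stds

# Keep rows where at least one of the tested columns is an outlier.
outlier_rows = df[is_outlier.any(axis=1)]
print(outlier_rows)
```

With the sample data and this threshold, the rows at index 2, 3, 4 and 5 are kept. Extra columns such as state are unaffected because only the columns listed in cols are tested.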
If you only care about the upside of the distribution (i.e. values > mean + 2 SDs), then just drop the .abs(), i.e. lambda group: (group - group.mean()).div(group.std()) > stds
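For completeness, a sketch of the upside-only variant on the same sample data. As before, the threshold of 1.0 and the variable names are assumptions for illustration; the only change is that the z-scores are compared directly, without taking absolute values.

```python
import pandas as pd

df = pd.DataFrame({
    'Sex': ['M', 'M', 'M', 'F', 'F', 'F', 'F'],
    'Age': [33, 42, 19, 64, 12, 30, 32],
    'Height': ['163', '167', '184', '164', '162', '158', '160'],
})
df['Height'] = df['Height'].astype(float)

stds = 1.0                # assumed threshold
cols = ['Age', 'Height']

z = df.groupby('Sex')[cols].transform(
    lambda group: (group - group.mean()).div(group.std())
)

# No .abs(): only values above the group mean by more than `stds` SDs count.
high_rows = df[(z > stds).any(axis=1)]
print(high_rows)
```

On this data only the rows at index 2 and 3 remain, since the unusually low values (age 19, age 12, height 158) no longer qualify.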