How to detect outliers in data using a sliding IQR in Python/pandas?


I've been working on a project where I am trying to detect anomalies and relate them to a certain phenomenon. I know that pandas has built-in functions, i.e. df.rolling(window=frequency).statistic_of_my_choice(), but for some reason I am not getting the desired results. I have calculated the rolling mean, the rolling median, and upper and lower bounds as rolling mean ± 1.6 × rolling std.

But when I plot it, the upper and lower bounds are always above the data. I don't know what's happening here; it doesn't make sense. Please take a look at the figures for a better understanding.

Here's what I am getting:

[Figure: what I am getting]

and here's what I want to achieve:

[Figure: what I want to achieve]

Here's the paper that I am trying to implement: https://www.researchgate.net/publication/374567172_Analysis_of_Ionospheric_Anomalies_before_the_Tonga_Volcanic_Eruption_on_15_January_2022/figures

Here's my code snippet:

def gen_features(df):
    df["ma"] = df.TEC.rolling(window="h").mean()
    df["mstd"] = df.TEC.rolling(window="h").std()
    df["upper"] = df["ma"] + (1.6 * df.mstd)
    df["lower"] = df["ma"] - (1.6 * df.mstd)
    return df

There is 1 answer

Tino D

From the publication:

"Since the solar activity cycle is 27 days, this paper uses 27 days as the sliding window to detect the ionospheric TEC perturbation condition before the volcanic eruption. The upper bound of TEC anomaly is represented as UB = Q2 + 1.5 IQR and the lower bound as LB = Q2 − 1.5 IQR"

Implementing this in pandas:

# no seed for the random generator, so each run gives different data
dataLength = 1000 # number of samples
data = np.random.randint(1, 100, dataLength) # random integers in [1, 100)
outlierPercentage = 1 # percentage of samples turned into outliers
outlierCount = int(dataLength/100 * outlierPercentage) # number of outliers
outlierIdx = np.random.choice(dataLength, outlierCount, replace=False) # random positions for the outliers
data[outlierIdx] = np.random.randint(-300, 300, outlierCount) # replace them with random ints in [-300, 300)
df = pd.DataFrame({'Data': data}) # build the dataframe
winSize = 5 # window size, in samples
# rolling statistics (the first winSize - 1 rows are NaN)
Mean = df["Data"].rolling(window=winSize).mean()
Q1 = df["Data"].rolling(window=winSize).quantile(0.25)
Q3 = df["Data"].rolling(window=winSize).quantile(0.75)
IQR = Q3 - Q1
# upper and lower limits, centred on the rolling mean (the quoted formula uses the median, Q2)
df["UL"] = Mean + 1.5 * IQR
df["LL"] = Mean - 1.5 * IQR
# detect the outliers (index labels of the rows outside the limits)
outliersAboveUL = df[(df['Data'] > df['UL'])].index
outliersBelowLL = df[(df['Data'] < df['LL'])].index

Plotting gives you this:

[Figure: plot]
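The plotting code itself is not part of the answer. A minimal sketch of how such a figure could be drawn is given here; the helper name plot_bounds, the colours, and the small demo frame are assumptions made for illustration, and only the column names Data, UL and LL come from the snippet above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_bounds(frame, col="Data", upper="UL", lower="LL"):
    """Plot a series, its rolling limits, and the points outside them."""
    above = frame[frame[col] > frame[upper]]
    below = frame[frame[col] < frame[lower]]
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(frame.index, frame[col], color="grey", linewidth=0.8, label=col)
    ax.plot(frame.index, frame[upper], color="tab:red", label=upper)
    ax.plot(frame.index, frame[lower], color="tab:blue", label=lower)
    ax.scatter(above.index, above[col], color="tab:red", marker="^",
               zorder=3, label="above " + upper)
    ax.scatter(below.index, below[col], color="tab:blue", marker="v",
               zorder=3, label="below " + lower)
    ax.legend(loc="upper right")
    return fig, ax


# Small demo frame (hypothetical data) with the same column names as above
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Data": rng.integers(1, 100, 200).astype(float)})
demo.loc[50, "Data"] = 300.0  # one injected outlier
roll = demo["Data"].rolling(window=5)
iqr = roll.quantile(0.75) - roll.quantile(0.25)
demo["UL"] = roll.mean() + 1.5 * iqr
demo["LL"] = roll.mean() - 1.5 * iqr

fig, ax = plot_bounds(demo)
```

With the real dataframe, plot_bounds(df) would be called after the UL and LL columns have been computed.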

Imported packages:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Jupyter magic for interactive plots; omit outside a notebook
%matplotlib notebook

As you can see, this is a very basic example. I mainly added the correct calculation of the IQR. If you want a more detailed answer, I would need a sample of your data...
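The quoted formula centres the bounds on Q2, the rolling median, whereas the snippet above centres them on the rolling mean. A minimal sketch of the median-centred variant follows, on synthetic data. The seed, the window of 50 samples, and the three injected outlier positions are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
values = rng.integers(1, 100, size=1000).astype(float)  # integers 1..99
values[[100, 500, 900]] = [400.0, -350.0, 500.0]  # injected outliers (assumed positions)
demo = pd.DataFrame({"Data": values})

win = 50  # hypothetical window length, in samples
roll = demo["Data"].rolling(window=win)
q1 = roll.quantile(0.25)
q2 = roll.median()  # Q2 in the quoted formula
q3 = roll.quantile(0.75)
iqr = q3 - q1

# UB = Q2 + 1.5 IQR, LB = Q2 - 1.5 IQR
demo["UB"] = q2 + 1.5 * iqr
demo["LB"] = q2 - 1.5 * iqr

# Rows whose value falls outside the band; comparisons with NaN bounds
# (the first win - 1 rows) are False, so those rows are never flagged.
flagged = demo.index[(demo["Data"] > demo["UB"]) | (demo["Data"] < demo["LB"])]
```

The median is less sensitive than the mean to the outliers inside the window, so the band moves less when an extreme value enters it.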

V2.0: with data from OP

This is currently what I have with the same approach:

df = pd.read_csv("airaStation.csv", index_col=0, parse_dates=True) # timestamps become the index
winSize = "29D" # time-based window (the paper quotes 27 days)
# rolling statistics over the time window
Mean = df["TEC"].rolling(window=winSize).mean()
Q1 = df["TEC"].rolling(window=winSize).quantile(0.25)
Q3 = df["TEC"].rolling(window=winSize).quantile(0.75)
IQR = Q3 - Q1
# upper and lower limits, centred on the rolling mean
df["UL"] = Mean + 1.5 * IQR
df["LL"] = Mean - 1.5 * IQR
# detect the outliers (timestamps of the rows outside the limits)
outliersAboveUL = df[(df['TEC'] > df['UL'])].index
outliersBelowLL = df[(df['TEC'] < df['LL'])].index

The plot:

[Figure: results]
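For comparison, here is a sketch of the same time-based approach that follows the quoted formula more literally: a 27-day window and bounds centred on the rolling median. The hourly synthetic series only stands in for airaStation.csv, which is not reproduced here. The sampling rate, the injected spike, and the min_periods value of seven days are all assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series standing in for the TEC data (assumption: the real
# file has a DatetimeIndex and a "TEC" column, as in the snippet above).
idx = pd.date_range("2021-11-01", periods=24 * 90, freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(1)
hours = np.arange(len(idx))
tec = 20 + 8 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, len(idx))
tec[1500] += 40  # one injected spike so there is something to detect
ts = pd.DataFrame({"TEC": tec}, index=idx)

# A time-based window defaults to min_periods=1, so the earliest bounds would
# be computed from very few samples; require one week of data here instead.
roll = ts["TEC"].rolling(window="27D", min_periods=24 * 7)
q1 = roll.quantile(0.25)
q2 = roll.median()
q3 = roll.quantile(0.75)
iqr = q3 - q1

ts["UB"] = q2 + 1.5 * iqr
ts["LB"] = q2 - 1.5 * iqr
anomalies = ts[(ts["TEC"] > ts["UB"]) | (ts["TEC"] < ts["LB"])]
```

Rows before the first full week have NaN bounds and are therefore never flagged; whether that warm-up period is acceptable depends on how much data precedes the event of interest.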