As a preliminary step, I am using the IQR method to detect outliers in my dataset, which is large and heavily skewed: each column has around 200,000 data points. I have a few questions about my methodology:
- Should I standardize or normalize the data before applying the IQR method?
- For certain columns, Q1, the minimum, the median, and Q3 are all the same value. What does this mean for the column concerned?
As a side note, I am not able to share a sample of the dataset because it is confidential.
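For context on the second question, here is a synthetic stand-in (not the real data, which I can't share) for a column where almost every entry is a single value, so the minimum, Q1, median, and Q3 all coincide and the IQR collapses to zero:

```python
import numpy as np

# Synthetic stand-in for a confidential column: overwhelmingly one value.
column = np.array([0.0] * 9998 + [5.0, 7.0])

q1, median, q3 = np.percentile(column, [25, 50, 75])
iqr = q3 - q1  # zero, because Q1 == Q3

# With IQR == 0 the Tukey fences collapse onto Q1 == Q3,
# so every value different from the dominant one gets flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = column[(column < lower) | (column > upper)]
```

Here `flagged` contains exactly the two entries that differ from the dominant value, which is why such columns flood the outlier list.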
```python
import numpy as np

def detect_outliers_iqr(data):
    # np.percentile does not require sorted input, so no need to sort first
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    # collect the values that fall outside the Tukey fences
    return [x for x in data if x < lower_bound or x > upper_bound]
```
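For reference, a vectorized sketch of the same fences using a NumPy boolean mask, which avoids a Python-level loop over 200,000 points (tried here on synthetic skewed data, since I can't share the real set):

```python
import numpy as np

def detect_outliers_iqr_vectorized(data):
    """Return the values outside the Tukey fences, without a Python loop."""
    data = np.asarray(data)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    mask = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
    return data[mask]

# Skewed synthetic sample with one planted extreme value.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.lognormal(size=1000), [100.0]])
outliers = detect_outliers_iqr_vectorized(sample)
```

The planted value 100.0 lands well above the upper fence of the lognormal bulk, so it appears in `outliers`.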