Analysing a large, highly skewed dataset for outliers


As a preliminary step, I am using the IQR method to detect outliers in my dataset, which is substantially large and skewed: each column has around 200,000 data points. I have a few questions about my methodology:

  1. Should I standardize or normalize the data before applying the IQR method?
  2. For certain columns, the minimum, Q1, median, and Q3 are all the same value. What does this mean for the column concerned? (A small synthetic illustration of this case follows the list.)
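
To make case 2 concrete, here is a minimal synthetic sketch (illustrative numbers only, not my data): whenever at least 75% of a column's values equal its minimum, the lower quartile, median, and upper quartile all collapse onto that value.

import numpy as np

# Hypothetical column where one value dominates: 80% of 200 points are 0
col = np.array([0] * 160 + [1, 2, 3, 4, 5] * 8)
print(np.min(col), np.percentile(col, 25), np.median(col), np.percentile(col, 75))
# -> 0 0.0 0.0 0.0: min = Q1 = median = Q3, so IQR = 0, both fences sit at 0,
#    and every nonzero point gets flagged as an outlier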

As an additional note, I am not able to share a sample of the dataset, as it is confidential. My current implementation is below:

import numpy as np

def detect_outliers_iqr(data):
    # Tukey's fences: points beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are outliers
    data = np.asarray(data)  # no need to sort; np.percentile handles that
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    # Vectorised mask instead of a Python loop over ~200,000 points
    return data[(data < lower_bound) | (data > upper_bound)]
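
For context, this is roughly how I run it, shown here on a synthetic right-skewed stand-in since I cannot share the real column (the lognormal draw and its parameters are placeholders, not my data):

import numpy as np

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=200_000)  # stand-in: skewed, ~200k points per column
outliers = detect_outliers_iqr(sample)
print(f"{outliers.size} of {sample.size} points flagged")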