As a preliminary step, I am using the IQR method to detect outliers in my dataset, which is large and heavily skewed: each column has around 200,000 data points. I have a few questions about my methodology:
- Should I standardize or normalize the data before applying the IQR method?
- For certain columns, Q1, the minimum, the median, and Q3 are all the same value. What does this mean for the column concerned?
As a side note, I am not able to share a sample of the dataset because it is confidential.
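For context on the second question, here is a synthetic stand-in (not the real data, which I can't share) for a column where almost every entry is a single value, so the minimum, Q1, median, and Q3 all coincide and the IQR collapses to zero:

```python
import numpy as np

# Synthetic stand-in for a confidential column: overwhelmingly one value.
column = np.array([0.0] * 9998 + [5.0, 7.0])

q1, median, q3 = np.percentile(column, [25, 50, 75])
iqr = q3 - q1  # zero, because Q1 == Q3

# With IQR == 0 the Tukey fences collapse onto Q1 == Q3,
# so every value different from the dominant one gets flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = column[(column < lower) | (column > upper)]
```

Here `flagged` contains exactly the two entries that differ from the dominant value, which is why such columns flood the outlier list.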
```python
import numpy as np

def detect_outliers_iqr(data):
    # np.percentile does not require sorted input, so no need to sort first
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    # collect the values that fall outside the Tukey fences
    return [x for x in data if x < lower_bound or x > upper_bound]
```
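For reference, a vectorized sketch of the same fences using a NumPy boolean mask, which avoids a Python-level loop over 200,000 points (tried here on synthetic skewed data, since I can't share the real set):

```python
import numpy as np

def detect_outliers_iqr_vectorized(data):
    """Return the values outside the Tukey fences, without a Python loop."""
    data = np.asarray(data)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    mask = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
    return data[mask]

# Skewed synthetic sample with one planted extreme value.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.lognormal(size=1000), [100.0]])
outliers = detect_outliers_iqr_vectorized(sample)
```

The planted value 100.0 lands well above the upper fence of the lognormal bulk, so it appears in `outliers`.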