Outlier detection in a probability/frequency distribution


I have the following two-dimensional dataset. Both X and Y are continuous random variables.

Z = (X, Y) = {(1, 7), (2, 15), (3, 24), (4, 25), (5, 29), (6, 32), (7, 34), (8, 35), (9, 27), (10, 39)}

I want to detect outliers with respect to the values of the y variable. The normal range for y is 10-35, so the first and last pairs in the dataset above are outliers and the others are normal pairs. I want to transform the variable z = (x, y) into a probability/frequency distribution in which the outlier values (the first and last pairs) lie more than one standard deviation from the mean. Can anyone help me solve this problem?

PS: I have tried different distances, such as the Euclidean and Mahalanobis distances, but they didn't work.

cjtytler On BEST ANSWER

I'm not exactly sure what your end goal is, but I'm going to assume you store your x and y variables in an n-by-2 matrix, z = [x, y], where x and y are each n-by-1 vectors.

So you are asking for a way to separate out the data points where y is outside the 10-35 range? For that, you can use a logical condition to find the indices where that occurs:

index = z(:,2) <= 35 & z(:,2) >= 10;  % n-by-1 logical vector: 1 where y is in range, 0 otherwise
z_inliers = z(index,:);      % [x,y] matrix containing only the inlier data points
z_outliers = z(~index,:);    % [x,y] matrix containing only the outlier data points
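
As a quick check, the same filter can be run on the dataset from the question. This is only a sketch: it assumes the ten pairs are typed in as a 10-by-2 matrix with x in the first column and y in the second, and it reuses the variable names from the snippet above.

```matlab
% Dataset from the question: column 1 is x, column 2 is y
z = [1 7; 2 15; 3 24; 4 25; 5 29; 6 32; 7 34; 8 35; 9 27; 10 39];

% Logical mask: true where y lies inside the normal range [10, 35]
index = z(:,2) >= 10 & z(:,2) <= 35;

z_inliers  = z(index,:);    % the eight rows with y inside [10, 35]
z_outliers = z(~index,:);   % the rows with y outside [10, 35]

disp(z_outliers)            % displays the pairs (1, 7) and (10, 39)
```

Note that both bounds are inclusive here, so the pair (8, 35) counts as an inlier.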

If you want to do this according to the standard deviation, then instead of 10 and 35 use:

low_range = mean(z(:,2)) - std(z(:,2));
high_range = mean(z(:,2)) + std(z(:,2));
index = z(:,2) <= high_range & z(:,2) >= low_range;

Then you can plot your PDFs, or do whatever else you need, with those points.
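
It may help to see what the one-standard-deviation band does on the question's data. For these ten y values the mean is 26.7 and the sample standard deviation (MATLAB's default, normalized by n-1) is about 9.67, so the band is roughly 17.0 to 36.4 and the pair (2, 15) is flagged along with the first and last pairs. If the goal is for exactly the 10-35 range to sit at one unit from the center, one possible rescaling is sketched in the second half of the block below. That rescaling is an assumption about what the question is after, not something stated in it, and the names center, halfwidth and score are illustrative only.

```matlab
% Dataset from the question: column 1 is x, column 2 is y
z = [1 7; 2 15; 3 24; 4 25; 5 29; 6 32; 7 34; 8 35; 9 27; 10 39];
y = z(:,2);

% Band of one sample standard deviation around the mean of y
mu    = mean(y);            % 26.7
sigma = std(y);             % approx. 9.67 (normalized by n-1)
index = abs(y - mu) <= sigma;
z_outliers = z(~index,:);   % (1, 7), (2, 15) and (10, 39)

% Assumed alternative: rescale y so that 10 and 35 map to -1 and +1
center    = (10 + 35) / 2;  % 22.5
halfwidth = (35 - 10) / 2;  % 12.5
score     = (y - center) / halfwidth;
z_outliers_range = z(abs(score) > 1, :);   % (1, 7) and (10, 39)
```

The first approach is driven entirely by the data, which is why it does not reproduce the 10-35 limits. The second one simply encodes the known limits as a score, so it gives the same split as the fixed-range filter.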