Outliers using RPCA


I read about using RPCA to find outliers in time series data. I understand the fundamentals of RPCA and the theory behind it. I found a Python library that does RPCA and got two matrices as output (L and S): a low-rank approximation of the input data and a sparse matrix.

Input data (each row is a day, each of the 10 columns is a feature):

DAY 1 - 100,300,345,126,289,387,278,433,189,153  
DAY 2 - 300,647,245,426,889,987,278,133,295,153  
DAY 3 - 200,747,145,226,489,287,378,1033,295,453

Output obtained:

L  
[[ 125.20560531  292.91525518   92.76132814  141.33797061  282.93586313
   185.71134917  199.48789246   96.04089205  192.11501055  118.68811072]  
 [ 174.72737183  408.77013914  129.45061871  197.24046765  394.84366245
   259.16456278  278.39005349  134.0273274   268.1010231   165.63205458]  
 [ 194.38951303  454.76920678  144.01774873  219.43601655  439.27557808
   288.32845493  309.71739782  149.10947628  298.27053871  184.27069609]]

S  
[[ -25.20560531    0.          252.23867186   -0.            0.
   201.28865083   78.51210754  336.95910795   -0.           34.31188928]  
 [ 125.27262817  238.22986086  115.54938129  228.75953235  494.15633755
   727.83543722   -0.           -0.           26.8989769    -0.        ]  
 [   0.          292.23079322   -0.            0.           49.72442192
    -0.           68.28260218  883.89052372    0.          268.72930391]]

Inference: (My question)

Now, how do I infer which points could be classified as outliers? For example, looking at the data, we could say 1033 looks like an outlier. The corresponding entry in the S matrix is 883.89052372, which is larger than the other entries in S. Could a fixed threshold on the deviations of the S matrix entries from the corresponding original values in the input matrix be used to decide that a point is an outlier? Or am I misunderstanding the concept of RPCA completely? TIA for your help.

Best answer, by O. Gindele

You understood the concept of robust PCA (RPCA) correctly: the sparse matrix S contains the outliers. However, S will often contain many non-zero entries that you would not classify as anomalies yourself. As you suggest, it is therefore a good idea to filter these points.

Applying a fixed threshold to identify relevant outliers may work for one dataset. However, using the same threshold on many datasets can give poor results if the mean and variance of the underlying distribution change.

Ideally, you calculate an anomaly score and then classify the outliers based on that score. A simple method that is often used in outlier detection is to check whether your data point (the potential outlier) lies in the tail of your assumed distribution. For example, if you assume your distribution is Gaussian, you can calculate the Z-score (z):

z = (x-μ)/σ,

where μ is the mean and σ is the standard deviation.

You can then apply a threshold to the calculated Z-score in order to identify an outlier. For example: if z > 3 for a given observation, the data point is an outlier. This means your observation is more than 3 standard deviations above the mean and lies in roughly the upper 0.1% tail of the Gaussian distribution (testing |z| > 3 instead catches deviations in both directions, which covers about 0.3% of the distribution). This approach is more robust to changes in the data than using a threshold on the non-standardized values. Furthermore, tuning the z value at which you classify an outlier is simpler than finding a raw-scale cutoff (one that would isolate 883.89052372 in your case) for each dataset.
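As a rough illustration, here is a minimal NumPy sketch of the Z-score filter applied to the S matrix from the question. It rests on a few assumptions that are not part of the answer above: all entries of S are pooled into a single distribution (with only three rows, per-column statistics would be meaningless), the absolute value of z is used so that large negative entries would also be caught, and the helper name `zscore_outliers` is made up for this example.

```python
import numpy as np

# Sparse component S copied from the question (rows = days, columns = features).
S = np.array([
    [-25.20560531, 0.0, 252.23867186, -0.0, 0.0,
     201.28865083, 78.51210754, 336.95910795, -0.0, 34.31188928],
    [125.27262817, 238.22986086, 115.54938129, 228.75953235, 494.15633755,
     727.83543722, -0.0, -0.0, 26.8989769, -0.0],
    [0.0, 292.23079322, -0.0, 0.0, 49.72442192,
     -0.0, 68.28260218, 883.89052372, 0.0, 268.72930391],
])


def zscore_outliers(S, threshold=3.0):
    """Return the Z-score matrix and the (row, col) indices where |z| > threshold.

    Assumption: every entry of S is treated as a sample from one
    distribution, so a single mean and standard deviation are used.
    """
    mu = S.mean()        # mean over all entries
    sigma = S.std()      # population standard deviation (ddof=0)
    z = (S - mu) / sigma
    rows, cols = np.where(np.abs(z) > threshold)
    return z, [(int(r), int(c)) for r, c in zip(rows, cols)]


z, flagged = zscore_outliers(S)
for r, c in flagged:
    print(f"day {r + 1}, feature {c + 1}: S = {S[r, c]:.2f}, z = {z[r, c]:.2f}")
```

With more days of data, you would probably compute μ and σ per feature (for example with `axis=0`) or over a rolling window rather than over the whole matrix. Note also that the outliers themselves inflate σ, so a robust variant based on the median and the median absolute deviation may be worth trying.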