Training an anomaly detection model on large datasets and choosing the correct model


We are trying to build an anomaly detection model for application logs.

The preprocessing is already completed where we have built our own word2vec model which was trained on application log entries.

We now have a training matrix of 1.5M rows × 100 columns, where each row is the vectorized representation of a log entry (each vector has length 100, hence 100 columns).
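For context, a minimal sketch of how one log entry could be mapped to a 100-dimensional vector by averaging its word2vec embeddings (assuming a gensim Word2Vec model and a hypothetical model path; the actual preprocessing pipeline may differ):

```python
# Minimal sketch (assumes a gensim Word2Vec model trained on tokenized log lines;
# the model path and tokenization are illustrative).
import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load("log_word2vec.model")  # hypothetical path, vector_size=100

def vectorize_log_line(line, model):
    """Average the word vectors of all in-vocabulary tokens in a log line."""
    tokens = line.lower().split()
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    if not vecs:
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)

# X has shape (n_log_lines, 100), matching the training matrix described above.
X = np.vstack([vectorize_log_line(line, model) for line in open("app.log")])
```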

The problem is that most of the anomaly detection algorithms (LOF, SOS, SOD, SVM) do not scale to this amount of data. We reduced the training size to 500K, but these algorithms still hang. SVM, which performed best on the POC sample data, does not have an n_jobs option to run it on multiple cores.

Some algorithms do finish, such as Isolation Forest (with a low n_estimators), histogram-based detection, and clustering, but these are not able to detect the anomalies we purposely put in the training data.
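For illustration, this is roughly the kind of Isolation Forest setup we are trying; the parameter values and file name are illustrative, not our exact code (sklearn's IsolationForest does accept n_jobs and per-tree sub-sampling):

```python
# Sketch of the Isolation Forest setup (illustrative parameters and file name).
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.load("log_vectors.npy")          # hypothetical file: shape (1_500_000, 100)

clf = IsolationForest(
    n_estimators=100,                    # kept low-ish to keep training tractable
    max_samples=256,                     # each tree sees a sub-sample, not all 1.5M rows
    contamination=0.003,                 # the contamination ratio we use
    n_jobs=-1,                           # use all cores
    random_state=42,
)
clf.fit(X)
labels = clf.predict(X)                  # -1 = anomaly, 1 = normal
print("flagged as anomalies:", np.sum(labels == -1))
```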

Does anyone have an idea of how to run anomaly detection algorithms on large datasets?

We could not find any option for batch training in the standard anomaly detection techniques. Should we look into neural nets (autoencoders)?

Selecting Best Model:

Given this is unsupervised learning, the approach we are taking for selecting a model is the following:

In the log-entry training data, insert an entry from a novel (say, Lord of the Rings). The vector representation of this log entry would be different from the rest of the log entries.

While running the dataset through various anomaly detection algorithms, see which ones are able to detect the entry from the novel (which is an anomaly).

This approach worked when we ran anomaly detection on a very small dataset (1000 entries) where the log files were vectorized using the Google-provided word2vec model.

Is this approach sound? We are open to other ideas as well. Given that it is an unsupervised learning problem, we had to insert an anomalous entry and see which model was able to identify it.

The contamination ratio used is 0.003.
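To make the selection procedure concrete, a minimal sketch of the check we run, reusing the vectorize_log_line helper and word2vec model sketched earlier (the injected sentence, file name, and detector choice are illustrative):

```python
# Sketch of the model-selection check: inject a vector for an out-of-domain
# sentence and see whether the detector flags it (names are illustrative).
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.load("log_vectors.npy")                      # hypothetical: (N, 100) log vectors

# Vector for a sentence from the novel, produced by the same word2vec pipeline.
novel_vec = vectorize_log_line(
    "Three Rings for the Elven-kings under the sky", model
)
X_with_anomaly = np.vstack([X, novel_vec])

clf = IsolationForest(contamination=0.003, n_jobs=-1, random_state=0)
labels = clf.fit_predict(X_with_anomaly)            # -1 = anomaly

# The model "passes" if the injected row (the last one) is labelled -1.
print("novel sentence detected as anomaly:", labels[-1] == -1)
```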


1 Answer

Accepted answer by Pankaj Mishra

From your explanation, it seems that you are dealing with a novelty detection problem. Novelty detection is usually treated as a semi-supervised problem (exceptions and approaches vary).

The problem with the huge matrix size can be solved if you use batch (out-of-core) processing. This can help you: https://scikit-learn.org/0.15/modules/scaling_strategies.html
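As one concrete illustration of that strategy, here is a minimal out-of-core sketch using an estimator that supports partial_fit (sklearn's SGDOneClassSVM, available in scikit-learn >= 1.0; the batching scheme and file name are illustrative):

```python
# Out-of-core sketch: feed the 1.5M x 100 matrix to a linear one-class SVM
# in batches via partial_fit (SGDOneClassSVM requires scikit-learn >= 1.0).
import numpy as np
from sklearn.linear_model import SGDOneClassSVM

X = np.load("log_vectors.npy", mmap_mode="r")     # memory-map so batches are read lazily
clf = SGDOneClassSVM(nu=0.003, random_state=42)

batch_size = 10_000
for start in range(0, X.shape[0], batch_size):
    batch = np.asarray(X[start:start + batch_size])
    clf.partial_fit(batch)                        # incremental training, batch by batch

labels = clf.predict(np.asarray(X[:batch_size]))  # -1 = anomaly, 1 = normal (score in batches too)
```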

Finally, yes: if you can use deep learning, your problem can be solved in a much better way using either unsupervised or semi-supervised learning (I recommend the semi-supervised approach).
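For example, a reconstruction-error autoencoder is a common deep-learning approach here; a minimal Keras sketch (architecture, training settings, and threshold are illustrative, not tuned):

```python
# Minimal autoencoder sketch for anomaly detection on the 100-dim log vectors
# (Keras; architecture, epochs, and threshold are illustrative).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.load("log_vectors.npy").astype("float32")    # hypothetical: (N, 100)

autoencoder = keras.Sequential([
    layers.Input(shape=(100,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(8, activation="relu"),              # bottleneck
    layers.Dense(32, activation="relu"),
    layers.Dense(100, activation="linear"),
])
autoencoder.compile(optimizer="adam", loss="mse")

# Mini-batch training also addresses the batch-processing concern from the question.
autoencoder.fit(X, X, epochs=10, batch_size=1024, shuffle=True)

# Score each row by reconstruction error; high error = likely anomaly.
reconstruction = autoencoder.predict(X, batch_size=1024)
errors = np.mean((X - reconstruction) ** 2, axis=1)
threshold = np.quantile(errors, 1 - 0.003)           # match the 0.003 contamination ratio
anomalies = np.where(errors > threshold)[0]
print("flagged rows:", anomalies[:10])
```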