Categorical Embeddings in an Unsupervised Setting for Anomaly Detection


Context: I am working on an unsupervised use case. The dataset I have has the following fields: TimeStamp, UserName and eventName. For example, User A has done Event B at Timestamp C.

My objective is anomaly detection, i.e. if User A performs a new event C, tell whether this is an anomaly or not.

My hypothesis is that if I can learn embeddings for events, they will give me a good way to compare the new event C with the events previously performed by User A, and thus tell whether it is an anomaly or not.

Now, eventName is a long-tailed categorical feature for most users (i.e. a few events are performed in very large numbers, while most of the events a user performs happen very infrequently). The number of distinct eventNames is in the range 300-400, whereas a user might on average perform just 10 of these 300-400 events on a day-to-day basis.

Question: I cannot work out how to go about learning embeddings for the events in my sample space.

I would highly appreciate any guidance on how to model this problem.

Do let me know if I missed providing any information that might help.


There is 1 answer

Jon Nordby On

Start simple. Divide the data into suitable time intervals, for example 1 day, and compute basic statistics inside each interval, such as how many events of each event type occurred. Visualize these statistics across users and time to get an idea of the patterns that are in your data. To compute an anomaly score, find a distance function that compares the features of a time period against the typical statistics. A basic starting point might be the Mahalanobis distance. Or try some simple anomaly detection algorithms such as IsolationForest or LocalOutlierFactor.
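A minimal sketch of the steps above, assuming the log is available as (timestamp, user, event) rows. The toy data, the event names, the helper names `daily_counts` and `mahalanobis_score`, and the `ridge` term (added so the covariance stays invertible when some events are rare) are all illustrative assumptions, not something taken from your dataset.

```python
from collections import defaultdict
from datetime import datetime

import numpy as np

# Hypothetical vocabulary; the real one would hold the 300-400 eventNames.
EVENT_NAMES = ["login", "read", "delete"]


def daily_counts(rows, event_names):
    """Return {(user, date): vector of event counts for that day}."""
    index = {name: i for i, name in enumerate(event_names)}
    counts = defaultdict(lambda: np.zeros(len(event_names)))
    for timestamp, user, event in rows:
        day = datetime.fromisoformat(timestamp).date()
        counts[(user, day)][index[event]] += 1
    return dict(counts)


def mahalanobis_score(history, candidate, ridge=1e-3):
    """Mahalanobis distance of one day's counts from a user's past days."""
    mean = history.mean(axis=0)
    # Regularize the covariance: rare events give near-singular matrices.
    cov = np.cov(history, rowvar=False) + ridge * np.eye(history.shape[1])
    diff = candidate - mean
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))


# Toy log for user "A": ten days of mostly logins and reads, one delete.
rows = []
for day in range(1, 11):
    stamp = f"2024-01-{day:02d}T09:00:00"
    rows += [(stamp, "A", "login")] * (4 + day % 3)
    rows += [(stamp, "A", "read")] * (9 + day % 2)
    if day == 5:
        rows.append((stamp, "A", "delete"))

counts = daily_counts(rows, EVENT_NAMES)
days = sorted(day for (user, day) in counts if user == "A")
history = np.array([counts[("A", day)] for day in days])

typical_day = np.array([5.0, 10.0, 0.0])
unusual_day = np.array([5.0, 10.0, 8.0])  # far more deletes than usual
typical_score = mahalanobis_score(history, typical_day)
unusual_score = mahalanobis_score(history, unusual_day)
print(typical_score, unusual_score)
```

The same per-day count matrix could instead be passed to scikit-learn's `IsolationForest` or `LocalOutlierFactor`. With 300-400 event types and a long tail, you would probably want to log-scale the counts or group the rare events first.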

Only after this, consider more advanced approaches, like modelling groupings/sequences of events, or sub-population modelling of users, etc.
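As one possible illustration of modelling sequences of events, here is a sketch of a first-order transition model that scores how surprising the next event is given the previous one. This is only an assumed example of such an approach; the function names, the toy sequence, the smoothing constant `alpha` and the vocabulary size of 300 are all hypothetical choices.

```python
import math
from collections import Counter, defaultdict


def fit_transitions(sequence):
    """Count how often each event directly follows each other event."""
    transitions = defaultdict(Counter)
    for prev, nxt in zip(sequence, sequence[1:]):
        transitions[prev][nxt] += 1
    return transitions


def surprise(transitions, prev, nxt, vocab_size, alpha=1.0):
    """Negative log-probability of `nxt` following `prev`.

    Add-alpha smoothing keeps never-seen transitions at a finite score.
    """
    row = transitions.get(prev, Counter())
    prob = (row[nxt] + alpha) / (sum(row.values()) + alpha * vocab_size)
    return -math.log(prob)


# Toy time-ordered event sequence for a single user.
user_history = ["login", "read", "read", "logout"] * 5
model = fit_transitions(user_history)

VOCAB_SIZE = 300  # assumed number of distinct eventNames
seen = surprise(model, "login", "read", VOCAB_SIZE)
unseen = surprise(model, "login", "delete", VOCAB_SIZE)
print(seen, unseen)  # the unseen transition gets the higher score
```

A model like this needs enough history per user to be meaningful, which is one reason to get the simple per-day statistics working first.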