Prediction based on large texts using Vowpal Wabbit


I want to use the resolution time in minutes and the client's description of tickets on Zendesk to predict the resolution time of future tickets from their description. I will use only these two values, but the description is a large text. I searched for a way to hash the feature values instead of the feature names in Vowpal Wabbit, but without success. What is the best approach for using large texts as feature values when predicting with Vowpal Wabbit?

Answer by Martin Popel (best answer)

Values of features in Vowpal Wabbit can only be real numbers. If you have a categorical feature with n possible values, you simply represent it as n binary features (so e.g. color=red is the name of a binary feature and its value is 1 by default).
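
A minimal Python sketch of that encoding follows. The helper name `categorical_feature`, the sample attributes, and the choice of "_" as a replacement character are all assumptions made for illustration; they are not part of Vowpal Wabbit.

```python
def categorical_feature(name, value):
    """Build one binary VW feature name such as "color=red".

    A feature written without an explicit value gets the default value 1,
    so the token alone is enough to switch the feature on.
    """
    token = f"{name}={value}"
    # ":" and "|" are special in VW's text format; "_" is an arbitrary
    # replacement chosen for this sketch.
    for ch in (":", "|"):
        token = token.replace(ch, "_")
    # Whitespace would split the token into several features, so join on "_".
    return "_".join(token.split())


print(categorical_feature("color", "red"))           # color=red
print(categorical_feature("priority", "very high"))  # priority=very_high
```

Each distinct value ends up as its own feature name, which is what "n binary features" amounts to in practice.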

If you have a text description, you can use the individual words of the text as features (i.e. feature names). You only need to replace or remove ":", "|" and whitespace characters inside feature names, because they have a special meaning in the input format; all other characters are allowed (including "="). So an example can look like this:

9 |USER avg_time:11 |SUMMARY words:5 sentences:1 |TEXT I have a big problem

So this ticket with the text "I have a big problem" took 9 minutes to resolve, and previous tickets from the same user took 11 minutes on average to resolve. If you have enough training examples, I would recommend adding many more features (any details about the user, more summary features about the text, etc.). The time of day (morning, afternoon, evening) and the day of the week when the ticket was reported may also be good predictors (tickets reported on Friday evening tend to take longer), but maybe you intentionally don't want to model this and want to focus only on the "difficulty" of the ticket, irrespective of the reporting time.
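
As a rough sketch, a line in this format could be produced from a ticket with a few lines of Python. The function names `sanitize` and `to_vw_line`, their parameters, and the crude sentence counter are assumptions for illustration only; the namespaces mirror the example above.

```python
import re


def sanitize(text):
    """Replace the characters that are special in VW's text format."""
    return re.sub(r"[:|]", " ", text)


def to_vw_line(minutes, avg_user_minutes, description):
    """Format one ticket as a VW example: label, then three namespaces."""
    words = sanitize(description).split()
    # Crude sentence count: runs of ".", "!" or "?", at least 1 if any words.
    sentences = max(1, len(re.findall(r"[.!?]+", description))) if words else 0
    return (
        f"{minutes} |USER avg_time:{avg_user_minutes} "
        f"|SUMMARY words:{len(words)} sentences:{sentences} "
        f"|TEXT {' '.join(words)}"
    )


print(to_vw_line(9, 11, "I have a big problem"))
# 9 |USER avg_time:11 |SUMMARY words:5 sentences:1 |TEXT I have a big problem
```

Splitting on whitespace also collapses newlines and tabs in a long description, so the whole ticket stays on one line.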

You can also try using word bigrams as features with --ngram T2, which means that 2-gram features will be created for all namespaces beginning with T (only the TEXT namespace in my example). Maybe the individual words "big" and "problem" are not strong predictors, but the bigram "big problem" may get a high positive weight (indicating it is a good predictor of a long resolution time).
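
To make the idea of word bigrams concrete, here is a purely conceptual Python sketch. Vowpal Wabbit builds and hashes these features internally when --ngram is given, so you do not need to generate them yourself; the function name `word_bigrams` and the "_" joiner are hypothetical.

```python
def word_bigrams(text):
    """Return the adjacent word pairs of a text, joined with "_"."""
    words = text.split()
    # Pair every word with its successor.
    return [f"{a}_{b}" for a, b in zip(words, words[1:])]


print(word_bigrams("I have a big problem"))
# ['I_have', 'have_a', 'a_big', 'big_problem']
```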

"I will use only these two values"

You mean resolution time and text of the ticket, am I right? But the resolution time is the (dependent) variable you want to predict, so this does not count as a feature (aka independent variable). Of course, if you know the identity of the user and have enough training examples for each user, you can include the average time of previous tickets (excluding the current one, of course) of the user as a feature as I tried to show in the example.
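
One way to compute such a per-user average without leaking the current ticket's label is sketched below in Python. It assumes the tickets are available as (user_id, minutes) pairs sorted chronologically; the function name `add_previous_average` and the sample data are hypothetical.

```python
from collections import defaultdict


def add_previous_average(tickets):
    """Yield (user, minutes, avg_prev) for tickets in chronological order.

    avg_prev is the mean resolution time of the user's earlier tickets only,
    or None for the user's first ticket.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user, minutes in tickets:
        avg_prev = totals[user] / counts[user] if counts[user] else None
        yield user, minutes, avg_prev
        # Update the running statistics only after the feature was emitted,
        # so the current ticket never contributes to its own feature.
        totals[user] += minutes
        counts[user] += 1


history = [("alice", 10), ("bob", 30), ("alice", 12), ("alice", 9)]
for row in add_previous_average(history):
    print(row)
# ('alice', 10, None)
# ('bob', 30, None)
# ('alice', 12, 10.0)
# ('alice', 9, 11.0)
```

For a user's first ticket there is no history, so you could simply leave the avg_time feature out of that example.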