How to implement a supervised class-based language model in SRILM?

I found tutorials where a class-based LM is built with Brown clustering by passing just the number of classes you want, but I want to implement a class-based model where I supply the class assignments myself up front. I tried this: http://projects.csail.mit.edu/cgi-bin/wiki/view/SLS/SriLM. But it gives -99 to all n-grams in the LM. There is very little documentation on this. Can anyone help me out?

Aaron (best answer)

I've done this before but it was several years ago. Let me see if I can retrace the steps for you.

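The first step is to create the file that specifies the classes. It should have three columns, separated by whitespace, with one line per word: first the class id, then the probability of that word given the class, and lastly the word. For example, a line could look like "CITY 0.5 boston" (the class name and the numbers here are made up for illustration).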
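If you already have your word-to-class assignments and some training text, the class file can be generated rather than written by hand. Here is a minimal sketch in Python, not part of SRILM. The function name build_class_lines, the add-one smoothing, and the example words are my own assumptions; it estimates P(word | class) from how often each member word occurs in the training tokens.

```python
from collections import Counter, defaultdict


def build_class_lines(word_to_class, tokens, add=1.0):
    """Return class-file lines of the form '<class id> <P(word|class)> <word>'.

    Hypothetical helper, not an SRILM function. Probabilities are
    add-`add` smoothed relative frequencies (an assumption), so a class
    member that never occurs in `tokens` still gets a non-zero probability.
    """
    # Count only tokens that have a class assignment.
    counts = Counter(t for t in tokens if t in word_to_class)

    # Group the member words by class.
    members = defaultdict(list)
    for word, cls in word_to_class.items():
        members[cls].append(word)

    lines = []
    for cls in sorted(members):
        words = sorted(members[cls])
        total = sum(counts[w] + add for w in words)
        for w in words:
            prob = (counts[w] + add) / total
            lines.append(f"{cls} {prob:.6g} {w}")
    return lines


if __name__ == "__main__":
    # Made-up assignments and text, for illustration only.
    assignments = {"boston": "CITY", "paris": "CITY", "monday": "DAY"}
    text = "fly to boston on monday then boston again".split()
    print("\n".join(build_class_lines(assignments, text)))
```

Write the returned lines to a text file and use that as the class definition file. The probabilities within each class sum to 1, which is what you want for a word-given-class distribution.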

The next step is to replace all the words in the training data with their class ids. You can use the SRILM replace-words-with-classes script, or you can write your own script to do it.
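If you go the write-your-own route, the replacement is a simple token lookup. This is a rough sketch, assuming the three-column class file described above, one class per word, and single-word class members. The function names are hypothetical, and this is not what the SRILM script does internally.

```python
def load_word_to_class(class_lines):
    """Build a word -> class id mapping from class-file lines.

    Assumes each line is '<class id> <probability> <word>' and that
    every word belongs to exactly one class.
    """
    mapping = {}
    for line in class_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        class_id, word = fields[0], fields[2]
        mapping[word] = class_id
    return mapping


def replace_words_with_class_ids(sentence, word_to_class):
    """Replace every token that has a class with its class id.

    Tokens without a class assignment are left unchanged.
    """
    return " ".join(word_to_class.get(tok, tok) for tok in sentence.split())


if __name__ == "__main__":
    # Made-up class file contents, for illustration only.
    class_lines = ["CITY 0.75 boston", "CITY 0.25 paris", "DAY 1 monday"]
    mapping = load_word_to_class(class_lines)
    print(replace_words_with_class_ids("fly to boston on monday", mapping))
```

Run this over each line of the training text and save the result; that class-mapped file is what goes into training in the next step.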

Now you train a language model on the class-mapped training data using ngram-count, just like you would for a regular non-class n-gram model.

For evaluation, you just specify the language model and also the class file:

ngram -ppl test_data.txt -lm class.lm -classes class_definition_file.txt