Building an OpenEars-compatible language model


I am doing some development on speech to text and text to speech and I found the OpenEars API very useful.

This CMU SLM-based API works by using a language model to map the speech the iPhone hears to text. So I decided to find a big English language model to feed to the API's speech recognizer engine. But I couldn't work out the format of the VoxForge English data model, or how to use it with OpenEars.

Does anyone know how I can get the .languagemodel and .dic files for English to work with OpenEars?


There are 2 answers

Halle (BEST ANSWER)

Old question, but maybe the answer is still interesting. OpenEars now has built-in language model generation, so one option is for you to create models dynamically in your app as you need them using the LanguageModelGenerator class, which uses the MITLM library and NSScanner to accomplish the same task as the CMU toolkit mentioned above. Processing a corpus with >5000 words on the iPhone is going to take a very long time, but you could always use the Simulator to run it once and get the output out of the documents folder and keep it.

Another option for large vocabulary recognition is explained here:

Creating ARPA language model file with 50,000 words

Having said that, I need to point out as the OpenEars developer that the CMU tool's limit of 5000 words corresponds pretty closely to the maximum vocabulary size that is likely to have decent accuracy and processing speed on the iPhone when using Pocketsphinx. So, the last suggestion would be to either reconceptualize your task so that it doesn't absolutely require large vocabulary recognition (for instance, since OpenEars allows you to switch models on the fly, you may find that you don't need one enormous model but can get by with multiple smaller ones that you swap in for different contexts), or to use a network-based API that can do large vocabulary recognition on a server (or make your own API that uses Sphinx4 on your own server). Good luck!

Tilo

Regarding LM Formats:

AFAIK most Language Models use the ARPA standard for Language Models. Sphinx / CMU language models are compiled into binary format. You'd need the source format to convert a Sphinx LM into another format. Most other Language Models are in text format.
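To give an idea of what the ARPA text format looks like, here is a minimal Python sketch that reads the n-gram counts from the `\data\` header of an ARPA file. The sample model and the function name `arpa_ngram_counts` are made up for illustration; real files are much larger, and this only looks at the header, not the n-gram sections.

```python
import re

# A tiny, hand-written ARPA model (illustrative values only).
SAMPLE_ARPA = """\\data\\
ngram 1=3
ngram 2=2

\\1-grams:
-0.4771 <s> -0.3010
-0.4771 hello -0.3010
-0.4771 </s>

\\2-grams:
-0.3010 <s> hello
-0.3010 hello </s>

\\end\\
"""

def arpa_ngram_counts(text):
    """Return {order: count} from the header of an ARPA-format string."""
    counts = {}
    in_header = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if not in_header:
            continue
        match = re.fullmatch(r"ngram (\d+)=(\d+)", line)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
        elif line.startswith("\\"):
            # The first n-gram section (e.g. \1-grams:) ends the header.
            break
    return counts

if __name__ == "__main__":
    print(arpa_ngram_counts(SAMPLE_ARPA))
```

Each line in an n-gram section is a log10 probability, the word sequence, and (for lower orders) an optional backoff weight.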

I'd recommend using the HTK Speech Recognition Toolkit. Detailed documentation is here: http://htk.eng.cam.ac.uk/ftp/software/htkbook_html.tar.gz

Here's also a description of CMU's SLM Toolkit: http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html

Here's an example of a language model in ARPA format I found on the net: http://www.arborius.net/~jphekman/sphinx/full/index.html

You probably want to create an ARPA LM first, then convert it into any binary format if needed.

In General:

To build a language model, you need lots and lots of training data, so that you can estimate how likely each word in your vocabulary is to come next, given the input observed so far.

You can't "make" a language model just by adding the words you want to recognize - you also need a lot of training data (= typical input you observe when running your speech recognition application).

A language model is not just a word list: it estimates the probability of the next token (word) in the input. To estimate those probabilities, you run a training process that goes over training data (e.g. historical data) and observes word frequencies there.
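As a rough illustration of what "training" means here, the following Python sketch estimates bigram probabilities by plain relative frequency (maximum likelihood, no smoothing). The toy corpus and the function name `train_bigram` are assumptions for the example; real toolkits add smoothing and backoff on top of these counts.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Estimate P(next | previous) by counting adjacent word pairs."""
    context_counts = Counter()
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        # Sentence boundary markers, as conventionally used in n-gram models.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            pair_counts[prev][nxt] += 1
    # Relative frequency: count(prev, next) / count(prev).
    return {
        prev: {w: c / context_counts[prev] for w, c in nexts.items()}
        for prev, nexts in pair_counts.items()
    }

if __name__ == "__main__":
    # Hypothetical training data: typical commands for the application.
    corpus = ["turn left", "turn right", "turn left now"]
    model = train_bigram(corpus)
    print(model["turn"])
```

With so little data most word pairs are never seen and get no probability at all, which is why a large, representative corpus matters.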

For your problem, maybe as a quick solution, just assume all words have the same frequency / probability.

  1. create a dictionary with the words you want to recognize (N words in dictionary)

  2. create a language model which has 1/N as the probability for each word (uni-gram language model)

You can then interpolate that unigram language model (LM) with another LM trained on a bigger corpus, using the HTK toolkit.
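The quick solution above can be sketched in Python as follows. This is only an illustration under some assumptions: it does the interpolation in plain Python rather than with HTK, it mixes two unigram distributions only, and it leaves out the `<s>` and `</s>` entries that a decoder-ready ARPA file typically contains. The function names, the word lists, and the weight `lam` are all hypothetical.

```python
import math

def uniform_unigram(words):
    """Give each of the N distinct words probability 1/N."""
    vocab = sorted(set(words))
    if not vocab:
        raise ValueError("need at least one word")
    return {w: 1.0 / len(vocab) for w in vocab}

def interpolate(p_small, p_big, lam=0.5):
    """Linear interpolation: lam * p_small(w) + (1 - lam) * p_big(w)."""
    vocab = set(p_small) | set(p_big)
    return {
        w: lam * p_small.get(w, 0.0) + (1.0 - lam) * p_big.get(w, 0.0)
        for w in vocab
    }

def to_arpa(unigram):
    """Serialize a unigram distribution as ARPA-style text (log10 probs)."""
    lines = ["\\data\\", f"ngram 1={len(unigram)}", "", "\\1-grams:"]
    for word in sorted(unigram):
        prob = unigram[word]
        if prob <= 0.0:
            raise ValueError(f"non-positive probability for {word!r}")
        lines.append(f"{math.log10(prob):.4f}\t{word}")
    lines += ["", "\\end\\"]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    small = uniform_unigram(["GO", "STOP", "LEFT", "RIGHT"])
    # Stand-in for a unigram LM estimated from a bigger corpus.
    big = {"GO": 0.5, "THE": 0.5}
    print(to_arpa(interpolate(small, big, lam=0.5)))
```

If both inputs sum to 1, the interpolated distribution does too, so the weight `lam` simply controls how much you trust your own word list versus the bigger model.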