In my project I need to predict word sequences in a sentence. I used OpenNLP sentence detection and tokenization with their pre-trained models, but I also need to classify a sequence of words in a sentence as one token for my domain-specific group, and their chunker does not predict these patterns.

For example, if my group is food items, the chunker should recognize "chicken pizza" as one token.

Can anybody explain how to train their model for our domain?
OpenNLP is open source; a quick poke through the source code shows that they're using a Naive Bayes classifier [source here]. Somewhere in there will be the code they used to train it, which will tell you both how to train it and what type of corpus you need.
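For what it's worth, OpenNLP also documents a training API for its chunker, so you may not have to dig through the internals at all. Below is a minimal sketch assuming the 1.9-style API; the file names (`food-chunker.train`, `food-chunker.bin`) and the custom B-FOOD/I-FOOD tags are my own assumptions, and the training data would need to be in the CoNLL-2000 word/POS/chunk-tag format.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.chunker.ChunkSample;
import opennlp.tools.chunker.ChunkSampleStream;
import opennlp.tools.chunker.ChunkerFactory;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class FoodChunkerTrainer {

    public static void main(String[] args) throws Exception {
        // Training data in CoNLL-2000 style: one "word POS chunk-tag" triple per line,
        // with a blank line between sentences, e.g.
        //   I        PRP  B-NP
        //   ordered  VBD  B-VP
        //   chicken  NN   B-FOOD
        //   pizza    NN   I-FOOD
        ObjectStream<String> lineStream = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("food-chunker.train")),
                StandardCharsets.UTF_8);

        ChunkerModel model;
        try (ObjectStream<ChunkSample> samples = new ChunkSampleStream(lineStream)) {
            // TrainingParameters controls the underlying trainer, iterations and cutoff.
            model = ChunkerME.train("en", samples, TrainingParameters.defaultParams(),
                    new ChunkerFactory());
        }

        // Persist the model so it can be loaded later with new ChunkerModel(inputStream).
        try (OutputStream out = new FileOutputStream("food-chunker.bin")) {
            model.serialize(out);
        }
    }
}
```

Note that at prediction time `ChunkerME.chunk()` takes both the tokens and their POS tags, so you would run the OpenNLP POS tagger on each sentence before chunking.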
Re-training it will not be an afternoon project, though; these things tend to be time-sinks. So depending on what you're doing, it may be a better use of your time to use their classifier as is, even if it is not exactly what you're looking for. I'm not sure exactly what you're trying to do, but it may be possible to use a hack, like co-occurrence scores between your word sequences (i.e. how often "chicken" and "pizza" appear together), as an approximation of what you're hoping to get from a re-trained classifier.
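As a rough illustration of that co-occurrence idea (not OpenNLP code, just a hypothetical sketch): count how often adjacent token pairs appear in your corpus and treat pairs that score above some threshold as a single unit. The scoring function and threshold here are assumptions you would need to tune on your own data.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CooccurrenceScorer {

    private final Map<String, Integer> pairCounts = new HashMap<>();
    private final Map<String, Integer> tokenCounts = new HashMap<>();

    // Feed in tokenized sentences (e.g. the output of the OpenNLP tokenizer).
    public void addSentence(List<String> tokens) {
        for (int i = 0; i < tokens.size(); i++) {
            tokenCounts.merge(tokens.get(i), 1, Integer::sum);
            if (i + 1 < tokens.size()) {
                pairCounts.merge(tokens.get(i) + " " + tokens.get(i + 1), 1, Integer::sum);
            }
        }
    }

    // Simple association score: pair frequency relative to its rarer member.
    // Values close to 1 mean the two words almost always occur together.
    public double score(String first, String second) {
        int pair = pairCounts.getOrDefault(first + " " + second, 0);
        int rarer = Math.min(tokenCounts.getOrDefault(first, 0),
                             tokenCounts.getOrDefault(second, 0));
        return rarer == 0 ? 0.0 : (double) pair / rarer;
    }

    // Merge adjacent tokens whose score clears a (tunable) threshold, e.g. "chicken pizza".
    public boolean isPhrase(String first, String second, double threshold) {
        return score(first, second) >= threshold;
    }
}
```

You would call `addSentence` for every tokenized sentence in your corpus, then check `isPhrase("chicken", "pizza", 0.5)` (threshold chosen arbitrarily) when deciding whether to merge adjacent tokens into one unit.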