I am currently working on my own library for English language processing. The real challenge is to work through the abundance of theoretical material and understand how to turn it into production code.
I have made some progress so far: I have implemented an end-of-sentence detector and an Earley parser. However, unless my terminal dictionary includes a specific word, the parser cannot recognize that word and build the chart.
To be more explicit, here is an example from my context-free grammar:
Production[] ppTerminals = { new Production(new Word[] { new Terminal("Preposition"), new NonTerminal("NP") }) };
AddProduction(ppTerminals, "PP"); // Add the production PP -> Preposition NP
...
// Prepositions.SingleWord is a hard-coded list of possible prepositions.
DictionaryBuilder(Prepositions.SingleWord, "Preposition");
As a result, if the Earley parser comes across an unknown two-word preposition such as "up to", it fails to recognize it and cannot build the chart.
So I think I need another stage before the syntactic parser that processes my sentences and forwards the relevant data to the parser. The main idea is to build the dictionary dynamically at the POS tagging stage, so that the Earley parser can then recognize every word.
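To make the idea concrete, here is a rough sketch of such a pre-parser stage, written in Java for illustration only. Every name in it (`PreParserSketch`, `mergeMultiWord`, `buildDictionary`, the small multi-word list) is hypothetical and not part of my library. It assumes the tagger returns one tag per token, merges known two-word prepositions into a single token, and then builds a per-sentence dictionary from the tags.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical pre-parser stage: runs after tokenizing and before the Earley parser.
final class PreParserSketch {

    // Assumed, deliberately tiny list of two-word prepositions.
    static final Set<String> MULTI_WORD_PREPOSITIONS =
            Set.of("up to", "out of", "because of");

    // Joins adjacent tokens that together form a known two-word preposition.
    static List<String> mergeMultiWord(List<String> words) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < words.size()) {
            if (i + 1 < words.size()) {
                String pair = (words.get(i) + " " + words.get(i + 1))
                        .toLowerCase(Locale.ROOT);
                if (MULTI_WORD_PREPOSITIONS.contains(pair)) {
                    out.add(pair); // one token instead of two
                    i += 2;
                    continue;
                }
            }
            out.add(words.get(i));
            i++;
        }
        return out;
    }

    // Builds the per-sentence terminal dictionary (token -> categories)
    // from the tagger output instead of from a hard-coded word list.
    static Map<String, Set<String>> buildDictionary(List<String> tokens,
                                                    List<String> tags) {
        if (tokens.size() != tags.size()) {
            throw new IllegalArgumentException("one tag per token expected");
        }
        Map<String, Set<String>> dictionary = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            dictionary.computeIfAbsent(tokens.get(i), w -> new TreeSet<>())
                      .add(tags.get(i));
        }
        return dictionary;
    }
}
```

With something like this, the parser's scanning step would only need to match a terminal category such as "Preposition" against the tags stored for the current token, so the grammar itself would never mention concrete words.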
I have implemented a tokenizer and a lexer. Their output is an S-expression tree like this:
(sentence
(word BOND)
(word TRADING)
(word REVENUES)
(word AT)
(word GOLDMAN)
(word SACHS)
(word SLID)
(value 40%)
...
(word AND)
(word CURRENCIES)
(word WAS)
(currency $1.16BN)
...
)
I am also familiar with hidden Markov models and the related algorithms, such as the Viterbi algorithm for finding the most likely state sequence and the Baum-Welch algorithm for parameter estimation.
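For reference, this is roughly how I picture the Viterbi step for tagging, as a minimal Java sketch. The class name `ViterbiSketch`, the three-tag set, and all probabilities are made-up toy values (nothing here was estimated with Baum-Welch or taken from a corpus), and the fixed floor for unseen words is a crude assumption standing in for proper smoothing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Toy Viterbi decoder for an HMM POS tagger (illustrative sketch only).
final class ViterbiSketch {

    static final double FLOOR = 1e-6; // assumed probability for unseen events

    static double logP(Map<String, Map<String, Double>> table, String row, String col) {
        return Math.log(table.getOrDefault(row, Map.of()).getOrDefault(col, FLOOR));
    }

    // Returns the most likely tag sequence for the given words.
    static List<String> decode(List<String> words, List<String> tags,
                               Map<String, Double> start,
                               Map<String, Map<String, Double>> trans,
                               Map<String, Map<String, Double>> emit) {
        int n = words.size(), k = tags.size();
        if (n == 0 || k == 0) return new ArrayList<>();
        double[][] score = new double[n][k]; // best log-probability of a path ending in tag j at word i
        int[][] back = new int[n][k];        // previous tag index on that best path
        for (int j = 0; j < k; j++) {
            score[0][j] = Math.log(start.getOrDefault(tags.get(j), FLOOR))
                    + logP(emit, tags.get(j), words.get(0));
        }
        for (int i = 1; i < n; i++) {
            for (int j = 0; j < k; j++) {
                int bestPrev = 0;
                double best = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < k; p++) {
                    double s = score[i - 1][p] + logP(trans, tags.get(p), tags.get(j));
                    if (s > best) { best = s; bestPrev = p; }
                }
                score[i][j] = best + logP(emit, tags.get(j), words.get(i));
                back[i][j] = bestPrev;
            }
        }
        int last = 0; // pick the best final tag, then follow the back pointers
        for (int j = 1; j < k; j++) {
            if (score[n - 1][j] > score[n - 1][last]) last = j;
        }
        List<String> path = new ArrayList<>();
        for (int i = n - 1; i >= 0; i--) {
            path.add(tags.get(last));
            last = back[i][last];
        }
        Collections.reverse(path);
        return path;
    }

    // Toy model with invented numbers, only to exercise decode().
    static List<String> demo() {
        List<String> tags = List.of("NOUN", "VERB", "PREP");
        Map<String, Double> start = Map.of("NOUN", 0.8, "VERB", 0.1, "PREP", 0.1);
        Map<String, Map<String, Double>> trans = Map.of(
                "NOUN", Map.of("NOUN", 0.3, "VERB", 0.5, "PREP", 0.2),
                "VERB", Map.of("NOUN", 0.4, "VERB", 0.1, "PREP", 0.5),
                "PREP", Map.of("NOUN", 0.8, "VERB", 0.1, "PREP", 0.1));
        Map<String, Map<String, Double>> emit = Map.of(
                "NOUN", Map.of("revenues", 0.6, "goldman", 0.4),
                "VERB", Map.of("slid", 1.0),
                "PREP", Map.of("at", 1.0));
        return decode(List.of("revenues", "slid", "at", "goldman"), tags, start, trans, emit);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [NOUN, VERB, PREP, NOUN]
    }
}
```

The resulting tag sequence is what I would like to hand to the parser in place of the hard-coded terminal dictionary.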
Could you please give me some advice on how to link an Earley parser with HMM-based POS tagging? Or, if I am going in the wrong direction, please point out where I went wrong. I am a bit confused at the moment. Thank you!