I'm modeling the punctuation prediction problem as arising from a hidden event model, and am trying to follow the algorithm described in Stolcke's paper "Modeling the Prosody of Hidden Events for Improved Word Recognition".
After estimating an N-gram model, he describes the algorithm for computing the maximum likelihood sequence of events:
By using an N-gram model for P(W,S), and decomposing the prosodic likelihoods as in Equation 4, the joint model P(W,S,F) becomes equivalent to a hidden Markov model (HMM). The HMM states are the (word,event) pairs, while prosodic features form the observations. Transition probabilities are given by the N-gram model; emission probabilities are estimated by the prosodic model described below. Based on this construction, we can carry out the summation over all possible event sequences efficiently with the familiar forward dynamic programming algorithm for HMMs.
I'm confused about how this can be a Markov model with states (word, event): if the underlying model is an N-gram model, it seems to me that the state needs to encode the N-1 previous words in order to have all the information necessary to predict the next state. What's going on here? Thanks!
 
                        
Table 1 describes the possible hidden labels for a sequence, and Section 3.2 describes the semantics of the HMM. So basically, the emissions are the words and punctuation marks, and the hidden labels are the sequence of tags proposed in Table 1.
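
To make the (word, event) state concrete, here is a minimal sketch of the forward summation from the quoted passage. It is my own illustration, not code from the paper, and it rests on several assumptions: every word boundary carries an explicit event token (including a hypothetical `<none>`), the N-gram is a trigram over the interleaved word/event stream, and the observation is a single made-up binary "long pause" feature. Under those assumptions the two-token context is exactly the previous (word, event) pair, and since the words are observed, only the event part of the state is actually uncertain. For a longer N-gram I would expect the state to need more of the event history. All function names and probabilities below are toy placeholders.

```python
from itertools import product

EVENTS = ("<none>", "<period>")  # hypothetical event inventory


def toy_p_word(word, prev_word, prev_event):
    # Toy stand-in for the trigram P(word | prev_word, prev_event).
    starts_sentence = word in {"we", "i"}
    return 0.4 if (prev_event == "<period>") == starts_sentence else 0.1


def toy_p_event(event, prev_event, word):
    # Toy stand-in for the trigram P(event | prev_event, word).
    return 0.2 if event == "<period>" else 0.8


def toy_p_obs(long_pause, event):
    # Toy prosodic likelihood P(feature | event); the feature is an assumption.
    if event == "<period>":
        return 0.7 if long_pause else 0.3
    return 0.2 if long_pause else 0.8


def forward_likelihood(words, feats, p_word, p_event, p_obs):
    """Sum P(W, E, F) over all event sequences E with the forward recursion.

    The state at position i is (words[i], e_i). Words are observed, so alpha
    is indexed only by the event that follows the current word.
    """
    # Initial step, padding the context with "<s>" and "<none>" (an assumption).
    alpha = {
        e: p_word(words[0], "<s>", "<none>")
        * p_event(e, "<none>", words[0])
        * p_obs(feats[0], e)
        for e in EVENTS
    }
    for i in range(1, len(words)):
        new_alpha = {}
        for e in EVENTS:
            total = 0.0
            for prev_e in EVENTS:
                # Transition (words[i-1], prev_e) -> (words[i], e).
                trans = p_word(words[i], words[i - 1], prev_e) * p_event(
                    e, prev_e, words[i]
                )
                total += alpha[prev_e] * trans
            new_alpha[e] = total * p_obs(feats[i], e)
        alpha = new_alpha
    return sum(alpha.values())


def brute_force_likelihood(words, feats, p_word, p_event, p_obs):
    """Same quantity by enumerating every event sequence (exponential cost)."""
    total = 0.0
    for events in product(EVENTS, repeat=len(words)):
        prob = 1.0
        prev_w, prev_e = "<s>", "<none>"
        for w, e, f in zip(words, events, feats):
            prob *= p_word(w, prev_w, prev_e) * p_event(e, prev_e, w) * p_obs(f, e)
            prev_w, prev_e = w, e
        total += prob
    return total


words = ["we", "left", "i", "stayed"]
feats = [False, True, False, True]  # hypothetical "long pause after word" flags
fwd = forward_likelihood(words, feats, toy_p_word, toy_p_event, toy_p_obs)
brute = brute_force_likelihood(words, feats, toy_p_word, toy_p_event, toy_p_obs)
```

The two functions should agree up to floating-point error. The forward version does work proportional to the number of words times the square of the number of events, while the brute-force version enumerates every event sequence.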