I am learning NLTK and have a question about data preprocessing and the MLE model. I am trying to generate words with the MLE model, but whenever I pick n >= 3, the model produces words fine until it reaches a period ('.'); after that, it only outputs end-of-sentence padding.
This is essentially what I am doing:
from nltk import sent_tokenize, word_tokenize
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Lowercase and word-tokenize each sentence of the input text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(MYTEXTINPUT)]

n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)
model = MLE(n)
model.fit(train_data, padded_sents)
model.generate(20)
# Example output:
# blah beep bloop . </s> </s> </s> </s> </s> </s> </s> </s> (continues until 20 tokens are reached)
I suspect the answer lies in the way my n-grams are prepared for the model. Is there a way to format/prepare the data so that trigrams are generated across sentence boundaries, e.g. ('.', '</s>', '<s>'), so that the model will try to start another sentence and output more words?
Or is there another way to avoid the problem described above?
The question here is really: when generating from a language model, when should you stop generating? Once the model samples '</s>', its context consists of nothing but end-of-sentence padding, and the only continuation it has ever observed for such a context is another '</s>', so generation never escapes the padding.
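To see the padding concretely, here is a minimal sketch using pad_both_ends, the helper that padded_everygram_pipeline applies to each sentence (the example sentence is made up):

from nltk.lm.preprocessing import pad_both_ends

# Each sentence is padded independently with n-1 boundary symbols, so the
# training data never contains an '<s>' that follows a '</s>'.
list(pad_both_ends(['blah', 'beep', 'bloop', '.'], n=3))
# ['<s>', '<s>', 'blah', 'beep', 'bloop', '.', '</s>', '</s>']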
A simple idiom for generating is to sample tokens one at a time and stop as soon as the end-of-sentence token '</s>' appears; in code, that can be achieved with something like the snippet below.
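This is a minimal sketch; generate_sent is an illustrative helper, not an NLTK API, and TreebankWordDetokenizer is only used to join the tokens back into a readable string:

from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenize = TreebankWordDetokenizer().detokenize

def generate_sent(model, num_words, random_seed=42):
    # Skip start-of-sentence padding and stop at the first end-of-sentence
    # token instead of letting the model churn out '</s>' forever.
    content = []
    for token in model.generate(num_words, random_seed=random_seed):
        if token == '<s>':
            continue
        if token == '</s>':
            break
        content.append(token)
    return detokenize(content)

generate_sent(model, 20, random_seed=7)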
But there's actually a similar generate() function already in NLTK; see https://github.com/nltk/nltk/blob/develop/nltk/lm/api.py#L182. More details on the implementation are at https://github.com/nltk/nltk/pull/2300 (note: see the hidden comments in the code review).
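If the goal is several sentences rather than one, a hedged option is to restart generation for each sentence; generate_sentences below is a hypothetical wrapper around the generate_sent helper above, not part of NLTK:

def generate_sentences(model, num_sents, max_words=20, random_seed=42):
    # Hypothetical helper: restart generation once per sentence, varying the
    # random seed so the sentences are not all identical.
    return [generate_sent(model, max_words, random_seed=random_seed + i)
            for i in range(num_sents)]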