I know that the Treebank corpus is already tagged, but unlike the Brown corpus, I can't figure out how to get a dictionary of tags. For instance,
>>> import nltk
>>> from nltk.corpus import brown
>>> wordcounts = nltk.ConditionalFreqDist(brown.tagged_words())
Does this not work on the Treebank corpus?
Quick solution:
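A minimal sketch of one way to do it, assuming NLTK is installed and the `treebank` sample has been fetched with `nltk.download('treebank')`. It reads the parsed sentences and flattens each tree's `pos()` output into a single list of (word, tag) pairs. The helper names `flatten_tagged` and `treebank_wordcounts` are made up for illustration and are not part of NLTK.

```python
from itertools import chain


def flatten_tagged(parsed_sents):
    """Flatten parsed trees into one list of (word, tag) pairs.

    Each item only needs a .pos() method returning (word, tag) tuples,
    which is what nltk.Tree provides.
    """
    return list(chain.from_iterable(tree.pos() for tree in parsed_sents))


def treebank_wordcounts():
    """Build a word -> tag frequency table from the Treebank sample.

    Assumes NLTK and the 'treebank' corpus data are available locally.
    """
    import nltk
    from nltk.corpus import treebank

    tagged = flatten_tagged(treebank.parsed_sents())
    return nltk.ConditionalFreqDist(tagged)
```

Calling `treebank_wordcounts()` should give the same kind of object as the Brown example in the question, so `wordcounts['the']` would be a frequency distribution over the tags seen for that word. `chain.from_iterable` is used here to avoid nesting `list(chain(*...))` calls.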
For more details, see https://en.wikipedia.org/wiki/User:Alvations/NLTK_cheatsheet/CorporaReaders#Penn_Tree_Bank
See also: Is there a way of avoiding so many list(chain(*list_of_list))?
Note that the Penn Treebank sample in NLTK contains only 3,000+ sentences, whereas the Brown corpus has more than 50,000.
To split the sentences up into training and test set:
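Something along these lines should work. It is only a sketch: the 90/10 ratio, the plain head/tail cut with no shuffling, and the names `split_sents` and `treebank_train_test` are my own assumptions, not anything NLTK prescribes.

```python
def split_sents(sents, train_fraction=0.9):
    """Split a sequence of sentences into (train, test) at a fixed cut point."""
    sents = list(sents)
    cut = int(len(sents) * train_fraction)
    return sents[:cut], sents[cut:]


def treebank_train_test(train_fraction=0.9):
    """Split the Treebank sample's parsed sentences.

    Assumes NLTK and the 'treebank' corpus data are available locally.
    """
    from nltk.corpus import treebank

    return split_sents(treebank.parsed_sents(), train_fraction)
```

Because the split is done on whole sentences, no sentence ends up half in the training set and half in the test set. If the order of the corpus matters for your experiment, you may want to shuffle the list before cutting.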
If you're going to use the Brown corpus (which does not contain parsed sentences), you can use tagged_sents() instead.
As @alexis noted, unless you're splitting the corpus at the sentence level, the tagged_words() function also exists in the Penn Treebank API in NLTK:
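A short sketch, again assuming NLTK and the `treebank` data are installed. The first function is the same call as in the question, just pointed at `treebank`. The second, `tag_counts`, is a hypothetical standard-library stand-in (not NLTK's `ConditionalFreqDist`) that shows what the resulting "dictionary of tags" looks like.

```python
from collections import Counter, defaultdict


def treebank_wordcounts():
    """Same call as in the question, applied to the Treebank sample.

    Assumes NLTK and the 'treebank' corpus data are available locally.
    """
    import nltk
    from nltk.corpus import treebank

    return nltk.ConditionalFreqDist(treebank.tagged_words())


def tag_counts(tagged_words):
    """Stdlib stand-in for the idea: map each word to a Counter of its tags."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return counts
```

Either way you end up with a mapping from each word to the tags it was seen with and how often.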