Create Dictionary from Penn Treebank Corpus sample from NLTK?

Question

1.6k views Asked by Nate Cook3 At 16 June 2015 at 20:43

I know that the Treebank corpus is already tagged, but unlike the Brown corpus, I can't figure out how to get a dictionary of tags. For instance,

>>> import nltk
>>> from nltk.corpus import brown
>>> wordcounts = nltk.ConditionalFreqDist(brown.tagged_words())

This doesn't work on the Treebank corpus?

There is 1 answer

**alvas** · Accepted Answer · 2015-06-16T21:44:18+00:00

Quick solution:

>>> from nltk.corpus import treebank
>>> from nltk import ConditionalFreqDist as cfd
>>> from itertools import chain
>>> treebank_tagged_words = list(chain(*list(chain(*[[tree.pos() for tree in treebank.parsed_sents(pf)] for pf in treebank.fileids()]))))
>>> wordcounts = cfd(treebank_tagged_words)
>>> treebank_tagged_words[0]
(u'Pierre', u'NNP')
>>> wordcounts[u'Pierre']
FreqDist({u'NNP': 1})
>>> treebank_tagged_words[100]
(u'asbestos', u'NN')
>>> wordcounts[u'asbestos']
FreqDist({u'NN': 11})

For more details, see https://en.wikipedia.org/wiki/User:Alvations/NLTK_cheatsheet/CorporaReaders#Penn_Tree_Bank

See also: Is there a way of avoiding so many list(chain(*list_of_list))?
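
On the nested list(chain(*...)) calls: one way to cut them down is itertools.chain.from_iterable, which flattens one level without unpacking the outer list. A minimal sketch on made-up data (toy_files is a hypothetical stand-in for the per-file lists of tree.pos() output, not real corpus content):

```python
from itertools import chain

# Toy stand-in for the corpus: files -> sentences -> (word, tag) pairs.
# With NLTK, each inner list would come from tree.pos().
toy_files = [
    [[("Pierre", "NNP"), ("Vinken", "NNP")], [("asbestos", "NN")]],
    [[("The", "DT"), ("asbestos", "NN"), (".", ".")]],
]

# Flatten one level: files -> flat list of sentences.
tagged_sents = list(chain.from_iterable(toy_files))

# Flatten a second level: sentences -> flat list of (word, tag) pairs.
tagged_words = list(chain.from_iterable(tagged_sents))

print(len(tagged_sents))   # 3
print(tagged_words[0])     # ('Pierre', 'NNP')
```

The same two-step flattening should apply to the Treebank lists above, since they have the same files -> sentences -> pairs shape.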

Note that the Penn Treebank sample in NLTK contains only 3,000+ sentences, whereas the Brown corpus has more than 50,000.

To split the sentences into training and test sets:

from nltk.corpus import treebank
from nltk import ConditionalFreqDist as cfd
from itertools import chain

# One list of (word, tag) pairs per sentence, across all files in the sample.
treebank_tagged_sents = list(chain(*[[tree.pos() for tree in treebank.parsed_sents(pf)] for pf in treebank.fileids()]))

# Use the first 90% of the sentences for training and the rest for testing.
total_len = len(treebank_tagged_sents)
train_len = int(90 * total_len / 100)

train_set = treebank_tagged_sents[:train_len]
print(len(train_set))
train_treebank_tagged_words = cfd(chain(*train_set))

test_set = treebank_tagged_sents[train_len:]
print(len(test_set))
test_treebank_tagged_words = cfd(chain(*test_set))
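
If you need the same split for more than one corpus, the slicing can be wrapped in a small helper. This is only a sketch: split_tagged_sents and toy_sents are hypothetical names, the data is made up, and it assumes a plain positional split (no shuffling) is what you want:

```python
def split_tagged_sents(tagged_sents, train_percent=90):
    """Split a list of tagged sentences into (train, test) by position."""
    # Integer arithmetic avoids floating-point rounding surprises.
    train_len = len(tagged_sents) * train_percent // 100
    return tagged_sents[:train_len], tagged_sents[train_len:]

# Toy data: ten one-word "sentences" of (word, tag) pairs.
toy_sents = [[("word%d" % i, "NN")] for i in range(10)]

train_set, test_set = split_tagged_sents(toy_sents)
print(len(train_set))  # 9
print(len(test_set))   # 1
```

Because the split is positional, the test set is always the tail of the corpus; shuffle the sentence list first if that matters for your use case.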

If you're going to use the Brown corpus (which does not contain parsed sentences), you can use tagged_sents():

>>> from nltk.corpus import brown
>>> from nltk import ConditionalFreqDist as cfd
>>> from itertools import chain
>>> brown_tagged_sents = brown.tagged_sents()
>>> len(brown_tagged_sents)
57340
>>> brown_tagged_sents[0]
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')]
>>> total_len = len(brown_tagged_sents)
>>> train_len = int(90 * total_len/100)
>>> train_set = brown_tagged_sents[:train_len]
>>> train_brown_tagged_words = cfd(chain(*train_set))
>>> train_brown_tagged_words['asbestos']
FreqDist({u'NN': 1})

As @alexis noted, unless you're splitting the corpus at sentence level, you can call tagged_words() directly, since that function also exists in the Penn Treebank API in NLTK:

>>> from nltk.corpus import treebank
>>> from nltk.corpus import brown

>>> treebank.tagged_words()
[(u'Pierre', u'NNP'), (u'Vinken', u'NNP'), ...]
>>> brown.tagged_words()
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), ...]

>>> type(treebank.tagged_words())
<class 'nltk.corpus.reader.util.ConcatenatedCorpusView'>
>>> type(brown.tagged_words())
<class 'nltk.corpus.reader.util.ConcatenatedCorpusView'>

>>> from nltk import ConditionalFreqDist as cfd
>>> cfd(brown.tagged_words())
<ConditionalFreqDist with 56057 conditions>
>>> cfd(treebank.tagged_words())
<ConditionalFreqDist with 12408 conditions>
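
If what you want in the end is a plain dictionary of tags, the same word -> tag-count mapping can be sketched with the standard library alone. This is roughly the idea behind ConditionalFreqDist, not its actual implementation; tag_counts_by_word and the toy pairs are hypothetical, and in practice you would pass in something like treebank.tagged_words():

```python
from collections import Counter, defaultdict

def tag_counts_by_word(tagged_words):
    """Map each word to a Counter of the tags it was seen with."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return counts

# Made-up (word, tag) pairs standing in for corpus output.
toy_tagged_words = [
    ("the", "DT"), ("run", "NN"), ("run", "VB"), ("run", "VB"), ("the", "DT"),
]

counts = tag_counts_by_word(toy_tagged_words)

# Collapse to a simple word -> most frequent tag dictionary.
most_likely_tag = {
    word: tags.most_common(1)[0][0] for word, tags in counts.items()
}

print(counts["run"])           # Counter({'VB': 2, 'NN': 1})
print(most_likely_tag["run"])  # VB
```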
