Python NLTK - Making a 'Dictionary' from a Corpus and Saving the Number Tags


I'm not very experienced with Python, but I want to do some data analysis on a corpus, so I'm doing that part with NLTK in Python.

I want to go through the entire corpus and build a dictionary containing every word that appears in it. I then want to be able to look up a word in this dictionary and see how many times it appeared as each part of speech (tag). For example, if I searched for 'dog', I might find 100 noun tags and 5 verb tags, etc.

The final goal is to save this as an external file (.txt or similar) and load it in another program to check the probability of a word having a given tag.

Would I do this with Counter and ngrams?


1 Answer

alexis (best answer)

Since you just want the POS counts of individual words, you don't need ngrams; you need a tagged corpus. Assuming your corpus is already tagged, you can do it like this:

>>> import nltk
>>> from nltk.corpus import brown
>>> wordcounts = nltk.ConditionalFreqDist(brown.tagged_words())
>>> wordcounts["set"].tabulate(10)
VBN   VB   NN  VBD VBN-HL NN-HL 
159   88   86   71    2    2 

A ConditionalFreqDist is basically a dictionary of Counter objects, with some extras thrown in. Look it up in the NLTK docs.
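Since each entry behaves like a Counter, you can also read off counts and relative frequencies directly. This is a minimal sketch of the lookup described in the question; the word "dog" and the NN tag are just examples, and the numbers will depend on your corpus:

fd = wordcounts["dog"]    # FreqDist of tags seen for "dog"
fd["NN"]                  # raw count of "dog" tagged NN
fd.freq("NN")             # relative frequency, i.e. P(NN | "dog")
fd.most_common(3)         # the three most frequent tags for "dog"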

PS. If you want to case-normalize your words before counting, use

wordcounts = nltk.ConditionalFreqDist((w.lower(), t) for w, t in brown.tagged_words())
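To cover the "save it as .txt and load it elsewhere" part of the question, one simple approach (a sketch, not the only way; the filename and tab-separated layout are just assumptions) is to write one word/tag/count triple per line and rebuild a plain dictionary of Counters when loading:

import csv
from collections import Counter, defaultdict

# Save: one "word <TAB> tag <TAB> count" line per entry
with open("tagcounts.txt", "w", encoding="utf8") as f:
    writer = csv.writer(f, delimiter="\t")
    for word in wordcounts:
        for tag, count in wordcounts[word].items():
            writer.writerow([word, tag, count])

# Load (in the other program): rebuild a dict of Counters
loaded = defaultdict(Counter)
with open("tagcounts.txt", encoding="utf8") as f:
    for word, tag, count in csv.reader(f, delimiter="\t"):
        loaded[word][tag] = int(count)

# Probability of "dog" being tagged NN, from the reloaded counts
p = loaded["dog"]["NN"] / sum(loaded["dog"].values()) if loaded["dog"] else 0.0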