Python NLTK - Making a 'Dictionary' from a Corpus and Saving the Number Tags


I'm not very experienced with Python, but I want to do some data analysis on a corpus, so I'm doing that part with NLTK in Python.

I want to go through the entire corpus and build a dictionary containing every word that appears in it. I then want to be able to look up a word in this dictionary and see how many times it appeared as each part of speech (tag). For example, if I searched for 'dog', I might find 100 noun tags and 5 verb tags, etc.

The final goal is to save this as an external file (.txt or similar) and load it in another program to check the probability of a word having a given tag.

Would I do this with Counter and ngrams?


1 Answer

alexis (best answer)

Since you just want the POS counts of individual words, you don't need ngrams; you need a tagged corpus. Assuming your corpus is already tagged, you can do it like this:

>>> import nltk
>>> from nltk.corpus import brown
>>> wordcounts = nltk.ConditionalFreqDist(brown.tagged_words())
>>> wordcounts["set"].tabulate(10)
VBN   VB   NN  VBD VBN-HL NN-HL 
159   88   86   71    2    2 

A ConditionalFreqDist is basically a dictionary of Counter objects, with some extras thrown in. Look it up in the NLTK docs.
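Since each entry behaves like a Counter, you can also read off counts and relative frequencies directly. This is a minimal sketch of the lookup described in the question; the word "dog" and the NN tag are just examples, and the numbers will depend on your corpus:

fd = wordcounts["dog"]    # FreqDist of tags seen for "dog"
fd["NN"]                  # raw count of "dog" tagged NN
fd.freq("NN")             # relative frequency, i.e. P(NN | "dog")
fd.most_common(3)         # the three most frequent tags for "dog"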

PS. If you want to case-normalize your words before counting, use

wordcounts = nltk.ConditionalFreqDist((w.lower(), t) for w, t in brown.tagged_words())
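To cover the "save it as .txt and load it elsewhere" part of the question, one simple approach (a sketch, not the only way; the filename and tab-separated layout are just assumptions) is to write one word/tag/count triple per line and rebuild a plain dictionary of Counters when loading:

import csv
from collections import Counter, defaultdict

# Save: one "word <TAB> tag <TAB> count" line per entry
with open("tagcounts.txt", "w", encoding="utf8") as f:
    writer = csv.writer(f, delimiter="\t")
    for word in wordcounts:
        for tag, count in wordcounts[word].items():
            writer.writerow([word, tag, count])

# Load (in the other program): rebuild a dict of Counters
loaded = defaultdict(Counter)
with open("tagcounts.txt", encoding="utf8") as f:
    for word, tag, count in csv.reader(f, delimiter="\t"):
        loaded[word][tag] = int(count)

# Probability of "dog" being tagged NN, from the reloaded counts
p = loaded["dog"]["NN"] / sum(loaded["dog"].values()) if loaded["dog"] else 0.0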