I'm not very experienced with Python, but I want to do some data analytics on a corpus, so I'm doing that part in Python with NLTK.
I want to go through the entire corpus and build a dictionary containing every word that appears in it. I then want to be able to look up a word in this dictionary and see how many times it appeared as each part of speech (tag). For example, searching for 'dog' might return 100 noun tags, 5 verb tags, and so on.
The end goal is to save this structure to an external file (as .txt or similar) and load it in another program to compute the probability of a word having a given tag.
Would I do this with Counter and ngrams?
Since you just want the POS tags of individual words, you don't need ngrams; you need a tagged corpus. Assuming your corpus is already tagged, you can do it like this.
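A minimal sketch of the idea, assuming your corpus exposes (word, tag) pairs the way NLTK's tagged corpora do (e.g. brown.tagged_words()); the sample data below is made up for illustration:

```python
from nltk import ConditionalFreqDist

# Made-up sample data standing in for a real tagged corpus; in practice
# you would use something like nltk.corpus.brown.tagged_words().
tagged_words = [
    ("The", "DT"), ("dog", "NN"), ("barks", "VBZ"),
    ("They", "PRP"), ("dog", "VBP"), ("him", "PRP"),
]

# One frequency distribution per word: condition = the word, sample = its tag.
cfd = ConditionalFreqDist((word, tag) for word, tag in tagged_words)

print(cfd["dog"]["NN"])         # count of 'dog' tagged as a noun
print(cfd["dog"].most_common()) # all tags for 'dog', most frequent first
```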
A ConditionalFreqDist is basically a dictionary of Counter objects, with some extras thrown in; look it up in the NLTK docs.
PS. If you want to case-normalize your words before counting, use word.lower() as the condition when building the distribution.