I am having trouble finding the average frequency of a POS tag per sentence in a fairly large document (10,000 words, separated into paragraphs and punctuated) — for example, how often does an "NNP" appear per sentence?
This is what I came up with. First I tokenized my data (a text file from the state_union corpus), then applied the POS tags as follows:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()
After that, I tried to find the average frequency of the POS tags using, for example:
tagged = [('the', 'DT'), ('elephant', 'NN')]
from collections import Counter
counts = Counter(tag for word,tag in tagged)
print(counts)
I can find the frequency of my target POS tag across the whole text, but I don't know how to find its average frequency per sentence across the whole text. I thought the above would work because I had already sentence-tokenized, but sadly it doesn't. Then I thought of dividing the target POS tag's appearances per sentence by the average sentence length, but I couldn't come up with code for that.
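To make the goal concrete, here is a minimal sketch (on toy, hand-written data standing in for the real nltk.pos_tag output, so the names and sentences are mine) of the kind of per-sentence average I am trying to compute:

```python
from collections import Counter

# Toy stand-in for the real output: one list of (word, tag) pairs per sentence.
tagged_sentences = [
    [("George", "NNP"), ("Bush", "NNP"), ("spoke", "VBD")],
    [("The", "DT"), ("speech", "NN"), ("ended", "VBD")],
]

target = "NNP"

# Count the target tag in each sentence, then average over all sentences.
total = sum(Counter(tag for _, tag in sent)[target] for sent in tagged_sentences)
avg_per_sentence = total / len(tagged_sentences)
print(avg_per_sentence)  # 2 NNPs in sentence 1, 0 in sentence 2 -> 1.0
```

The key difference from my attempt above is that the counting has to stay grouped by sentence rather than being done over one flat list of (word, tag) pairs for the whole text.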