How to find the average frequency of a POS-TAG per sentence


I am having trouble finding the average frequency of a POS tag per sentence in a fairly large document (10,000 words, punctuated and separated into paragraphs). For example: how often does "NNP" appear per sentence?

This is what I came up with. First I split my data (a text file from the State of the Union corpus) into sentences, then tokenized each sentence and applied the POS tags as follows:

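import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

# Train a Punkt sentence tokenizer on one speech, then use it to split the other into sentences
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        # Tokenize and POS-tag each sentence separately
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()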

After that I tried to count how often each POS tag occurs, using something like this:

tagged = [('the', 'DT'), ('elephant', 'NN')] 

from collections import Counter
counts = Counter(tag for word,tag in tagged)
print(counts)

I can find the overall frequency of my target POS tag in the text, but I don't know how to find its average frequency per sentence across the whole text. I thought the above would work because I had already split the text into sentences, but sadly it does not. Then I thought of dividing the average number of target-tag occurrences per sentence by the average sentence length, but I couldn't come up with code for that.
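
One possible way to do this, sketched under the assumption that the tagged output of every sentence is kept in a list (one list of (word, tag) pairs per sentence) rather than only printed: count the target tag in each sentence, then divide the total by the number of sentences. The helper names average_tag_per_sentence and tag_rate_per_token, and the hand-tagged sample data, are hypothetical and used only for illustration.

```python
from collections import Counter


def average_tag_per_sentence(tagged_sentences, target_tag):
    """Mean number of times target_tag occurs per sentence."""
    if not tagged_sentences:
        return 0.0
    # One count per sentence; a Counter returns 0 for tags that never occur
    per_sentence = [
        Counter(tag for _word, tag in sentence)[target_tag]
        for sentence in tagged_sentences
    ]
    return sum(per_sentence) / len(per_sentence)


def tag_rate_per_token(tagged_sentences, target_tag):
    """Average tag count per sentence divided by average sentence length."""
    if not tagged_sentences:
        return 0.0
    avg_length = sum(len(s) for s in tagged_sentences) / len(tagged_sentences)
    if avg_length == 0:
        return 0.0
    return average_tag_per_sentence(tagged_sentences, target_tag) / avg_length


# Hypothetical hand-tagged sample standing in for real tagger output
tagged_sentences = [
    [("the", "DT"), ("elephant", "NN")],
    [("Bush", "NNP"), ("visited", "VBD"), ("Texas", "NNP")],
    [("Congress", "NNP"), ("met", "VBD")],
]

avg_nnp = average_tag_per_sentence(tagged_sentences, "NNP")
rate_nnp = tag_rate_per_token(tagged_sentences, "NNP")
print(avg_nnp, rate_nnp)
```

With the corpus code from the question, tagged_sentences could be built as [nltk.pos_tag(nltk.word_tokenize(s)) for s in tokenized]. The second function corresponds to the idea of dividing by the average sentence length, which works out to the share of all tokens that carry the target tag.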
