Average POS-TAG Frequency


I want to take this tagged text (formatted as shown below) and find the average frequency of the POS tag DT per sentence. For example, DT accounts for 1 of the 3 words in sentence 1 and 1 of the 3 words in sentence 2. I then want to add these fractions up and divide by the number of sentences in the text (2 in this case), which gives the average relative frequency of DT per sentence.

 from collections import Counter
 import nltk

 tagged_text = [('A', 'DT'), ('hairy', 'NNS'), ('dog', 'NN')]
 [('The', 'DT'), ('mischevious', 'NNS'), ('elephant', 'NN')]

 for eachSentence in tagged_text:
     Counter(tag for word,tag in tagged)/len(eachsentence.split())

 total = sum(counts.values())

 float(average) = sum(counts.values())/len(tagged_text.sents())
 print(float(average))

The big problem for me is the eachSentence part, which I don't know how to get around (I don't know how to define what it is). I want to be able to apply this code to hundreds of sentences in the same format. I know there are a lot of problems with the code, so if someone could correct them I would be very grateful.

Answer by Igor (best answer)

I'm (also) not really sure what you are after. Perhaps you should try to structure your idea/requirements a bit more (in your head/on paper) before trying to put it into code. Based on your description and code, I can think of two possible figures that you're after, which can be obtained in the following way:

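from collections import defaultdict

# One list of (word, tag) tuples per sentence
tagged_text = [[('A', 'DT'), ('hairy', 'NNS'), ('dog', 'NN')], [('The', 'DT'), ('mischevious', 'NNS'), ('elephant', 'NN')]]

d = defaultdict(int)  # number of occurrences of each tag over all sentences
t = 0                 # total number of tagged words
for sentence in tagged_text:
    for word, tag in sentence:
        d[tag] += 1
        t += 1

for tag in d:
    # Tag count divided by the number of sentences (Python 3 true division).
    # This can exceed 1 if a tag occurs more than once in a sentence.
    print("Likelihood that %s appears in a sentence: %s" % (tag, d[tag] / len(tagged_text)))
    # Tag count divided by the total number of words
    print("Likelihood of %s appearing in complete corpus: %s" % (tag, d[tag] / t))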

Resulting in

Likelihood that NN appears in a sentence: 1.0
Likelihood of NN appearing in complete corpus: 0.3333333333333333
Likelihood that NNS appears in a sentence: 1.0
Likelihood of NNS appearing in complete corpus: 0.3333333333333333
Likelihood that DT appears in a sentence: 1.0
Likelihood of DT appearing in complete corpus: 0.3333333333333333

All three tags appear in both sentences, hence the likelihood of each appearing in a sentence is 1. Each of the three tags appears twice (out of six words in total), hence a likelihood of 1/3 of encountering it in the corpus (disregarding how it is distributed over sentences). But then again, I'm not sure this is what you're after.
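If the figure you want is literally the one described in the question (the fraction of words tagged DT in each sentence, averaged over all sentences), a minimal sketch could look like the following. The function name average_tag_frequency and the choice to skip empty sentences are my own assumptions, not something from the question; it expects the same list-of-sentences structure as above.

```python
def average_tag_frequency(tagged_sentences, target_tag):
    """Mean, over sentences, of the fraction of words carrying target_tag.

    Hypothetical helper: tagged_sentences is a list of sentences, each a
    list of (word, tag) tuples.
    """
    fractions = []
    for sentence in tagged_sentences:
        if not sentence:
            # Assumption: empty sentences are skipped to avoid dividing by zero
            continue
        hits = sum(1 for _word, tag in sentence if tag == target_tag)
        fractions.append(hits / len(sentence))
    # Return 0.0 when there is nothing to average
    return sum(fractions) / len(fractions) if fractions else 0.0


tagged_text = [
    [('A', 'DT'), ('hairy', 'NNS'), ('dog', 'NN')],
    [('The', 'DT'), ('mischevious', 'NNS'), ('elephant', 'NN')],
]

# (1/3 + 1/3) / 2, i.e. roughly 0.333
print(average_tag_frequency(tagged_text, 'DT'))
```

Because each sentence is normalised by its own length before averaging, a short sentence counts as much as a long one. That differs from the corpus-wide figure above, which weights every word equally; the two only coincide when all sentences have the same length, as they do in this example.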