Tag in NLTK movie review corpus

Question

Asked by Amit Naik on 27 August 2017 at 18:46 (270 views)

I have the following code to print the 15 most common tokens in the movie_reviews corpus.

import nltk
import random
from nltk.corpus import movie_reviews

documents = []

for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append((list(movie_reviews.words(fileid)), category))

random.shuffle(documents)

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
print(all_words.most_common(15))

I get the following output:

[(u',', 77717), (u'the', 76529), (u'.', 65876), (u'a', 38106), (u'and', 35576), (u'of', 34123), (u'to', 31937), (u"'", 30585), (u'is', 25195), (u'in', 21822), (u's', 18513), (u'"', 17612), (u'it', 16107), (u'that', 15924), (u'-', 15595)]

Why does the letter 'u' appear before each word in the output? How can I get rid of it?

There are 2 answers

**alexis** · Answer 1 · 2017-08-27T21:13:24+00:00

You're seeing quotes, commas and (on Python 2.7) the u prefix because you are passing a list of pairs to print, which displays the repr of each element rather than its plain text. Printing the strings individually works as expected. For example:

for word, cnt in all_words.most_common(15):
    print word, cnt
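The same idea can be sketched in a form that runs on both Python 2.7 and Python 3. This is only an illustration under a few assumptions: `collections.Counter` stands in for `nltk.FreqDist` (which subclasses it), the `tokens` list is a made-up sample rather than the real corpus, and `format_counts` is a hypothetical helper name.

```python
from __future__ import print_function  # print() behaves the same on 2.7 and 3

from collections import Counter

# Hypothetical sample tokens; the real code would use movie_reviews.words().
tokens = [u"The", u",", u"the", u".", u"a", u"the", u","]

# Counter stands in for nltk.FreqDist here (assumption for a corpus-free demo).
counts = Counter(w.lower() for w in tokens)


def format_counts(pairs):
    """Return one 'word<TAB>count' text line per (word, count) pair."""
    return [u"{0}\t{1}".format(word, cnt) for word, cnt in pairs]


# Each line is a plain string, so no quotes or u prefix are printed.
lines = format_counts(counts.most_common(2))
for line in lines:
    print(line)
```

Building the output lines as strings first, instead of printing the list of tuples, keeps the display independent of how the interpreter chooses to represent each element.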

**Alexander Sosnovshchenko** · Answer 2 · 2017-08-27T18:57:08+00:00

These are just Unicode strings in Python 2.7; the u prefix is how Python 2 displays them and is nothing NLTK-specific.
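A minimal sketch of where the prefix comes from, assuming nothing beyond the standard library: the prefix belongs to the repr of a Unicode string, which is what Python shows for elements inside a tuple or list, not to the string's contents. The variable names below are illustrative only.

```python
from __future__ import print_function  # print() behaves the same on 2.7 and 3

word = u"the"
pair = (word, 76529)

# Containers display the repr of their elements.
shown_in_container = repr(pair)

# Formatting the string itself yields its plain text.
shown_alone = u"{0}".format(word)

print(shown_in_container)  # Python 2.7: (u'the', 76529); Python 3: ('the', 76529)
print(shown_alone)         # the
```

On Python 3 every str is already Unicode, so the prefix does not appear at all, although the quotes still do when a container is printed.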
