CountVectorizer(): 'StreamBackedCorpusView' object has no attribute 'lower'


I am trying to instantiate CountVectorizer() and run it on the NLTK movie reviews corpus, using the following code:

>>> import nltk
>>> import nltk.corpus
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from nltk.corpus import movie_reviews
>>> neg_rev = movie_reviews.fileids('neg')
>>> pos_rev = movie_reviews.fileids('pos')
>>> rev_list = []  # empty list
>>> for rev in neg_rev:
...     rev_list.append(nltk.corpus.movie_reviews.words(rev))
...
>>> for rev_pos in pos_rev:
...     rev_list.append(nltk.corpus.movie_reviews.words(rev_pos))
...
>>> count_vect = CountVectorizer()
>>> X_count_vect = count_vect.fit_transform(rev_list)

I am getting the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-37-00e9047daa67> in <module>()
----> 1 X_count_vect = count_vect.fit_transform(rev_list)

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
    837 
    838         vocabulary, X = self._count_vocab(raw_documents,
--> 839                                           self.fixed_vocabulary_)
    840 
    841         if self.binary:

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
    760         for doc in raw_documents:
    761             feature_counter = {}
--> 762             for feature in analyze(doc):
    763                 try:
    764                     feature_idx = vocabulary[feature]

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(doc)
    239 
    240             return lambda doc: self._word_ngrams(
--> 241                 tokenize(preprocess(self.decode(doc))), stop_words)
    242 
    243         else:

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(x)
    205 
    206         if self.lowercase:
--> 207             return lambda x: strip_accents(x.lower())
    208         else:
    209             return strip_accents

AttributeError: 'StreamBackedCorpusView' object has no attribute 'lower'

nltk.corpus.movie_reviews.words(rev_pos) returns each review already tokenized into words, such as:

['films', 'adapted', 'from', 'comic', 'books', 'have', ...]

Can anyone please tell me what I am doing wrong? I suppose I am missing some step in creating the list (rev_list) of movie reviews.

TIA


There is 1 answer

Jeremy Wicks On

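It looks like each call to .words() is not giving you back a string, or even a plain list of tokens, but rather a StreamBackedCorpusView object. This class lets you retrieve the tokens lazily from the underlying file, but it is not itself a string, so it has no .lower() method. CountVectorizer expects every document to be a string and, with the default lowercase=True, calls .lower() on each one, which is where the AttributeError comes from.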
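One quick way to confirm this is to check which entries of the list are not strings before handing it to CountVectorizer. The helper below is only a sketch; the name non_string_documents is made up for illustration and is not part of NLTK or scikit-learn.

```python
def non_string_documents(documents):
    """Return (index, type name) for every document that is not a str.

    Hypothetical helper for illustration only. CountVectorizer's default
    preprocessing calls .lower() on each document, so any entry reported
    here would trigger the AttributeError.
    """
    return [
        (index, type(doc).__name__)
        for index, doc in enumerate(documents)
        if not isinstance(doc, str)
    ]
```

Calling non_string_documents(rev_list) on the list from the question should report every entry, since each one is a corpus view rather than a string.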

You can, however, retrieve the tokens from the view, for example by iterating over it or passing it to list(). See the link below for more detail on working with StreamBackedCorpusView.

http://nltk.sourceforge.net/corpusview/corpusview.StreamBackedCorpusView-class.html
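As a minimal sketch of a fix, assuming the movie_reviews corpus has already been downloaded (for instance with nltk.download('movie_reviews')), you can give CountVectorizer one string per review. Either join the tokens from each view back into a string, or skip .words() and read the raw text with movie_reviews.raw(). The function names tokens_to_documents, load_review_texts and vectorize are made up for this example.

```python
def tokens_to_documents(token_sequences):
    """Join each sequence of tokens into one space-separated string.

    Works on any iterable of strings, including the corpus views
    returned by movie_reviews.words(fileid).
    """
    return [" ".join(tokens) for tokens in token_sequences]


def load_review_texts():
    """Return the raw text of every negative and positive review.

    Assumes the movie_reviews corpus is installed locally. The import is
    kept inside the function so the helper above works without NLTK.
    """
    from nltk.corpus import movie_reviews

    fileids = movie_reviews.fileids('neg') + movie_reviews.fileids('pos')
    return [movie_reviews.raw(fileid) for fileid in fileids]


def vectorize(documents):
    """Fit a default CountVectorizer on a list of strings.

    Returns the fitted vectorizer and the sparse document-term matrix.
    """
    from sklearn.feature_extraction.text import CountVectorizer

    count_vect = CountVectorizer()
    X_count_vect = count_vect.fit_transform(documents)
    return count_vect, X_count_vect
```

With the list from the question, vectorize(tokens_to_documents(rev_list)) should run without the AttributeError; vectorize(load_review_texts()) is the alternative that never builds the token views at all. Note that CountVectorizer tokenizes the strings again with its own default pattern, which drops punctuation and single-character tokens, so the exact spacing used when joining does not matter much.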