Scikit Text Classification – Bad input shape error


I have modified this tutorial (http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) to build a text classifier on the Reuters Corpus. However, I get a bad input shape error:

EDIT: Thanks to the help of @Vivek Kumar, I have solved the Bad input shape issue. However, now I get an AttributeError: lower not found. After some research I think that it might have something to do with the Reuters corpus not having the correct form. Is there any way I can fix this?

This is my code:

from sklearn.datasets import fetch_rcv1 #import reuters corpus
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

rcv1 = fetch_rcv1()


reuters_train = fetch_rcv1(subset='train', shuffle=True, random_state=42)
reuters_train.target_names

count_vect = CountVectorizer()

train_counts = count_vect.fit_transform(reuters_train.data)
train_counts.shape
count_vect.vocabulary_.get(u'algorithm')

tf_transformer = TfidfTransformer(use_idf=False).fit(train_counts)
train_tf = tf_transformer.transform(train_counts)
train_tf.shape
tfidf_transformer = TfidfTransformer()
train_tfidf = tfidf_transformer.fit_transform(train_counts)
train_tfidf.shape

clf = MultinomialNB().fit(train_tfidf, reuters_train.target)

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),])

text_clf.fit(reuters_train.data, reuters_train.target)
# Output shown in the tutorial: Pipeline(...)

import numpy as np

reuters_testset = fetch_rcv1(subset='test', shuffle=True, random_state=42)

reuters_test = reuters_testset.data

predicted = text_clf.predict(reuters_test)

np.mean(predicted == reuters_testset.target)  # the targets live on the dataset object, not on .data

I'm a real beginner at programming and NLP, so I really don't know very much about all of that stuff (yet). Thanks for any advice and help!

Answer by Vivek Kumar (best answer):

That's because you are not passing the actual data to the CountVectorizer. You are passing reuters_train, whereas you should be passing reuters_train.data.

Change:

train_counts = count_vect.fit_transform(reuters_train)

to:

train_counts = count_vect.fit_transform(reuters_train.data)

Also, CountVectorizer followed by TfidfTransformer is equivalent to TfidfVectorizer, so I would recommend using that in place of the two objects.
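As a quick check of that equivalence, here is a minimal sketch on a made-up three-document corpus (the sentences are purely illustrative and not taken from Reuters). It assumes default settings on all three objects; if you customize one route, the other needs the same options.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus, invented for illustration only.
docs = [
    "oil prices rise on supply fears",
    "central bank holds interest rates",
    "oil supply cut lifts prices",
]

# Two-step route: raw token counts, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route with the same default settings.
one_step = TfidfVectorizer().fit_transform(docs)

# Both routes should produce the same matrix.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

Note that either route only makes sense when the input is raw text (a list of strings).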

On further reading of the description of the RCV1 dataset, it states that .data contains:

Non-zero values contains cosine-normalized, log TF-IDF vectors.

So there is no need to apply CountVectorizer and TfidfTransformer to the data; you can use it directly, like this:

clf = MultinomialNB().fit(reuters_train.data, reuters_train.target)
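This is also the likely source of the "lower not found" AttributeError from the question: CountVectorizer expects strings and calls .lower() on each document, but here each "document" is a row of a sparse matrix. A small sketch that reproduces it, using a tiny hand-made sparse matrix as a stand-in for reuters_train.data (the values are invented, and the exact error message may differ between SciPy versions):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in for reuters_train.data: a small sparse matrix of non-negative weights.
X = csr_matrix(np.array([[0.0, 0.6, 0.8],
                         [1.0, 0.0, 0.0]]))

error_name = None
try:
    # The vectorizer treats each sparse row as a text document
    # and tries to lowercase it, which a matrix row cannot do.
    CountVectorizer().fit_transform(X)
except AttributeError as exc:
    error_name = type(exc).__name__
    print(error_name, "-", exc)
```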

However, fitting MultinomialNB directly on reuters_train.data and reuters_train.target raises another error, this time because of the shape of the target. MultinomialNB().fit() only works with one-dimensional targets (binary or multi-class), not with multi-label or multi-output targets, and the RCV1 target is a two-dimensional multi-label matrix.
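To illustrate the target-shape problem without downloading RCV1, here is a small sketch with invented data: the same features fit fine against a one-dimensional target, but raise a ValueError against a two-dimensional label matrix. The arrays are hypothetical and only mimic the layout (non-negative features, one column per label).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
X = rng.rand(6, 4)  # non-negative features, like TF-IDF weights

# Three binary labels per sample (multi-label layout, values made up).
y_multilabel = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
    [0, 1, 1],
])
y_single = y_multilabel[:, 0]  # one label column: a plain 1-d target

MultinomialNB().fit(X, y_single)  # works: 1-d target

failed = False
try:
    MultinomialNB().fit(X, y_multilabel)  # 2-d target is rejected
except ValueError as exc:
    failed = True
    print("ValueError:", exc)
```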

TL;DR: Remove CountVectorizer and TfidfTransformer from your code, because the data is already vectorized, and replace MultinomialNB with a classifier that supports a 2-d target y, such as DecisionTreeClassifier.
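A minimal sketch of that setup, run on small random data instead of the real corpus so it finishes instantly. The shapes and values are assumptions for illustration. The OneVsRestClassifier wrapper is not part of the advice above; it is shown only as one possible way to keep MultinomialNB by training one model per label.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)

# Stand-ins for reuters_train.data / reuters_train.target (random, for illustration).
X = csr_matrix(rng.rand(40, 8))          # sparse, non-negative features
Y = (rng.rand(40, 3) > 0.5).astype(int)  # dense 2-d label indicator matrix

# Assumption: with the real dataset, reuters_train.target is a sparse matrix,
# so estimators that need a dense y may require reuters_train.target.toarray().

# Option 1: a classifier that accepts a 2-d target directly.
tree = DecisionTreeClassifier(random_state=0).fit(X, Y)
tree_pred = tree.predict(X)

# Option 2 (an addition, not from the answer): one MultinomialNB per label.
ovr = OneVsRestClassifier(MultinomialNB()).fit(X, Y)
ovr_pred = ovr.predict(X)

print(tree_pred.shape, ovr_pred.shape)
```

Both predictions come back with one column per label, so they can be compared against a 2-d test target of the same layout.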