I have modified this tutorial (http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) to build a text classifier on the Reuters Corpus. However, I get a bad input shape error.
EDIT: Thanks to the help of @Vivek Kumar, I have solved the bad input shape issue. However, now I get an AttributeError: lower not found. After some research, I think it might have something to do with the Reuters corpus not having the correct form. Is there any way I can fix this?
This is my code:
from sklearn.datasets import fetch_rcv1 #import reuters corpus
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
rcv1 = fetch_rcv1()
reuters_train = fetch_rcv1(subset='train', shuffle=True, random_state=42)
reuters_train.target_names
count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(reuters_train.data)
train_counts.shape
count_vect.vocabulary_.get(u'algorithm')
tf_transformer = TfidfTransformer(use_idf=False).fit(train_counts)
train_tf = tf_transformer.transform(train_counts)
train_tf.shape
tfidf_transformer = TfidfTransformer()
train_tfidf = tfidf_transformer.fit_transform(train_counts)
train_tfidf.shape
clf = MultinomialNB().fit(train_tfidf, reuters_train.target)
text_clf = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', MultinomialNB()),])
text_clf.fit(reuters_train.data, reuters_train.target)
import numpy as np
reuters_testset = fetch_rcv1(subset='test', shuffle=True, random_state=42)
reuters_test = reuters_testset.data
predicted = text_clf.predict(reuters_test)
np.mean(predicted == reuters_testset.target)
I'm a real beginner at programming and NLP, so I really don't know very much about all of that stuff (yet). Thanks for any advice and help!
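That's because you are not passing the actual data to the CountVectorizer. You are using reuters_train, whereas you should be using reuters_train.data. Change the fit_transform() call so that it takes reuters_train.data as its argument instead of reuters_train.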
Also, CountVectorizer + TfidfTransformer = TfidfVectorizer, so I would recommend using that in place of the two separate objects.
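To illustrate the equivalence, here is a minimal sketch on a few made-up sentences (the documents are placeholders of my own, not RCV1 data). It assumes default settings for all three classes; with non-default parameters the two routes would need matching arguments.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy documents, made up for this illustration (not part of RCV1).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as oil prices rose",
]

# Two-step approach: raw term counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step approach: TfidfVectorizer does both in a single object.
one_step = TfidfVectorizer().fit_transform(docs)

# Compare the two resulting matrices element by element.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

In a Pipeline, this means the 'vect' and 'tfidf' steps can be collapsed into a single TfidfVectorizer step when the input really is raw text.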
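On further reading of the description of the RCV1 dataset, it turns out that .data is not raw text: it is a sparse matrix that already contains TF-IDF vectors. That is also where the AttributeError comes from, since CountVectorizer tries to call lower() on each document and expects a string. So there is no need to run CountVectorizer and TfidfTransformer on the data, and you can pass reuters_train.data directly to the classifier's fit() method.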
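As a rough sketch of what goes wrong, the snippet below feeds a small sparse matrix to CountVectorizer. The matrix X is a hypothetical stand-in for reuters_train.data with made-up values, so nothing has to be downloaded; the exact wording of the error message may differ between scipy versions.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for reuters_train.data: a small sparse matrix of
# already-computed feature weights (the values are made up).
X = sparse.csr_matrix(np.array([
    [0.0, 0.7, 0.0, 0.7],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]))

# The rows are numeric vectors, not strings.
print(sparse.issparse(X))

# CountVectorizer expects strings and lowercases each document,
# which fails on a row of a sparse matrix.
raised = False
try:
    CountVectorizer().fit_transform(X)
except AttributeError as exc:
    raised = True
    print("AttributeError:", exc)
```

A quick check such as sparse.issparse(reuters_train.data) on the real dataset should tell you whether the features are already vectorized.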
But you will then encounter another error, this time due to the shape of the target data. MultinomialNB().fit() only works with one-dimensional targets (binary or multi-class), not with multi-label or multi-output data.
TL;DR: Remove CountVectorizer and TfidfTransformer from your code, because that work is already done in the data, and replace MultinomialNB with a classifier that supports a 2-d target y, such as DecisionTreeClassifier.
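Here is a minimal sketch of both points, using small made-up arrays in place of reuters_train.data and reuters_train.target (the values, shapes, and variable names are my own assumptions, chosen only so the example runs without downloading RCV1).

```python
import numpy as np
from scipy import sparse
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-ins for reuters_train.data and reuters_train.target:
# 6 samples, 4 non-negative features, 3 binary labels per sample.
X = sparse.csr_matrix(np.array([
    [1.0, 0.0, 0.0, 0.2],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.3],
    [0.0, 0.8, 0.2, 0.0],
    [0.0, 0.0, 1.0, 0.4],
    [0.1, 0.0, 0.9, 0.0],
]))
Y = np.array([
    [1, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
])

# MultinomialNB rejects a 2-d target with a ValueError.
nb_failed = False
try:
    MultinomialNB().fit(X, Y)
except ValueError as exc:
    nb_failed = True
    print("MultinomialNB:", exc)

# DecisionTreeClassifier accepts a 2-d target and predicts one column per label.
tree = DecisionTreeClassifier(random_state=0).fit(X, Y)
predicted = tree.predict(X)
print(predicted.shape)
```

Note that the real reuters_train.target is a sparse matrix, so with DecisionTreeClassifier you may need to convert it first, for example with reuters_train.target.toarray().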