Python sklearn MultinomialNB: Dimension mismatch using DictVectorizer

I'm trying to train a MultinomialNB classifier, but I get ValueError: dimension mismatch.

I'm using DictVectorizer to vectorize the training data and LabelEncoder to encode the class labels.

This is my code:

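# imports inferred from the names used below (mnb is assumed to alias MultinomialNB)
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB as mnb
from sklearn.preprocessing import LabelEncoder

def create_token(inpt):
    return inpt.split(' ')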

def tok_freq(inpt):
    tok = {}
    for i in create_token(inpt):
        if i not in tok:
            tok[i] = 1
        else:
            tok[i] += 1
    return tok

# raw_data is a pandas DataFrame with text and category columns
training_data = []
for i in range(len(raw_data)):
    training_data.append((tok_freq(raw_data.iloc[i].text), raw_data.iloc[i].category))

# vectorization
X, y = list(zip(*training_data))
label = LabelEncoder()
vector = DictVectorizer(dtype=float, sparse=True)
X = vector.fit_transform(X)
y = label.fit_transform(y)
multinb = mnb()
multinb.fit(X,y)

# vectorization for testing set (sms is a single message string)
Xz = tok_freq(sms)
testX = vector.fit_transform(Xz)

multinb.predict(testX)

Which part of my code is wrong? Thanks.

Answer by Vivek Kumar (best answer)

Change

testX = vector.fit_transform(Xz)

to:

testX = vector.transform(Xz)

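Calling fit() or fit_transform() retrains the vectorizer on the new data, so it learns a fresh vocabulary from the test message alone. The resulting matrix then has a different number of columns than the one the classifier was trained on, which is what triggers the dimension mismatch. You only want to convert the test set using the mapping already learned from the training set, so call transform() only.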
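
To make the difference concrete, here is a minimal sketch of the whole flow. The toy messages and labels are made up for illustration (they stand in for raw_data from the question), and the variable names n_train_features, n_test_features, and refit_features are only there to compare shapes.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for raw_data from the question.
train_texts = ["win cash now", "meeting at noon", "cash prize win win", "lunch at noon"]
train_labels = ["spam", "ham", "spam", "ham"]


def tok_freq(text):
    """Count whitespace-separated tokens, as in the question."""
    counts = {}
    for token in text.split(' '):
        counts[token] = counts.get(token, 0) + 1
    return counts


vector = DictVectorizer(dtype=float, sparse=True)
label = LabelEncoder()

# Fit the vectorizer and the label encoder on the training data only.
X = vector.fit_transform([tok_freq(t) for t in train_texts])
y = label.fit_transform(train_labels)

multinb = MultinomialNB()
multinb.fit(X, y)

# Reuse the fitted vectorizer: transform() keeps the training columns
# and silently drops tokens it never saw during fit (here, "a").
sms = "win a cash prize"
testX = vector.transform([tok_freq(sms)])

n_train_features = X.shape[1]
n_test_features = testX.shape[1]
predicted = label.inverse_transform(multinb.predict(testX))[0]

# For comparison only: a separate vectorizer fitted on the test message
# learns just that message's tokens, so its column count differs.
refit_features = (
    DictVectorizer(dtype=float, sparse=True)
    .fit_transform([tok_freq(sms)])
    .shape[1]
)

print(n_train_features, n_test_features, refit_features, predicted)
```

With transform(), the test matrix has the same number of columns as the training matrix, so predict() works. The refitted vectorizer produces a matrix with a different width, which is the situation that raises the error in the question.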