Confusing results between NaiveBayes and LogisticRegression

I went through this quick tutorial on using scikit-learn and have a question about NaiveBayes vs. LogisticRegression.

Here is the link to the transcript -

You should be able to copy/paste the code below and run it. Please let me know if you get different answers!

import pandas as pd
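# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split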
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

vect = CountVectorizer()

url = 'https://raw.githubusercontent.com/justmarkham/pydata-dc-2016-tutorial/master/sms.tsv'
sms = pd.read_table(url, header=None, names=['label', 'message'])
sms['label_num'] = sms.label.map({'ham': 0, 'spam': 1})
X = sms.message
y = sms.label_num

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=1)

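# learn the vocabulary from the training messages only, then build document-term matrices
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_test_dtm = vect.transform(X_test)  # needed for the accuracy checks further down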

# NaiveBayes
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

# LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)

# testing data
simple_text = ["this is a spam message spam spam spam"]
simple_test_dtm = vect.transform(simple_text)

# *** NaiveBayes ***
nb.predict(simple_test_dtm)
# array([1]) says this is spam

nb.predict_proba(simple_test_dtm)[:, 1]
# array([0.98743019]) 

# *** LogisticRegression ***
logreg.predict(simple_test_dtm)
# array([0]) says this is NOT spam

logreg.predict_proba(simple_test_dtm)[:, 1]
# array([0.05628297])

nb_pred_class = nb.predict(X_test_dtm)
metrics.accuracy_score(y_test, nb_pred_class)
# 0.9885139985642498

lg_pred_class = logreg.predict(X_test_dtm)
metrics.accuracy_score(y_test, lg_pred_class)
# 0.9877961234745154

Two questions:

1.) Why is NaiveBayes returning that it is Spam when LogisticRegression is saying that it is Ham?

Both classifiers return a high accuracy score, but give different answers? That is confusing me. Am I doing something wrong?
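For what it's worth, two high accuracy scores do not by themselves imply that the classifiers agree on every message. A minimal sketch with made-up labels and predictions (the lists below are hypothetical, not taken from the SMS data) shows two sets of predictions with identical accuracy that still differ on individual items:

```python
# Hypothetical labels and predictions, for illustration only (0 = ham, 1 = spam).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
pred_a = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]  # misses the last spam message
pred_b = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]  # flags one ham message as spam


def accuracy(truth, pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


acc_a = accuracy(y_true, pred_a)
acc_b = accuracy(y_true, pred_b)

# Positions where the two sets of predictions differ from each other.
disagreements = [i for i, (a, b) in enumerate(zip(pred_a, pred_b)) if a != b]

print(acc_a, acc_b)    # 0.9 0.9
print(disagreements)   # [7, 9]
```

On the real data, the analogous check would be to compare nb_pred_class and lg_pred_class element by element, for example with (nb_pred_class != lg_pred_class).sum().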

2.) What does the .predict_proba score mean? The way I thought I understood it, it was how accurate the classifier's response is, i.e. NB is saying it believes its answer (1) is 98% accurate, but that would mean LogReg is saying its answer (0) is 6% accurate.

Which doesn't make sense.
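As far as I can tell from the scikit-learn API, predict_proba returns one column per class, in the order given by the classifier's classes_ attribute, and each row sums to 1. Under that reading, [:, 1] is the estimated probability of class 1 (spam), not a confidence in whichever class was predicted, so LogReg's 0.056 would be its estimate of P(spam), leaving roughly 0.944 for ham. A minimal sketch that checks this on a made-up four-message corpus (the texts, labels and test sentence are all hypothetical, not the SMS data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus, for illustration only (0 = ham, 1 = spam).
texts = ["win cash now", "free prize win", "see you at lunch", "call me later"]
labels = [1, 1, 0, 0]

vect = CountVectorizer()
X = vect.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

new_dtm = vect.transform(["win a free prize"])
proba = clf.predict_proba(new_dtm)   # shape (1, 2): one column per class
pred = clf.predict(new_dtm)[0]

# Column j of predict_proba corresponds to clf.classes_[j].
print(clf.classes_)                  # class labels in column order
print(proba[0], proba[0].sum())      # per-class probabilities and their sum
print(pred == clf.classes_[proba[0].argmax()])  # predict picks the most probable class
```

The same inspection should apply to nb and logreg above: printing predict_proba(simple_test_dtm) without the [:, 1] slice shows both columns side by side.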

Any help would be greatly appreciated.
