Scikit-learn 0.15.2 - OneVsRestClassifier not works due to predict_proba not available

1.7k views Asked by At

I am trying to do onevsrest classification like below:

classifier = Pipeline([('vectorizer', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', OneVsRestClassifier(SVC(kernel='rbf')))])

classifier.fit(X_train, Y)

predicted = classifier.predict(X_test)

And I get the error 'predict_proba is not available when probability = false'. I saw that there was a bug reported, the one below: https://github.com/scikit-learn/scikit-learn/issues/1946

And it was closed as fixed, so I killed scikit-learn from my Windows PC and completely re-downloaded scikit-learn to have version 0.15.2. But I still get this error. Any suggestions? Or I understood this wrong, and I still can't use SVC with OneVSRestClassifier unless I specify probability=true?

UPDATE: just to clarify, I am trying to actually achieve multi-label classification, here is data source:

df = pd.read_csv(fileIn, header = 0, encoding='utf-8-sig')
rows = random.sample(df.index, int(len(df) * 0.9))

work = df.ix[rows]

work_test = df.drop(rows)

X_train = []

y_train = []

X_test = []

y_test = []
for i in work[[i for i in list(work.columns.values) if i.startswith('Change')]].values:
    X_train.append(','.join(i.T.tolist()))

X_train = np.array(X_train)

for i in work[[i for i in list(work.columns.values) if i.startswith('Corax')]].values:
    y_train.append(list(i))


for i in work_test[[i for i in list(work_test.columns.values) if i.startswith('Change')]].values:
    X_test.append(','.join(i.T.tolist()))

X_test = np.array(X_test)

for i in work_test[[i for i in list(work_test.columns.values) if i.startswith('Corax')]].values:
    y_test.append(list(i))


lb = preprocessing.MultiLabelBinarizer()

Y = lb.fit_transform(y_train)

And after that I send it to pipeline mentioned earlier

1

There are 1 answers

7
Maksim Khaitovich On BEST ANSWER

Ok, I did some investigation in code. OneVsRestClassifier tries to call decision_function first and if it fails - it goes for predict_proba function of base classifier (svm.svc in our case).

As far as I see, my X_test is numpy.array of lists of strings. After it undergoes a sequence of transformations specified in pipeline CountVectorizer -> TfidfTransformer it becomes a sparse matrix (by design of these things). As I see currently decision_function is not available for sparse matrices, and there is even an open suggestion on github: https://github.com/scikit-learn/scikit-learn/issues/73

So, to summarize, looks like you can't make a multilabel classification using svm.svc unless you specify probability=True. If you do this you introduce some overhead to the classifier.fit process but it will work.