I am trying to do multilabel classification with a Pipeline wrapping a OneVsRestClassifier in scikit-learn. I build my multilabel examples from a pandas DataFrame. Here is the code:
import random
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

df = pd.read_csv(fileIn, header=0, encoding='utf-8-sig')

# 90/10 train/test split on the DataFrame index
rows = random.sample(list(df.index), int(len(df) * 0.9))
work = df.loc[rows]  # .ix is deprecated; .loc selects the sampled rows by label
work_test = df.drop(rows)

X_train, y_train, X_test, y_test = [], [], [], []

# Features: join the 'Change*' columns of each row into one comma-separated string
for i in work[[c for c in work.columns if c.startswith('Change')]].values:
    X_train.append(','.join(i.T.tolist()))
X_train = np.array(X_train)

# Labels: the 'Corax*' columns of each row form that row's label set
for i in work[[c for c in work.columns if c.startswith('Corax')]].values:
    y_train.append(list(i))

for i in work_test[[c for c in work_test.columns if c.startswith('Change')]].values:
    X_test.append(','.join(i.T.tolist()))
X_test = np.array(X_test)

for i in work_test[[c for c in work_test.columns if c.startswith('Corax')]].values:
    y_test.append(list(i))

lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y_train)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(SVC(kernel='rbf'))),
])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
The problem is that the CountVectorizer -> TfidfTransformer chain produces a sparse matrix, and when OneVsRestClassifier predicts labels it looks for a decision_function or predict_proba method on the underlying estimator. predict_proba is not available on svm.SVC unless you specify probability=True, and, as far as I can see in the source, decision_function is not implemented for sparse matrices. So my code fails because neither of the two required methods is available. Am I doing something wrong? Is it possible to achieve multilabel classification with svm.SVC without specifying probability=True (which adds significant overhead to classifier training), for example by forcing TfidfTransformer to output a dense matrix instead of a sparse one?
This is a well-known issue, and no easy solution exists yet.
You can use the Pipeline to "densify" your sparse data (by calling .toarray()), but this can blow up memory consumption. You can do TruncatedSVD (AFAIK, it's the only dimensionality reduction method that works with sparse data), but it can mess with your data enough that the SVM's performance decreases.
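Here is a minimal sketch of both options, reusing the CountVectorizer/TfidfTransformer/SVC pipeline from the question; the DenseTransformer helper and the n_components=100 value are illustrative placeholders, not anything prescribed by scikit-learn.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


class DenseTransformer(BaseEstimator, TransformerMixin):
    """Illustrative helper that turns a sparse matrix into a dense array."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Densifying keeps every feature but can blow up memory for large vocabularies.
        return X.toarray()


# Option 1: densify between the tf-idf step and the SVM.
dense_clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('to_dense', DenseTransformer()),
    ('clf', OneVsRestClassifier(SVC(kernel='rbf'))),
])

# Option 2: reduce dimensionality with TruncatedSVD, which accepts sparse input
# and outputs a dense array; n_components is a placeholder you would tune.
svd_clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('svd', TruncatedSVD(n_components=100)),
    ('clf', OneVsRestClassifier(SVC(kernel='rbf'))),
])

# Either pipeline is fit the same way as the original one:
# dense_clf.fit(X_train, Y); predicted = dense_clf.predict(X_test)

If you go the TruncatedSVD route, check that the reduced representation still separates your labels well, since, as noted above, the reduction can hurt the SVM's accuracy.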