scikit-learn - multilabel classification with the svm.SVC classifier: is it possible without probability=True?


I am trying to do multilabel classification with a Pipeline and OneVsRestClassifier in scikit-learn. Let me mention first that I construct my multilabel examples from a pandas DataFrame.

The code is below:

import random

import numpy as np
import pandas as pd

from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

df = pd.read_csv(fileIn, header=0, encoding='utf-8-sig')

# random 90/10 split into training and test rows
rows = random.sample(list(df.index), int(len(df) * 0.9))

work = df.loc[rows]

work_test = df.drop(rows)

X_train = []
y_train = []
X_test = []
y_test = []

# training features: each row's 'Change*' columns joined into one comma-separated string
for i in work[[c for c in work.columns if c.startswith('Change')]].values:
    X_train.append(','.join(i.T.tolist()))

X_train = np.array(X_train)

# training labels: the 'Corax*' columns of each row
for i in work[[c for c in work.columns if c.startswith('Corax')]].values:
    y_train.append(list(i))


# test features and labels, built the same way from the held-out rows
for i in work_test[[c for c in work_test.columns if c.startswith('Change')]].values:
    X_test.append(','.join(i.T.tolist()))

X_test = np.array(X_test)

for i in work_test[[c for c in work_test.columns if c.startswith('Corax')]].values:
    y_test.append(list(i))


# binarize the label lists into a multilabel indicator matrix
lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y_train)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(SVC(kernel='rbf')))])

classifier.fit(X_train, Y)

predicted = classifier.predict(X_test)

The issue is that the CountVectorizer -> TfidfTransformer chain produces a sparse matrix, and when OneVsRestClassifier predicts labels it looks for a decision_function or predict_proba method on the underlying estimator. predict_proba is not available on svm.SVC unless you specify probability=True, and, as far as I can see in the code, decision_function is not implemented for sparse matrices. So my code fails because neither of the two required methods is available.

Maybe I am doing something wrong? Is it possible to achieve multilabel classification with svm.SVC without specifying probability=True (which adds significant overhead to training the classifier), perhaps by forcing TfidfTransformer to output a dense matrix instead of a sparse one?
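For concreteness, something like the following densifying step is what I have in mind, sketched here with scikit-learn's FunctionTransformer (assuming a version that provides it); the to_dense step name is just illustrative:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC

# convert the sparse tf-idf output to a dense array before it reaches the SVM
to_dense = FunctionTransformer(lambda X: X.toarray(), accept_sparse=True)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('to_dense', to_dense),
    ('clf', OneVsRestClassifier(SVC(kernel='rbf')))])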

1 Answer

Artem Sobolev (accepted answer):

This is a well-known issue and by now no easy solution exists.

You can use Pipeline to "densify" your sparse data (by calling .toarray), but this can blow up memory consumption. You can do TruncatedSVD (AFAIK, it's the only dimensionality reduction method that works with sparse data), but it can mess with your data so that SVM's performance would decrease.