I am trying to do multilabel classification with a Pipeline wrapping a OneVsRestClassifier in scikit-learn. I build my multilabel examples from a pandas DataFrame. Here is the code:
import random
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

df = pd.read_csv(fileIn, header=0, encoding='utf-8-sig')

# 90/10 train/test split on the DataFrame index
rows = random.sample(list(df.index), int(len(df) * 0.9))
work = df.loc[rows]  # .ix is deprecated; .loc selects the sampled rows by label
work_test = df.drop(rows)

X_train, y_train, X_test, y_test = [], [], [], []

# Features: join the 'Change*' columns of each row into one comma-separated string
for i in work[[c for c in work.columns if c.startswith('Change')]].values:
    X_train.append(','.join(i.T.tolist()))
X_train = np.array(X_train)

# Labels: the 'Corax*' columns of each row form that row's label set
for i in work[[c for c in work.columns if c.startswith('Corax')]].values:
    y_train.append(list(i))

for i in work_test[[c for c in work_test.columns if c.startswith('Change')]].values:
    X_test.append(','.join(i.T.tolist()))
X_test = np.array(X_test)

for i in work_test[[c for c in work_test.columns if c.startswith('Corax')]].values:
    y_test.append(list(i))

lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y_train)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(SVC(kernel='rbf'))),
])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
The problem is that the CountVectorizer -> TfidfTransformer chain produces a sparse matrix, and when OneVsRestClassifier predicts labels it looks for a decision_function or predict_proba method on the underlying estimator. predict_proba is not available on svm.SVC unless you specify probability=True, and, as far as I can see in the source, decision_function is not implemented for sparse matrices. So my code fails because neither of the two required methods is available. Am I doing something wrong? Is it possible to achieve multilabel classification with svm.SVC without specifying probability=True (which adds significant overhead to classifier training), for example by forcing TfidfTransformer to output a dense matrix instead of a sparse one?
This is a well-known issue, and no easy solution exists yet.
You can use the Pipeline to "densify" your sparse data (by calling .toarray()), but this can blow up memory consumption. You can do TruncatedSVD (AFAIK, it's the only dimensionality reduction method that works with sparse data), but it can mess with your data enough that the SVM's performance decreases.
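Here is a minimal sketch of both options, reusing the CountVectorizer/TfidfTransformer/SVC pipeline from the question; the DenseTransformer helper and the n_components=100 value are illustrative placeholders, not anything prescribed by scikit-learn.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


class DenseTransformer(BaseEstimator, TransformerMixin):
    """Illustrative helper that turns a sparse matrix into a dense array."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Densifying keeps every feature but can blow up memory for large vocabularies.
        return X.toarray()


# Option 1: densify between the tf-idf step and the SVM.
dense_clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('to_dense', DenseTransformer()),
    ('clf', OneVsRestClassifier(SVC(kernel='rbf'))),
])

# Option 2: reduce dimensionality with TruncatedSVD, which accepts sparse input
# and outputs a dense array; n_components is a placeholder you would tune.
svd_clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('svd', TruncatedSVD(n_components=100)),
    ('clf', OneVsRestClassifier(SVC(kernel='rbf'))),
])

# Either pipeline is fit the same way as the original one:
# dense_clf.fit(X_train, Y); predicted = dense_clf.predict(X_test)

If you go the TruncatedSVD route, check that the reduced representation still separates your labels well, since, as noted above, the reduction can hurt the SVM's accuracy.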