In order to run a NB classifier on about 400 MB of text data I need to use a vectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=2)
X_train = vectorizer.fit_transform(X_data)
But it is giving an out-of-memory error. I am using 64-bit Linux and a 64-bit Python build. How do people work through the vectorization process in scikit-learn for large text datasets?
Traceback (most recent call last):
File "ParseData.py", line 234, in <module>
main()
File "ParseData.py", line 211, in main
classifier = MultinomialNB().fit(X_train, y_train)
File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/naive_bayes.py", line 313, in fit
Y = labelbin.fit_transform(y)
File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/base.py", line 408, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 272, in transform
neg_label=self.neg_label)
File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 394, in label_binarize
Y = np.zeros((len(y), len(classes)), dtype=np.int)
Edited (ogrisel): I changed the title from "Out of Memory Error in Scikit Vectorizer" to "Out of Memory Error in Scikit-learn MultinomialNB" to make it more descriptive of the actual problem.
Let me summarize the outcome of the discussion in the comments:
- The label preprocessing machinery used internally in many scikit-learn classifiers does not scale well memory-wise with respect to the number of classes (a rough sizing sketch follows). This is a known issue and there is ongoing work to tackle it.
- The MultinomialNB class itself will probably not be suitable for classification in a label space with cardinality 43K, even if the label preprocessing limitation is fixed.
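To make the scaling issue concrete: the last frame of the traceback allocates a dense (n_samples, n_classes) integer array. A rough back-of-the-envelope sketch, where the document count is a made-up assumption and only the 43K class count comes from this discussion:

# hypothetical sizing of the dense array allocated by label_binarize
n_samples = 100000   # assumed number of documents in the 400 MB corpus
n_classes = 43000    # label cardinality mentioned above
bytes_per_entry = 8  # np.int is 64 bits on a 64-bit Linux/Python build
size_gib = n_samples * n_classes * bytes_per_entry / float(1024 ** 3)
print(size_gib)      # ~32 GiB, far more RAM than a typical machine has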
To address the large cardinality classification problem you could try:
- fit binary SGDClassifier(loss='log', penalty='elasticnet') instances on the columns of y_train, converted to numpy arrays independently, then call clf.sparsify(), and finally wrap those sparse models as a final one-vs-rest classifier (or rank the predictions of the binary classifiers by probability); a minimal sketch follows this list. Depending on the value of the regularization parameter alpha you might get sparse models that are small enough to fit in memory. You can also try to do the same with LogisticRegression, that is something like: clf_label_i = LogisticRegression(penalty='l1').fit(X_train, y_train[:, label_i].toarray()).sparsify()
- alternatively, try to do a PCA of the target labels y_train, then cast your classification problem as a multi-output regression problem in the reduced label PCA space, and then decode the regressor's output by looking for the nearest class encoding in the label PCA space; a second sketch below illustrates this.

You can also have a look at the Block Coordinate Descent Algorithms for Large-scale Sparse Multiclass Classification implemented in lightning, but I am not sure it is suitable for a label cardinality of 43K either.
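Here is a minimal sketch of the first suggestion, assuming X_train is the sparse TF-IDF matrix from the question and y_train is a sparse 0/1 label-indicator matrix with one column per class (the alpha value and the top-1 decoding are arbitrary illustrative choices, not a recommendation):

import numpy as np
from sklearn.linear_model import SGDClassifier

label_models = []
for label_i in range(y_train.shape[1]):
    # densify a single label column at a time to keep memory bounded
    y_col = y_train[:, label_i].toarray().ravel()
    # assumes every label has at least one positive and one negative example
    clf = SGDClassifier(loss='log', penalty='elasticnet', alpha=1e-5)
    clf.fit(X_train, y_col)
    clf.sparsify()  # store coef_ as a scipy.sparse matrix to save memory
    label_models.append(clf)

def predict_top_label(x):
    # rank labels by the positive-class probability of each binary model
    scores = [clf.predict_proba(x)[0, 1] for clf in label_models]
    return int(np.argmax(scores))

sparsify() only pays off if the elastic net / L1 penalty actually drives most coefficients to zero, which is what makes storing tens of thousands of small binary models feasible.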
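And a rough sketch of the label-PCA alternative, assuming TruncatedSVD stands in for the PCA of the sparse label matrix, Ridge is used as the multi-output regressor, and X_test and the 100-component choice are illustrative placeholders:

from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import Ridge
from sklearn.metrics import pairwise_distances_argmin

# project the high-dimensional label indicators into a small dense space
svd = TruncatedSVD(n_components=100)
Y_reduced = svd.fit_transform(y_train)  # shape (n_samples, 100)

# multi-output regression from TF-IDF features to the reduced label space
reg = Ridge(alpha=1.0).fit(X_train, Y_reduced)

# each class j is encoded as the projection of its one-hot vector, which
# for TruncatedSVD is the j-th column of components_
class_codes = svd.components_.T  # shape (n_classes, 100)

# decode by nearest class encoding in the reduced label space
Y_pred_reduced = reg.predict(X_test)
y_pred = pairwise_distances_argmin(Y_pred_reduced, class_codes)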