I am trying to use scikit-learn to predict a value for an input text string. I am using HashingVectorizer for data vectorization and PassiveAggressiveClassifier for learning with partial_fit (refer to the following code):
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn import metrics
from sklearn.metrics import zero_one_loss
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import PassiveAggressiveClassifier, SGDClassifier, Perceptron
from sklearn.pipeline import make_pipeline
from sklearn.externals import joblib
import pickle
a,r = [],[]
vectorizer = TfidfVectorizer()
with open('val', 'rb') as f:
    r = pickle.load(f)
with open('text', 'rb') as f:
    a = pickle.load(f)
L = vectorizer.fit_transform(a)   # sparse matrix of all documents
training_set = L[:3250]
testing_set = L[3250:]
M = np.array(r)
training_result = M[:3250]
testing_result = M[3250:]
cls = np.unique(r)                # every class label that can appear
model = PassiveAggressiveClassifier()
model.partial_fit(training_set, training_result, classes=cls)
print(model)
predicted = model.predict(testing_set)
print(testing_result)
print(predicted)
Error log:
File "try.py", line 89, in <module>
model.partial_fit(training_set, training_result, classes=cls)
File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/passive_aggressive.py", line 115, in partial_fit
coef_init=None, intercept_init=None)
File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 374, in _partial_fit
coef_init, intercept_init)
File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 167, in _allocate_parameter_mem
dtype=np.float64, order="C")
MemoryError
I was previously using CountVectorizer and Logistic Regression for classification and that worked without issues. But my training data is on the order of millions of lines, and I want to implement incremental learning using the above script, which causes a MemoryError on each execution.
UPDATE:
After applying partial fitting in a loop, the partial_fit call raises a mismatched-number-of-features error (ValueError: Number of features 8897 does not match previous data 9190.).
Also, even if I set the max_features attribute, the generated predictions are incorrect.
Is there any way to make the partial_fit method accept a variable number of features?
Execution Output:
(400, 8481)
(400, 9277)
Traceback (most recent call last):
File "f9.py", line 65, in <module>
training_set, training_result, classes=cls)
File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/passive_aggressive.py", line 115, in partial_fit
coef_init=None, intercept_init=None)
File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 379, in _partial_fit
% (n_features, self.coef_.shape[-1]))
ValueError: Number of features 9277 does not match previous data 8481.
Any help will be appreciated.
Thanks!
The MemoryError comes from having too much data in memory at once. When you load the data you already hold a quantity equal to N, and then, depending on the algorithm, partial_fit may store roughly another N on top of it.
You don't need to hold your data twice. Reduce the size of the initial chunk of data: split it into several parts and feed each part to the partial_fit method. Read your file line by line to build a chunk, fit on that chunk, free the memory, and repeat.
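Here is a minimal sketch of that chunked approach. It assumes your documents and labels sit line-aligned in two plain-text files (text.txt and val.txt are hypothetical names) and that the labels are integers known in advance. I use HashingVectorizer instead of TfidfVectorizer because the hashing trick is stateless: every chunk is projected into the same fixed-size feature space, which also avoids the ValueError from your update.
from itertools import islice
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

CHUNK_SIZE = 1000                                   # lines fed to partial_fit at a time
vectorizer = HashingVectorizer(n_features=2 ** 18)  # fixed output dimensionality
model = PassiveAggressiveClassifier()
classes = np.array([0, 1])                          # assumption: every label that can ever appear

with open('text.txt') as texts, open('val.txt') as labels:
    while True:
        text_chunk = list(islice(texts, CHUNK_SIZE))
        label_chunk = list(islice(labels, CHUNK_SIZE))
        if not text_chunk:
            break
        X = vectorizer.transform(text_chunk)        # stateless: no fit, same width for every chunk
        y = np.array([int(label) for label in label_chunk])
        model.partial_fit(X, y, classes=classes)    # only this chunk is held in memory
Only the current chunk and the model's coefficients are in memory at any time, so this scales to millions of lines.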