I am trying to implement active learning in Python. My classification problem currently takes Word2vec vector representations and feeds them into a Random Forest.
I have a small initial training set, and I would like to use the modAL package to grow it through active learning.
Here is what I've tried so far:
import numpy as np
import modAL.uncertainty
from modAL.models import ActiveLearner
from sklearn.ensemble import RandomForestClassifier

learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    query_strategy=modAL.uncertainty.uncertainty_sampling,
    X_training=X_train0, y_training=y_train
)
test = test.reset_index()
for i in range(20):
    # Ask the learner which unlabelled example it is least certain about
    query_idx, query_instance = learner.query(X_test0)
    y_new = input('Classify:')
    y_new = np.array([y_new])
    # This is the call that fails
    learner.teach(np.array(X_test0[query_idx]).reshape(-1, 1), y_new)
Where X_test0 is a pandas DataFrame with shape 1056 x 100 (i.e. 1056 examples with 100 features each, which are Word2vec representations). I treat it as if it were unlabelled so that I can check performance later.
Similarly, y_train is another pandas DataFrame containing the binary classification for the training data (0s or 1s).
My issue is that I want modAL to understand that each example has multiple features, so there is exactly one label per 100-element vector. In the example above, the following error appears:
ValueError: Found input variables with inconsistent numbers of samples: [100, 1]
It seems to me that it is not understanding that those 100 features correspond to only one label...
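To illustrate the shapes involved, here is a minimal sketch using NumPy alone. It assumes a single queried example is a 100-element vector; the array `row` is a made-up stand-in, not my real data. As far as I can tell, `reshape(-1, 1)` turns that vector into 100 samples of one feature each, which would match the `[100, 1]` in the error, whereas `reshape(1, -1)` keeps it as one sample with 100 features.

```python
import numpy as np

# Hypothetical stand-in for one queried Word2vec vector (100 features).
row = np.zeros(100)

as_column = row.reshape(-1, 1)  # 100 samples with 1 feature each
as_row = row.reshape(1, -1)     # 1 sample with 100 features

y_new = np.array([1])           # a single label

# Print the three shapes side by side for comparison.
print(as_column.shape, as_row.shape, y_new.shape)
```

Only the second shape has as many rows as there are labels.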
Any clue on how to solve it?
EDIT: I thought the problem might be the reshape call. Since teach seems to expect an array as input, I also tried changing the last line as follows:
learner.teach(X_test0.iloc[query_idx].values, np.array(y_new))
which now produces the following error:
TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid
Removing .values to pass a DataFrame instead also produces an error:
TypeError: <class 'pandas.core.series.Series'> datatype is not supported
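Both of these errors seem to involve mixing pandas objects with NumPy arrays. One thing that might be worth trying is converting everything to NumPy arrays before it reaches the learner; this is an assumption on my part, not something I have confirmed against modAL's documentation. The sketch below uses small made-up stand-ins (`X_pool_df`, `query_idx`) and only shows the conversions and the resulting shapes, not the modAL calls themselves.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_test0: 5 examples with 100 features each.
X_pool_df = pd.DataFrame(np.random.rand(5, 100))

# Convert once so that everything downstream is a plain NumPy array.
X_pool = X_pool_df.to_numpy()

# Hypothetical query result: an array holding a single row position.
query_idx = np.array([3])

# Indexing with an index array keeps the result two-dimensional.
X_new = X_pool[query_idx]        # shape (1, 100): one sample, 100 features

# input() returns text, so the label is cast to an integer here.
y_new = np.array([int("1")])     # shape (1,)

print(X_new.shape, y_new.shape)
```

If this is the right idea, the training data (X_train0 and y_train) would presumably need the same conversion, so that the learner never has to combine a DataFrame with an array.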