I am trying to oversample data with imblearn using the code below:
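from imblearn.over_sampling import RandomOverSampler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

def oversample(df):
    description = df['DESCRIPTION']
    labels = df['LABEL']
    # Turn each description into a TF-IDF vector
    vec = TfidfVectorizer(
        norm='l2',
        lowercase=True,
        strip_accents=None,
        encoding='utf-8',
        preprocessor=None,
        token_pattern=r"(?u)\S\S+")
    desc = vec.fit_transform(description)
    # Encode the string labels as integers
    encoder = LabelEncoder()
    encoder.fit(labels)
    labels = encoder.transform(labels)
    # Duplicate minority-class rows until the classes are balanced
    over = RandomOverSampler(random_state=0)
    X, y = over.fit_resample(desc, labels)
    # Try to map the vectors and encoded labels back to text
    oversampled_descriptions = vec.inverse_transform(X)
    label = encoder.inverse_transform(y)
    return oversampled_descriptions, label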
However, I have a problem with the text ordering: after I call inverse_transform on the data, the words of each description come back in the wrong order. How can I keep the original order?
You can't.
inverse_transform() does not reconstruct the original document. It only returns the n-grams that each document contained and that were extracted during the fit. The only information it can use is what is stored in the vocabulary_ attribute.
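To make this concrete, here is a small sketch. The sample sentence and variable names are made up for illustration and the default token pattern is used instead of the one in the question. It shows that the recovered tokens carry neither the original word order nor repeated words:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical one-document corpus, not from the question.
docs = ["the fox saw the dog"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# inverse_transform returns, per document, the vocabulary terms with a
# non-zero weight. Positions and repetitions are not stored in X.
recovered = vec.inverse_transform(X)
tokens = list(recovered[0])
print(tokens)
```

The original sentence has five words, but only the four distinct terms come back, in whatever order the vectorizer's columns happen to have.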
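What you can do instead is keep track of the row indices of description through the resampling step, for example by adding them to desc before the resample, and then use the resampled indices to look up the original text when building oversampled_descriptions.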