How to correctly inverse_transform TFIDF vectorizer


I am trying to oversample data with imblearn using the code below:

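from imblearn.over_sampling import RandomOverSampler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder


def oversample(df):
    description = df['DESCRIPTION']
    labels = df['LABEL']

    # Vectorize the text into a sparse TF-IDF matrix
    vec = TfidfVectorizer(
        norm='l2',
        lowercase=True,
        strip_accents=None,
        encoding='utf-8',
        preprocessor=None,
        token_pattern=r"(?u)\S\S+")
    desc = vec.fit_transform(description)

    # Encode the labels as integers
    encoder = LabelEncoder()
    labels = encoder.fit_transform(labels)

    # Randomly duplicate minority-class rows
    over = RandomOverSampler(random_state=0)
    X, y = over.fit_resample(desc, labels)

    # Map the resampled rows back to terms and original labels
    oversampled_descriptions = vec.inverse_transform(X)
    label = encoder.inverse_transform(y)
    return oversampled_descriptions, label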

However, after I inverse_transform the data, the words of each text come back in the wrong order. How can I keep the original order?


There is 1 answer

Roeik On

You can't.

inverse_transform() does not reconstruct the original document. It only returns the terms (n-grams) that have a nonzero entry in each document's row, looked up through the vocabulary learned during fit (the vocabulary_ attribute). The TF-IDF matrix stores one weight per term, not word positions, so the original word order cannot be recovered from it.
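To make this concrete, here is a small sketch (the two toy sentences are made up for illustration) showing that two documents with the same words in a different order end up with the same TF-IDF row, so inverse_transform has nothing to tell them apart:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Two hypothetical documents: same words, different order.
docs = ["the cat sat on the mat", "mat the on sat cat the"]

vec = TfidfVectorizer(token_pattern=r"(?u)\S\S+")
X = vec.fit_transform(docs)

# Same words with the same counts give the same TF-IDF weights.
same_rows = np.allclose(X[0].toarray(), X[1].toarray())

# inverse_transform returns one array of terms per row. Sorting them here
# only makes the comparison independent of the order they come back in.
recovered = [sorted(str(t) for t in terms) for terms in vec.inverse_transform(X)]
```

Both entries of recovered hold the same five distinct terms; the repeated "the" and the position of every word are gone.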

Instead, keep track of row indices: attach the index of each description to desc before resampling, and after resampling use those indices to look up the original texts for oversampled_descriptions.