SMOTE - could not convert string to float

Question

11.5k views Asked by Math At 13 December 2020 at 21:25

I think I'm missing something in the code below.

import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Split into training and test sets

# Testing Count Vectorizer

X = df[['Spam']]
y = df['Value']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)

sm = pd.concat([X_resampled, y_resampled], axis=1)

It raises this error:

ValueError: could not convert string to float:
---> 19 X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)

An example of the data:

Spam                                             Value
Your microsoft account was compromised             1
Manchester United lost against PSG                 0
I like cooking                                     0

I think I need to transform both the training and test sets to fix the error, but I don't know how to apply the transformation to both. I've tried a few examples I found online, but none of them fixed the issue.

There are 3 answers

Harsh Bangad On 11 October 2022 at 01:51

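You can use SMOTENC instead of SMOTE. SMOTENC handles categorical features directly. Note that it expects a mix of categorical and continuous columns, so it suits categorical codes better than free text such as the Spam column.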

https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTENC.html#imblearn.over_sampling.SMOTENC

Frtna2 On 08 January 2023 at 02:54

Tokenizing your string data before feeding it into SMOTE is an option. You can use any tokenizer; with torch, the implementation would look something like this:

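import torch
from imblearn.over_sampling import SMOTE

# `dataset` is assumed to be an already tokenized dataset whose batches
# expose 'input_ids' and 'labels' tensors
dataloader = torch.utils.data.DataLoader(dataset, batch_size=64)

X, y = [], []

# Collect the token ids and labels batch by batch
for batch in dataloader:
    input_ids = batch['input_ids']
    labels = batch['labels']

    X.append(input_ids)
    y.append(labels)

# Stack the batches into single tensors
X_tensor = torch.cat(X, dim=0)
y_tensor = torch.cat(y, dim=0)

# SMOTE works on NumPy arrays, not tensors
X = X_tensor.numpy()
y = y_tensor.numpy()

# sampling_strategy=0.6 targets a minority/majority ratio of 0.6 after resampling
smote = SMOTE(random_state=42, sampling_strategy=0.6)
X_resampled, y_resampled = smote.fit_resample(X, y)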

Ravi On 13 December 2020 at 22:01 (accepted answer)

Convert the text data to numeric features before applying SMOTE, as shown below.

from sklearn.feature_extraction.text import CountVectorizer

# Fit the vocabulary on the training text only, then reuse it for the test text
vectorizer = CountVectorizer()
vectorizer.fit(X_train.values.ravel())
X_train = vectorizer.transform(X_train.values.ravel())
X_test = vectorizer.transform(X_test.values.ravel())

# Convert the sparse count matrices to dense arrays
X_train = X_train.toarray()
X_test = X_test.toarray()

Then apply your SMOTE code:

X_train = pd.DataFrame(X_train)
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)
