Are oversampling and undersampling good approaches for building good models?


I just worked on the "Heart Failure Prediction" dataset from Kaggle (https://www.kaggle.com/andrewmvd/heart-failure-clinical-data).

I noticed that there were more "Not dead" cases than "dead" cases, so I used SMOTETomek to resample the data. I then computed the accuracy and printed the confusion matrix, and the results were noticeably better than before.

df.DEATH_EVENT.value_counts()

0    202
1     95
Name: DEATH_EVENT, dtype: int64

Accuracy and confusion matrix before resampling:

0.7888888888888889
[[130  30]
[  8  12]]

The resampling code:

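import pandas as pd
from imblearn.combine import SMOTETomek

# X and y are the feature matrix and the DEATH_EVENT target (not shown)
smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X, y)
pd.DataFrame(y_res)['DEATH_EVENT'].value_counts()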

1    155
0    155
Name: DEATH_EVENT, dtype: int64

Accuracy and confusion matrix after resampling:

0.912
[[53  5]
[ 6 61]]

But this was a small sample.
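For reference, the two accuracy figures can be recomputed from the confusion matrices above, along with precision and recall for the "dead" class. This is only a sketch. It assumes the matrices follow the scikit-learn convention (rows are true labels, columns are predicted labels), which the post does not state, and `summarize` is a hypothetical helper name.

```python
# Confusion matrices copied from the question.
# Assumption: rows are true labels and columns are predicted labels
# (the scikit-learn convention), with class 1 meaning DEATH_EVENT == 1.
before = [[130, 30], [8, 12]]
after = [[53, 5], [6, 61]]


def summarize(cm):
    """Return accuracy plus precision and recall for class 1."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision_1": tp / (tp + fp),
        "recall_1": tp / (tp + fn),
        "support_1": tp + fn,  # number of true class-1 rows in the test set
        "total": total,        # size of the test set
    }


for name, cm in (("before", before), ("after", after)):
    stats = summarize(cm)
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

The totals differ (180 rows before, 125 after), so the two accuracies are not measured on the same test set. Under the stated assumption, the class balance of the test set differs as well (20 "dead" rows before, 67 after).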

From your experience, does using oversampling or undersampling generally lead to better results? Or does it produce misleadingly good results, so that the model won't perform as well in the real world?
