I am using smoter (from the smogn package) to balance my data for regression. I have 130k samples, 3 feature columns, and 1 target column. Smoter is taking ages to balance the data, whereas SMOTE from imblearn for classification took only seconds. Am I doing something wrong, or is it just the size of the data? The time estimated by smoter is around 20 h to balance all the data. I also checked a subset of 13k samples (about 10% of the data), and the estimated time was around 2 h.
import smogn
smogn.smoter(
    ## main arguments
    data = df_gonzalez_healthy,  ## pandas dataframe
    y = 'healthy',               ## string ('header name')
    k = 9,                       ## positive integer (k < n)
    samp_method = 'extreme',     ## string ('balance' or 'extreme')
    ## phi relevance arguments
    rel_thres = 0.80,            ## positive real number (0 < R < 1)
    rel_method = 'auto',         ## string ('auto' or 'manual')
    rel_xtrm_type = 'high',      ## string ('low' or 'both' or 'high')
    rel_coef = 2.25              ## positive real number (0 < R)
)
I don't think you're doing anything wrong; many smogn users report the same slowness.
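As a rough sanity check on the numbers in the question, you can estimate how the runtime grows with the number of rows. The sketch below assumes the runtime follows t ∝ n**p and takes the two quoted estimates (about 2 h for 13k rows, about 20 h for 130k rows) at face value; `scaling_exponent` is a hypothetical helper written for this illustration, not part of smogn.

```python
import math

def scaling_exponent(n_small, t_small, n_large, t_large):
    """Estimate p in t ~ n**p from two (rows, runtime) measurements."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)

# Figures quoted in the question (progress-bar estimates, not measured runs).
p = scaling_exponent(13_000, 2.0, 130_000, 20.0)
print(f"estimated scaling exponent: {p:.2f}")
```

An exponent close to 1 would mean the runtime grows roughly in proportion to the number of rows, which points to a high cost per row rather than a mistake in how smoter is called.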
The likely cause is that the implementation relies heavily on Python for loops.
The author has already said he is working on making smogn more efficient.
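To illustrate why row-by-row Python loops can be so costly, here is a minimal sketch comparing a nested-loop distance computation with a vectorized NumPy version. This is not smogn's actual code; the function names, the choice of squared Euclidean distance, and the sample size are all assumptions made for the example. It only shows the general pattern that a nearest-neighbour step (such as the one controlled by `k`) can be written either way.

```python
import time
import numpy as np

def pairwise_sq_dists_loop(X):
    """Squared Euclidean distances between all rows, using nested Python loops."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            diff = X[i] - X[j]
            D[i, j] = np.dot(diff, diff)
    return D

def pairwise_sq_dists_vectorized(X):
    """Same distances, computed with NumPy broadcasting instead of loops."""
    diff = X[:, None, :] - X[None, :, :]   # shape (n, n, n_features)
    return (diff ** 2).sum(axis=-1)

# Synthetic data: 200 rows with 3 features, matching the 3 feature columns above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

start = time.perf_counter()
D_loop = pairwise_sq_dists_loop(X)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
D_vec = pairwise_sq_dists_vectorized(X)
vec_seconds = time.perf_counter() - start

print("results match:", np.allclose(D_loop, D_vec))
print(f"loop: {loop_seconds:.4f} s, vectorized: {vec_seconds:.4f} s")
```

Both functions return the same matrix, but the looped version pays Python interpreter overhead on every pair of rows, so the gap widens quickly as the number of rows grows. Note that the broadcast version builds an n x n x n_features array, so for very large n it would need to be processed in chunks.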