I'm trying to fit a logistic regression with RFECV. Here is my code:
```python
import random

from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(solver="lbfgs", max_iter=1000)

# note: scikit-learn draws its randomness from numpy / random_state arguments,
# so Python's random.seed does not affect the CV splits below
random.seed(4711)

rfecv = RFECV(estimator=log_reg, scoring="accuracy", cv=10)
Model = rfecv.fit(X_train, y_train)
```
I don't think there is anything wrong with my data or my code, but the accuracy is exactly the same for almost every number of selected features:
```python
>>> Model.grid_scores_
array([0.76200776, 0.76200776, 0.76200776, 0.76200776, 0.76200776,
       0.76200776, 0.76200776, 0.76200776, 0.76200776, 0.76200776,
       0.76200776, 0.76200776, 0.76200776, 0.76200776, 0.76200776,
       0.76200776, 0.76200776, 0.76200776, 0.76200776, 0.76556425,
       0.80968999, 0.80962074])
```
How can this happen? My data set is fairly large (more than 20,000 observations), so I cannot imagine that the same cases are classified correctly in every fold of the cross-validation. But if so, how could this happen? One variable can explain as much as 19 can, but not as much as 20 could? Then why not just take the first and the 20th? I'm really confused.
I believe all your accuracies are the same because `LogisticRegression` uses L2 regularization by default; that is, `penalty='l2'` unless you pass it something else. This means that even when `Model` is using all 22 features, the underlying estimator `log_reg` is penalizing the beta coefficients with the L2 norm. So if you prune the least important features, it won't affect the accuracy, because the underlying logit model with 22 features has already pushed the coefficients of the least important features close to zero. I suggest you try:
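Something along these lines, as a minimal sketch: refit with the penalty switched off, so that removing a genuinely informative feature actually moves the CV score. The variable names here are just illustrative, and the exact argument depends on your scikit-learn version: releases that still expose `grid_scores_` take `penalty='none'`, while 1.2 and later take `penalty=None`.

```python
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# same setup as above, but with the default L2 penalty disabled
log_reg_unpen = LogisticRegression(solver="lbfgs",
                                   penalty="none",  # penalty=None on scikit-learn >= 1.2
                                   max_iter=1000)

rfecv = RFECV(estimator=log_reg_unpen,
              scoring="accuracy",
              cv=10)
Model = rfecv.fit(X_train, y_train)

print(Model.grid_scores_)       # per-subset CV accuracies (cv_results_ in newer versions)
print(Model.estimator_.coef_)   # coefficients of the final model refit on the selected features
```

If you would rather keep some regularization, a large `C` (e.g. `C=1e6`) weakens the penalty enough to approximate an unregularized fit, and you can compare the resulting `grid_scores_` against the ones above.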