When setting l1_ratio = 0, the elastic net reduces to Ridge regression.
However, I am unable to match the results obtained from scikit-learn's RidgeCV and ElasticNetCV. They appear to produce very different optimal alpha values:
import numpy as np
from sklearn.linear_model import ElasticNetCV, RidgeCV
from sklearn.metrics import mean_squared_error

# data generation
np.random.seed(123)
beta = 0.35
N = 120
p = 30
X = np.random.normal(1, 2, (N, p))
y = np.random.normal(5, size=N) + beta * X[:, 0]
# lambdas to try:
l = np.exp(np.linspace(-2, 8, 80))
ridge1 = RidgeCV(alphas=l, store_cv_values=True).fit(X, y)
MSE_cv = np.mean(ridge1.cv_values_, axis=0)  # mean LOO squared error per alpha
y_pred = ridge1.predict(X=X)
MSE = mean_squared_error(y_true=y, y_pred=y_pred)
print(f"best alpha: {np.round(ridge1.alpha_, 3)}")
print(f"MSE: {np.round(MSE, 3)}")
which yields best alpha: 305.368, MSE: 0.952
While ElasticNetCV ends up with a similar MSE, its penalty parameters seem to be on a different scale (one that actually agrees with the R implementation):
ridge2 = ElasticNetCV(cv=10, alphas=l, random_state=0, l1_ratio=0)
ridge2.fit(X, y)
y_pred = ridge2.predict(X=X)
MSE = mean_squared_error(y_true=y, y_pred=y_pred)
print(f"best alpha: {np.round(ridge2.alpha_, 3)}")
print(f"MSE: {np.round(MSE, 3)}")
yielding best alpha: 2.192, MSE: 0.934
Are the penalties defined differently? Does one maybe divide by N? Or is it due to the very different cross-validation strategies?
Yes, that's the cause of the discrepancy. In elastic net, the regularisation part of the cost function is scaled by the number of samples relative to the error term. This is not the case for RidgeCV. Therefore, to make the cost functions equivalent, we'd need to divide the ElasticNetCV alphas by the size of the training fold.

RidgeCV uses an internal, efficient LOO scheme. We can control for the CV scheme by setting cv=LeaveOneOut() in ElasticNetCV, so that both models use LOO (ElasticNetCV would otherwise default to 5-fold CV). When using cv=LeaveOneOut() with ElasticNet, the training fold size is n_samples - 1, so that's what we need to scale by.
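To see where that factor comes from (a sketch of mine, not part of the original answer): per the scikit-learn docs, ElasticNet minimises 1/(2n) * ||y - Xw||^2 + alpha * l1_ratio * ||w||_1 + alpha/2 * (1 - l1_ratio) * ||w||^2, while Ridge minimises ||y - Xw||^2 + alpha * ||w||^2. With l1_ratio=0, multiplying the elastic-net objective by 2n shows that Ridge with alpha_ridge = n * alpha_enet solves the same problem. A quick check, reusing X, y and N from the question:

from sklearn.linear_model import ElasticNet, Ridge

a = 2.0  # an arbitrary elastic-net penalty
enet = ElasticNet(alpha=a, l1_ratio=0).fit(X, y)  # may warn; Ridge is the preferred way to fit this
ridge = Ridge(alpha=a * N).fit(X, y)              # same objective after scaling by n = N
print(np.max(np.abs(enet.coef_ - ridge.coef_)))   # ~0, up to solver tolerance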
Unlike RidgeCV, ElasticNetCV doesn't retain the CV scores for each alpha. I added a 'manual' version of ElasticNetCV where I combined ElasticNet with GridSearchCV using LOO, which gave me access to the MSE for each alpha (for comparison with RidgeCV). After applying the requisite scaling, the results line up:
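For reference, a minimal sketch of what that setup could look like (my reconstruction from the description above, not the exact code; it reuses X, y, N, l and ridge1 from the question):

from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.model_selection import GridSearchCV, LeaveOneOut

loo = LeaveOneOut()
alphas_scaled = l / (N - 1)  # divide by the LOO training fold size

# ElasticNetCV with LOO and rescaled alphas:
enet_cv = ElasticNetCV(cv=loo, alphas=alphas_scaled, l1_ratio=0).fit(X, y)

# 'manual' ElasticNetCV: ElasticNet + GridSearchCV exposes the per-alpha
# CV scores via cv_results_, which ElasticNetCV itself doesn't retain
grid = GridSearchCV(
    ElasticNet(l1_ratio=0),
    param_grid={"alpha": alphas_scaled},
    scoring="neg_mean_squared_error",
    cv=loo,
).fit(X, y)
MSE_cv_enet = -grid.cv_results_["mean_test_score"]  # comparable to MSE_cv above

# mapped back to the Ridge scale, the selected penalties should agree:
print(ridge1.alpha_, enet_cv.alpha_ * (N - 1), grid.best_params_["alpha"] * (N - 1))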
The calculation of the effective dof is my best guess at how it's done, based on the supplied links (see comments in code).
According to https://scikit-learn.org/stable/glossary.html#term-cross-validation-estimator, RidgeCV with LOO CV, unlike other CV estimators, does not refit on the entire dataset (but rather refits on n_samples - 1 samples due to LOO). This is why its effective dof is slightly different from that of the other estimators, which do refit on the entire dataset (n_samples).
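For completeness, here is a minimal sketch of one standard way to compute the ridge effective dof, df(alpha) = sum_j d_j^2 / (d_j^2 + alpha) over the singular values d_j of the centred design matrix (e.g. Elements of Statistical Learning, eq. 3.50); the answer's exact calculation may differ:

def ridge_effective_dof(X, alpha):
    # centre X because the intercept is unpenalised
    d = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    return np.sum(d**2 / (d**2 + alpha))

print(ridge_effective_dof(X, ridge1.alpha_))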