How to see the performance of all GAM models when select=TRUE


I ran a GAM with variable selection, but I want to evaluate the output for every combination of variables, not just for the best model, so I can compare them. I am using the mgcv package in R; is there a command for this kind of model evaluation (before I start coding many loops...)?

Example:

    library(mgcv)
    set.seed(3); n <- 200
    dat <- gamSim(1, n = n, scale = .15, dist = "poisson")
    dat$x4 <- runif(n, 0, 1); dat$x5 <- runif(n, 0, 1) ## spurious
    b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3) + s(x4) + s(x5),
             data = dat, family = poisson, select = TRUE, method = "REML")

If I use summary(b), I only see the results of the best model.


There are 2 answers

Lorenzo M

So, if I'm following you correctly, you want to see the model output for 2^N - 1 models, where N is the number of variables, since each variable is either in a model or not. You would need to build a vector containing every possible model specification, then map over it, fitting each specification and storing the result you want in a list.
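That brute-force approach might look like the following sketch, reusing the simulated data from the question (the object names `formulas`, `fits`, and `aic` are illustrative, not from any package):

```r
library(mgcv)

set.seed(3); n <- 200
dat <- gamSim(1, n = n, scale = .15, dist = "poisson")
dat$x4 <- runif(n, 0, 1); dat$x5 <- runif(n, 0, 1)

terms <- paste0("s(x", 0:5, ")")

## every non-empty subset of the six smooths: 2^6 - 1 = 63 formulas
formulas <- unlist(lapply(seq_along(terms), function(k)
  combn(terms, k, FUN = function(v) reformulate(v, response = "y"),
        simplify = FALSE)), recursive = FALSE)

## fit every candidate model and rank by AIC
fits <- lapply(formulas, gam, data = dat, family = poisson, method = "REML")
aic  <- vapply(fits, AIC, numeric(1))
formulas[[which.min(aic)]]  ## formula of the best subset by AIC
```

Note that fitting all 63 models takes a while, and all-subsets selection by AIC has the usual multiple-comparison caveats.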

Gavin Simpson

You misunderstand what select = TRUE is doing; there really is only one model here.

In a standard GAM fitted by mgcv, the wiggliness of each smooth in the model is determined during fitting by estimating parameters to minimise a penalised likelihood criterion. The penalty (or penalties) for each smooth can penalise wiggliness (typically curvature of the smooth, via a penalty on the squared second derivative), but it can't penalise any functions in the spline basis that are perfectly smooth (i.e. a straight line or linear function of the covariate). This is because a perfectly smooth component has no curvature; the slope doesn't change. Such functions are said to be in the penalty null space.

What Marra and Wood (2011) showed were two ways to add additional penalties to each smooth such that the penalisation applies to both the wiggly functions and the functions in the null space. select = TRUE is one of these two options.

What you have then is a model where the penalties pull the smooth towards a linear function and pull the linear function towards 0 (a flat function). In other words we say the smooths are shrunk towards 0.

With select = TRUE, the model selection process is therefore more like the model selection approach known as the LASSO.

Whilst I say there is one model, there really is an infinite number of models, as you get a different model for all combinations of values of the "smoothness" parameters that control how much the penalty (wiggliness or null-space) affects the penalised likelihood. But this is the same as saying there are an infinite number of ordinary least squares (linear regression) models because the parameters of that model can take any real value. These parameters, just like the smoothness penalties in the GAM, are updated during fitting to arrive at a model. It just happens that the specific form of the null-space penalties implied by using select = TRUE ends up doing model selection for you.

Note that you pay the price for not knowing if a variable should be in the model or not; the reference degrees of freedom (Ref.df column in summary(model) output) are whatever value of k you set in the smooth. I.e. you pay the full cost of not knowing if the smooth uses k basis functions or ~0 (when a term is shrunk out of the model) or somewhere in between; you always pay the cost of k basis functions.
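You can see both effects, the shrinkage and the fixed reference degrees of freedom, in the smooth-term table returned by summary(). With the question's model, the spurious smooths s(x4) and s(x5) should come out with edf close to zero, while their Ref.df stays near k - 1 (9 for the default thin plate basis):

```r
library(mgcv)

set.seed(3); n <- 200
dat <- gamSim(1, n = n, scale = .15, dist = "poisson")
dat$x4 <- runif(n, 0, 1); dat$x5 <- runif(n, 0, 1)

b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3) + s(x4) + s(x5),
         data = dat, family = poisson, select = TRUE, method = "REML")

## edf vs Ref.df for each smooth; edf near 0 means the term was shrunk out
summary(b)$s.table[, c("edf", "Ref.df")]
```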