I have a large Generalized Additive Model (GAM) fitted to 10K observations with ~100 variables. Running forward stepwise selection on it returns an object of class "NULL". Why might this be, and how do I resolve it?
library(gam)
load(url("https://github.com/cornejom/DataSets/raw/master/mydata.Rdata"))
load(url("https://github.com/cornejom/DataSets/raw/master/mygam.Rdata"))
myscope <- gam.scope(mydata, response = 3, arg = "df=4") #Target var in 3rd col.
mygam.step <- step.gam(mygam, myscope, direction = "forward")
mygam.step
NULL
The code that was used to fit mygam from mydata is:
library(gam)
#Identify numerical variables, but exclude the integer response.
numbers = sapply(mydata, class) %in% c("integer", "numeric")
numbers[match("Response", names(mydata))] = FALSE
#Identify factor variables.
factors = sapply(mydata, class) == "factor"
#Create a formula to feed into gam function.
myformula = paste0(paste0("Response ~ ",
                          paste0("s(", names(mydata)[numbers], ", df=4)", collapse = " + ")
                          ),
                   " + ",
                   paste0(paste0(names(mydata)[factors], collapse = " + ")))
mygam = gam(as.formula(myformula), family = "binomial", mydata)
I suspect the issue is with the mygam object.
Explanation
If you read help(step.gam), the description of the scope argument says, in essence, that the first argument passed to step.gam (mygam in this case) carries a formula, and that this formula is used as the starting model for the stepwise procedure.
Since this is forward stepwise selection, it cannot start from the full model, because in that case there is nothing left to add.
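One way to see this before calling step.gam is to compare the terms already in the starting formula with the candidates listed in the scope. The sketch below uses base R only. The variable names (y, x1, x2, f1), the helper at_ceiling, and the hand-written scope list are hypothetical; the scope is written in the shape that gam.scope is assumed to produce (one formula of candidate forms per variable).

```r
# Hypothetical scope: one formula per variable, listing its candidate forms
# from simplest (left out, "1") to richest (smooth term).
scope <- list(
  x1 = ~ 1 + x1 + s(x1, df = 4),
  x2 = ~ 1 + x2 + s(x2, df = 4),
  f1 = ~ 1 + f1
)

# For each scope entry, report whether its richest (last) candidate
# is already present in the formula f.
at_ceiling <- function(f, scope) {
  present <- attr(terms(f), "term.labels")
  vapply(scope, function(s) {
    candidates <- attr(terms(s), "term.labels")
    candidates[length(candidates)] %in% present
  }, logical(1))
}

full  <- y ~ s(x1, df = 4) + s(x2, df = 4) + f1  # like the model in the question
empty <- y ~ 1                                   # intercept-only start

at_ceiling(full, scope)   # every entry TRUE: nothing left to add
at_ceiling(empty, scope)  # every entry FALSE: everything can still be added
```

If every entry is TRUE for the starting formula, a forward search has no candidate it could add.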
Exploring The Code
This idea is reinforced by the code. The step.gam function contains a loop that runs for forward selection, and the body of that loop executes only when an inner if statement is TRUE. That if statement appears to check whether the scope still holds candidate forms (term.length) that are not yet in the model (items, trial). If it does not, the loop skips them.
Since in your case the loop body never executes, the return object is never formed and the procedure returns NULL.
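The control flow described above can be illustrated with a toy function. This is not the step.gam source, only a simplified sketch of the same pattern: the result is built inside a guarded loop, so it stays NULL when the guard is never TRUE. The function name forward_sketch is hypothetical; items, trial and term.length are borrowed from the description above, with items taken to be the current position of each term within its list of candidates.

```r
# Toy illustration only, NOT the real step.gam code.
# items[i]       : position of term i among its candidate forms
# term.length[i] : number of candidate forms available for term i
forward_sketch <- function(items, term.length) {
  result <- NULL                       # nothing to return yet
  for (i in seq_along(items)) {
    if (items[i] < term.length[i]) {   # a richer form of term i is available
      trial <- items
      trial[i] <- trial[i] + 1         # try the next candidate for term i
      result <- c(result, list(trial)) # stand-in for "fit and record the model"
    }
  }
  result                               # still NULL if the guard never fired
}

# Every term already at its richest form: the guard never fires.
forward_sketch(items = c(3, 3, 2), term.length = c(3, 3, 2))

# Every term at its simplest form: three trial models are recorded.
forward_sketch(items = c(1, 1, 1), term.length = c(3, 3, 2))
```

The first call returns NULL for the same reason described above: no term has a candidate left to move to.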
The Solution
Given all of the above, the solution is to not start from the complete formula when using forward selection. For demonstration, use an intercept-only model (Response ~ 1) as the starting model. The line that fits mygam is the only change that needs to be made; everything else stays the same as in the original post.
With that change the procedure works.
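A minimal sketch of this fix on simulated data, so that it runs quickly and without downloading anything. The data frame toy and its columns (y, x1, x2) are made up for illustration. The sketch also assumes that recent releases of the gam package export the stepwise function as step.Gam rather than step.gam, and picks whichever name is available.

```r
library(gam)

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
# Simulated binary response that depends on x1 only (illustrative data).
toy <- data.frame(y = rbinom(n, 1, plogis(2 * x1)), x1 = x1, x2 = x2)

# Candidate forms for every predictor; the response is in column 1.
scope <- gam.scope(toy, response = 1, arg = "df=4")

# Intercept-only starting model instead of the full model.
start <- gam(y ~ 1, family = binomial, data = toy)

# Assumption: newer gam versions name the function step.Gam, older ones step.gam.
step_fun <- if ("step.Gam" %in% getNamespaceExports("gam")) gam::step.Gam else gam::step.gam

fit <- step_fun(start, scope, direction = "forward", trace = FALSE)
is.null(fit)
```

On the data from the post, the equivalent change would be fitting mygam = gam(Response ~ 1, family = "binomial", mydata) before calling step.gam with the same myscope.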