Why is gam::step.gam returning NULL with forward selection?


I have a large Generalized Additive Model (GAM) fit to 10K observations with ~100 variables. Running forward stepwise selection on this model returns an object of class "NULL". Why might this be, and how do I resolve it?

library(gam)

load(url("https://github.com/cornejom/DataSets/raw/master/mydata.Rdata"))
load(url("https://github.com/cornejom/DataSets/raw/master/mygam.Rdata"))

myscope <- gam.scope(mydata, response = 3, arg = "df=4") #Target var in 3rd col.
mygam.step <- step.gam(mygam, myscope, direction = "forward")

mygam.step
NULL

The code that was used to fit mygam from mydata is:

library(gam)

#Identify numerical variables, but exclude the integer response.
numbers = sapply(mydata, class) %in% c("integer", "numeric")  
numbers[match("Response", names(mydata))] = FALSE 

#Identify factor variables.
factors = sapply(mydata, class) == "factor"

#Create a formula to feed into gam function.
myformula = paste0(paste0("Response ~ ", 
                          paste0("s(", names(mydata)[numbers], ", df=4)", collapse = " + ")
                          ),
                   " + ",
                   paste0(paste0(names(mydata)[factors], collapse = " + ")))

mygam = gam(as.formula(myformula), family = "binomial", mydata)

Answer by Karolis Koncevičius (best answer)

I suspect the issue is with the mygam object.

Explanation

If you read help(step.gam), the description of the scope argument includes this paragraph:

The supplied model ‘object’ is used as the starting model, and hence there is the requirement that one term from each of the term formulas be present in ‘formula(object)’. This also implies that any terms in ‘formula(object)’ not contained in any of the term formulas will be forced to be present in every model considered. The function ‘gam.scope’ is helpful for generating the scope argument for a large model.

In essence, this says that the formula of the first argument passed to step.gam (mygam in this case) is used as the starting model for the stepwise procedure.

With forward selection, the procedure cannot start from the full model, because in that case there is nothing left to add.
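A quick way to see this is to compare the terms of the starting formula with each entry of the scope. The sketch below uses only base R and a made-up two-variable scope: the variables x1 and x2, the response y, and the helper scope_position are hypothetical and not part of the gam package. It assumes each scope entry is a one-sided formula of the form ~ 1 + x1 + s(x1, df = 4), ordered from simplest to most complex.

```r
# Hypothetical scope: one formula per variable, listing the candidate
# terms from simplest (variable absent) to most complex (smooth term).
scope <- list(
  x1 = ~ 1 + x1 + s(x1, df = 4),
  x2 = ~ 1 + x2 + s(x2, df = 4)
)

# For each scope entry, report which position the model currently uses
# (1 means the variable is absent) and how many choices the entry offers.
scope_position <- function(model_formula, scope) {
  used <- attr(terms(model_formula), "term.labels")
  t(sapply(scope, function(f) {
    choices <- c("1", attr(terms(f), "term.labels"))
    pos <- match(used, choices)
    pos <- pos[!is.na(pos)]
    c(current = if (length(pos)) max(pos) else 1L,
      available = length(choices))
  }))
}

full_model <- y ~ s(x1, df = 4) + s(x2, df = 4)  # like the original mygam
null_model <- y ~ 1                              # intercept only

full_pos <- scope_position(full_model, scope)
null_pos <- scope_position(null_model, scope)

print(full_pos)  # current equals available in every row: nothing to add
print(null_pos)  # current is 1 in every row: every variable can be added
```

If every row of the first table has current equal to available, a forward search started from that model has no step it can take.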

Exploring The Code

This idea is reinforced if we look at the code. The step.gam function contains this block, which runs for each term i in the case of forward selection:

if (forward) {
    trial <- items
    trial[i] <- trial[i] + 1
    if (trial[i] <= term.lengths[i] && !get.visit(trial,
      visited)) {
      visited <- cbind(visited, trial)
      tform.vector <- form.vector
      tform.vector[i] <- scope[[i]][trial[i]]
      form.list = c(form.list, list(list(trial = trial,
        form.vector = tform.vector, which = i)))
    }
}

Notice that a candidate model is added only when the inner if condition is TRUE. That condition seems to check whether the scope still offers a term for variable i (term.lengths) beyond the one currently in the model (items, trial), and that this combination has not been visited yet. If it does not, nothing is added for that variable.

Since in your case the condition is never TRUE, no candidate models are generated, the return object is never formed, and the procedure returns NULL.
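The effect can be reproduced with a stripped-down version of this logic. This is a toy re-creation in base R, not the package's actual code: the function forward_candidates is hypothetical, the visited check is left out, and a scope with two variables offering three choices each is assumed.

```r
# Toy version of the forward-candidate search (illustration only).
# items[i]        : position currently used in scope entry i
# term.lengths[i] : number of choices that scope entry i offers
forward_candidates <- function(items, term.lengths) {
  candidates <- list()
  for (i in seq_along(items)) {
    trial <- items
    trial[i] <- trial[i] + 1  # try the next, more complex term
    if (trial[i] <= term.lengths[i]) {
      candidates <- c(candidates, list(trial))
    }
  }
  candidates
}

term.lengths <- c(3, 3)

# Starting from the full model: every entry is already at its last choice.
from_full <- forward_candidates(items = c(3, 3), term.lengths)

# Starting from the intercept-only model: every entry is at its first choice.
from_null <- forward_candidates(items = c(1, 1), term.lengths)

print(length(from_full))  # 0
print(length(from_null))  # 2
```

Starting from the full model yields no candidates at all, while starting from the intercept-only model yields one candidate per variable.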

The Solution

Given all the above, the solution is not to start from the complete formula when using forward selection. For demonstration, I will use the intercept-only model as the starting model:

library(gam)
load(url("https://github.com/cornejom/DataSets/raw/master/mydata.Rdata"))
mygam <- gam(Response ~ 1, family = "binomial", mydata)

The last line is the only change that needs to be made. Everything else is the same as in the original post:

myscope <- gam.scope(mydata, response = 3, arg = "df=4")
mygam.step <- step.gam(mygam, myscope, direction = "forward")

And now the procedure works.