gam in mgcv (R) with a large number of covariates

I would like to know if there is another way to write the function:

gam(VariableResponse ~ s(CovariateName1) + s(CovariateName2)  + ... + s(CovariateName100),
    family = gaussian(link = identity), data = MyData)

in the mgcv package without typing all 100 covariate names as above? Suppose that MyData contains only VariableResponse in column 1, CovariateName1 in column 2, and so on.

Many thanks!

There is 1 answer

Answer by Gavin Simpson

Yes, use the brute-force approach: generate a formula by pasting the covariate names together with the strings 's(' and ')', then collapse the whole thing with ' + '. Then convert the resulting string to a formula and pass that to gam(). You may need to fix the formula's environment if gam() can't find the variables you name, because it does some non-standard evaluation (NSE) on the formula to identify which terms are smooths and hence need to be replaced by a basis expansion.

library(mgcv)
set.seed(2) ## simulate some data... 
df <- gamSim(1, n=400, dist = "normal", scale = 2)

> names(df)
 [1] "y"  "x0" "x1" "x2" "x3" "f"  "f0" "f1" "f2" "f3"

We'll ignore the last 5 of those columns for the purposes of this example

df <- df[1:5]

Make the formula

fm <- paste('s(', names(df[ -1 ]), ')', sep = "", collapse = ' + ')
fm <- as.formula(paste('y ~', fm))
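As an alternative sketch of the same step, base R's reformulate() assembles the formula from a character vector of term labels, so the ' + ' collapsing and the as.formula() call happen in one go. The vector covars below is a stand-in for names(df)[-1], and fm2 is just an illustrative name; neither appears in the answer above.

```r
## Stand-in for names(df)[-1] (assumed covariate names, as in the example data)
covars <- c("x0", "x1", "x2", "x3")

## Wrap each covariate name in s( ... ) to get the smooth term labels
terms <- paste0("s(", covars, ")")

## reformulate() joins the labels with '+' and attaches the response
fm2 <- reformulate(terms, response = "y")
fm2
```

Building the formula does not evaluate s(), so this part runs without mgcv loaded; the result can be passed to gam() in the same way as fm.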

Now fit the model

m <- gam(fm, data = df)

> m

Family: gaussian 
Link function: identity 

Formula:
y ~ s(x0) + s(x1) + s(x2) + s(x3)

Estimated degrees of freedom:
2.5 2.4 7.7 1.0  total = 14.6 

GCV score: 4.050519
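On the environment caveat mentioned at the start: if the formula is built inside a helper function, its environment defaults to that function's frame, which may not be where your variables live. A minimal sketch of one way to control this is to pass the environment explicitly through the env argument of as.formula(). The helper name make_smooth_formula and its arguments are hypothetical, chosen only for illustration.

```r
## Hypothetical helper: builds 'response ~ s(a) + s(b) + ...' and
## attaches the caller's environment (or one supplied explicitly)
make_smooth_formula <- function(response, covariates, env = parent.frame()) {
  rhs <- paste0("s(", covariates, ")", collapse = " + ")
  as.formula(paste(response, "~", rhs), env = env)
}

covs <- c("x0", "x1", "x2", "x3")   # assumed covariate names

## Default: the formula's environment is that of the caller
fm_a <- make_smooth_formula("y", covs)

## Explicit: attach a chosen environment instead
e <- new.env()
fm_b <- make_smooth_formula("y", covs, env = e)
identical(environment(fm_b), e)
```

An existing formula can also be re-pointed afterwards with environment(fm) <- some_env. When all variables are columns of the data frame passed via data =, this usually does not matter.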

You do have to be careful about fitting GAMs this way, however: concurvity (the nonlinear counterpart to multicollinearity in linear models) can cause catastrophically bad estimates of the smooth functions.
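One way to check for this, sketched here on the same simulated data, is mgcv's concurvity() function, which reports indices between 0 and 1 for each term, with values near 1 signalling trouble. The object name cc is only illustrative, and the model is refitted so the snippet runs on its own.

```r
library(mgcv)

## Recreate the example data and model from above
set.seed(2)
df <- gamSim(1, n = 400, dist = "normal", scale = 2)[1:5]
fm <- as.formula(paste("y ~", paste0("s(", names(df)[-1], ")", collapse = " + ")))
m <- gam(fm, data = df)

## Overall concurvity of each term with the rest of the model;
## rows are the 'worst', 'observed' and 'estimate' measures
cc <- concurvity(m, full = TRUE)
round(cc, 2)
```

With 100 smooths it may be worth scanning the 'worst' row for values close to 1 before interpreting any individual smooth.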