Different estimates between bam and gam models (mgcv), and an interaction term with an EDF of 0


I am new to fitting gamm models and ran into two problems with my analysis.

  1. I ran the same model using the gam and the bam functions of the package mgcv. The two functions give me different estimates, and I don't understand why, or how to choose between them. Can anyone explain why they produce different estimates?

  2. I am estimating a model including an interaction between age and condition (a two-level factor). For some reason one of the interaction terms (age:conditioncomputer or age:conditioncozmo) looks weird: it always gives an EDF and chi-square of 0 and a p-value of 0.5, as if it were fixed to those values. I tried using sum-to-zero and dummy contrasts, but that did not change the output. What is weird to me is that there is a significant overall age effect, yet this effect is not significant in either condition. So I have the strong feeling that something is going wrong here.
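On point 1, one concrete difference worth ruling out first (a sketch on simulated data, not your real df): gam and bam do not share the same default smoothness-selection method (for a model like this, gam defaults to "GCV.Cp" while bam defaults to "fREML"), so their estimates can differ even on identical data. Setting a comparable method in both usually brings the fits much closer:

```r
library(mgcv)

set.seed(1)
dat <- gamSim(1, n = 400, verbose = FALSE)  # built-in mgcv example data

# Same formula, but the two functions use different default
# smoothness-selection criteria (GCV.Cp vs fREML):
m_gam <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)
m_bam <- bam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)

# Refitting both under (restricted) maximum likelihood makes the
# estimates far more comparable:
m_gam_reml  <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat, method = "REML")
m_bam_freml <- bam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat, method = "fREML")
```

REML and fREML are not numerically identical criteria, but fits selected under them are typically much closer to each other than GCV-selected fits are to either.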

Did anyone ever run into this before and can help me figure out if this is even a problem or normal, and how to solve it if it is a problem?

My model syntax is the following:

`bam(reciprocity ~ s(age, k = 8) + condition + s(age, by = condition, k = 8) + s(ID, bs = "re") + s(class, bs = "re") + s(school, bs = "re"), data = df, family = binomial(link = "logit"))`
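As a side note on the specification itself (a hedged sketch on simulated stand-in data, not a diagnosis of your model): when condition is an unordered factor, s(age) plus s(age, by = condition) requests a full smooth of age for every level on top of the overall smooth, which over-specifies the model, and the penalty can then shrink one of the by-smooths to an EDF of essentially 0. Coding condition as an ordered factor makes mgcv fit a reference smooth plus difference smooths for the non-reference level(s) instead:

```r
library(mgcv)

set.seed(1)
# Simulated stand-in for the question's data (an assumption, not the real df):
n <- 400
sim <- data.frame(
  age       = runif(n, 8, 14),
  condition = factor(sample(c("computer", "cozmo"), n, replace = TRUE)),
  ID        = factor(sample(1:80, n, replace = TRUE))
)
sim$reciprocity <- rbinom(n, 1, plogis(sin(sim$age) * (sim$condition == "cozmo")))

# Ordered factor: s(age) becomes the reference-level smooth, and the by-term
# estimates how the remaining level's smooth differs from it.
sim$condition <- as.ordered(sim$condition)

m <- bam(reciprocity ~ s(age, k = 8) + condition +
           s(age, by = condition, k = 8) + s(ID, bs = "re"),
         data = sim, family = binomial(link = "logit"))
summary(m)
```

With this coding the by-smooth is a difference smooth, so an EDF near 0 has a direct interpretation: the two conditions' age curves do not differ detectably.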

This is the model output:

*(screenshot of the model summary output; not reproduced here)*

My df looks somewhat like this:

*(screenshot of the first rows of df; not reproduced here)*

In short, I used the code below:

```r
library(tidyverse)
library(psych)
library(mgcv)
library(ggplot2)
library(cowplot)
library(patchwork)
library(rstatix)
library(car)
library(yarrr)
library(itsadug)

df <- read.csv("/Users/lucaleisten/Desktop/Private/Master/Major_project/Data/test/test.csv", sep = ",")

# Convert grouping variables and the binary outcome to factors
df$ID          <- as.factor(as.character(df$ID))
df$condition   <- as.factor(df$condition)
df$school      <- as.factor(df$school)
df$class       <- as.factor(df$class)
df$reciprocity <- as.factor(as.character(df$reciprocity))
summary(df)

model_reciprocity <- bam(reciprocity ~ s(age, k = 7) + condition +
                           s(age, by = condition, k = 7) +
                           s(ID, bs = "re") + s(class, bs = "re") +
                           s(school, bs = "re"),
                         data = df, family = binomial(link = "logit"))
summary(model_reciprocity)
```

1 Answer

Answer from Tripartio:

Based on the official documentation of bam, I surmise that gam is the more rigorous, accurate function, and you should generally prefer it. However, for large datasets gam can be very slow. (For example, I ran gam on a dataset with 20000 rows × 70 predictors on a decently fast computer, and I gave up and terminated the process after it had been running for over 24 hours.) bam is implemented differently so that it runs much more efficiently on large datasets, but as far as I can tell it achieves this by taking a few shortcuts that make its results less stable than gam's on the same dataset. That said, for very large datasets the differences are apparently so small that bam should be sufficiently reliable.

My understanding is based on the official documentation:

> **WARNINGS**
>
> The routine may be slower than optimal if the default "tp" basis is used.
>
> This routine is less stable than `gam` for the same dataset.
>
> With discrete=TRUE, te terms are efficiently computed, but t2 are not.

> **Details**
>
> When discrete=FALSE, bam operates by first setting up the basis characteristics for the smooths, using a representative subsample of the data. Then the model matrix is constructed in blocks using predict.gam. For each block the factor R from the QR decomposition of the whole model matrix is updated, along with Q'y and the sum of squares of y. At the end of block processing, fitting takes place, without the need to ever form the whole model matrix.
>
> … For full method details see Wood, Goude and Shaw (2015).
>
> Note that POI is not as stable as the default nested iteration used with gam, but that for very large, information rich, datasets, this is unlikely to matter much.
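If runtime is the main concern, the same help page also documents the discrete = TRUE option mentioned in the warning above. A minimal sketch on simulated data (the actual speed-up will vary with the dataset and machine):

```r
library(mgcv)

set.seed(2)
big <- gamSim(1, n = 10000, verbose = FALSE)  # simulated large-ish dataset

# discrete = TRUE discretizes covariate values so bam can use faster fitting
# methods; nthreads enables multi-threaded computation.
m_fast <- bam(y ~ s(x0) + s(x1) + s(x2) + s(x3),
              data = big, method = "fREML", discrete = TRUE, nthreads = 2)
```

Note that discretization is itself one of the approximations the warning refers to, so it trades a little accuracy for speed in the same spirit as bam overall.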