I am new to fitting GAMMs and ran into two problems with my analysis.
I ran the same model using the gam and the bam functions of the package mgcv. The models give me different estimates, and I don't understand why, or how to choose which function to use. Can anyone explain why these functions give different estimates?
I am estimating a model including an interaction between age and condition (a factor with two levels). For some reason, one of the interaction terms (age:conditioncomputer or age:conditioncozmo) looks weird: it always gives an EDF and chi-square of 0 and a p-value of 0.5, as if it were fixed at those values. I tried using sum-to-zero and dummy contrasts, but that did not change the output (see also the alternative parameterisation sketch after my code below). What is weird to me is that there is a significant age effect, yet this effect is not significant in either condition, so I have the strong feeling that something is going wrong here.
Has anyone run into this before and can help me figure out whether this is even a problem or normal behaviour, and how to solve it if it is a problem?
My model syntax is the following:
`bam(reciprocity ~ s(age,k=8) + condition + s(age, by = condition, k=8) + s(ID, bs="re") + s(class, bs="re") + s(school, bs="re"), data=df, family=binomial(link="logit"))`
This is the model output:
My df looks somewhat like this:
In short, I've used the code below:
```r
library(tidyverse)
library(psych)
library(mgcv)
library(ggplot2)
library(cowplot)
library(patchwork)
library(rstatix)
library(car)
library(yarrr)
library(itsadug)
df <- read.csv("/Users/lucaleisten/Desktop/Private/Master/Major_project/Data/test/test.csv", sep=",")
df$ID <- as.factor(as.character(df$ID))
df$condition <- as.factor(df$condition)
df$school <- as.factor(df$school)
df$class <- as.factor(df$class)
df$reciprocity <- as.factor(as.character(df$reciprocity))
summary(df)
model_reciprocity <- bam(reciprocity ~ s(age, k = 7) + condition + s(age, by = condition, k = 7) +
                           s(ID, bs = "re") + s(class, bs = "re") + s(school, bs = "re"),
                         data = df, family = binomial(link = "logit"))
summary(model_reciprocity)
```
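In case it is relevant, below is a sketch of an alternative way to set up the factor-by smooths that I have seen suggested in mgcv/itsadug tutorials: recoding condition as an ordered factor so the by-smooths become difference smooths relative to a reference condition. I have not verified that this changes anything for my data, and `condition_o` / `model_reciprocity_diff` are just names I made up here.

```r
# Sketch (untested on my data): with an ordered factor, s(age, by = condition_o)
# fits difference smooths relative to the reference level, alongside s(age)
df$condition_o <- as.ordered(df$condition)
contrasts(df$condition_o) <- "contr.treatment"

model_reciprocity_diff <- bam(reciprocity ~ s(age, k = 7) + condition_o +
                                s(age, by = condition_o, k = 7) +
                                s(ID, bs = "re") + s(class, bs = "re") + s(school, bs = "re"),
                              data = df, family = binomial(link = "logit"))
summary(model_reciprocity_diff)
```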
Based on the official documentation of `bam`, I surmise that `gam` is the more rigorous, accurate function, and you should generally prefer it. However, for large datasets `gam` can be very slow. (For example, I ran `gam` on a dataset with 20,000 rows × 70 predictors on a decently fast computer and gave up after it had kept running for over 24 hours, so I terminated the process.) `bam` is implemented differently so that it runs much more efficiently on large datasets, but, as far as I can tell, it does this by taking a few shortcuts, which means its results are less stable than those of `gam` on the same dataset. That said, for very large datasets the differences are apparently so minimal that `bam` should be sufficiently reliable.

My understanding is based on the official documentation of `gam` and `bam`.
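To make that concrete, here is a rough sketch of how the two functions are typically called side by side, reusing the formula from the question (illustrative only). Note that, as far as I know, `gam()` and `bam()` use different default smoothness-selection methods (`GCV.Cp` vs `fREML`), which by itself can lead to somewhat different estimates, and that `bam()` offers `discrete = TRUE` and `nthreads` for further speed-ups on large data.

```r
library(mgcv)

# Illustrative sketch, reusing the formula from the question
form <- reciprocity ~ s(age, k = 7) + condition + s(age, by = condition, k = 7) +
  s(ID, bs = "re") + s(class, bs = "re") + s(school, bs = "re")

# gam(): the "reference" fit; its default method is GCV.Cp, so set method
# explicitly if you want it more comparable to bam()'s (f)REML
m_gam <- gam(form, data = df, family = binomial, method = "REML")

# bam(): defaults to method = "fREML"; much faster on large data
m_bam <- bam(form, data = df, family = binomial)

# bam() with discretised covariates and parallel threads: fastest on big data
m_bam_fast <- bam(form, data = df, family = binomial, discrete = TRUE, nthreads = 2)

# The fits should be close, but small differences between them are expected
summary(m_gam); summary(m_bam); summary(m_bam_fast)
```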