I'm trying to simulate GLM functions in R including stochastic uncertainty. I compared a formula-based approach to the R-based simulate()
function and get different results. Not sure what I (probably its me and not R) am doing wrong.
I start by creating a simulation cohort:
set.seed(1)
library(MASS)
d <- mvrnorm(n=3000, mu=c(30,12,60), Sigma=matrix(data=c(45, 5, 40, 5, 15, 13, 40, 13, 300), nrow=3))
d[,1] <- d[,1]^2
Fit the model:
m <- glm(formula=d[,1]~d[,2] + d[,3], family=gaussian(link="sqrt"))
predict on linear scale (using formula)
p_linear <- m$coefficients[1] + m$coefficients[2]*d[,2] + m$coefficients[3]*d[,3]
Compare both approaches' predictions including the inverse link function. It seems to be similar:
sum((predict(object=m, type="response")-p_linear^2)^2)==0
Add stochastic part to both predictions:
sd_residuals <- sd(p_linear^2 - d[,1])
p <- p_linear^2 + rnorm(n=1000, mean=0, sd=sd_residuals)
p_simulate <- simulate(m)$sim_1
Compare both predictions (formula-based and simulate()
based) to original data in plot:
par(mfrow=c(1,2), mar=c(4,2,2,1))
plot(d[,1], p, col="blue", pch=20, cex=0.5, xlim=c(-500, 3000), ylim=c(-500, 3000))
abline(lm(p~d[,1]), col="blue")
points(d[,1], p_simulate, col="red", cex=0.5)
abline(lm(p_simulate~d[,1]), col="red")
abline(a=0, b=1)
...and compare both predictions:
plot(p, p_simulate, col="green", cex=0.5, xlim=c(-500, 3000), ylim=c(-500, 3000))
abline(lm(p_simulate~p), col="green")
abline(a=0, b=1)
It seems that the formula-based predictions reflect similar stochastic uncertainty as the original data. The r-based simulate()
function approach seems to represent less stochastic uncertainty than the original data. Is the formula-based approach correct? If not, how should I adapt it? (out of interest: what causes the difference between both approaches?)
Eventually, I want to use the GLM model on other than the data as input values for the predictions. I also want to apply it in Excel. Therefore, I'm looking for a formula-based approach and not using R-packages like simulate()
.
Any tips on reading material for predictions including stochastic in GLM in R are welcome. I googled a lot, but have difficulties finding formula-based examples in R for other than family/link is Gaussian/identity.