Why does summary overestimate the R-squared with a "no-intercept" model formula

Question

Why does summary overestimate the R-squared with a "no-intercept" model formula

8.8k views Asked by SESman At 02 December 2013 at 16:54

I wanted to make a simple linear model (lm()) without intercept coefficient so I put -1 in my model formula as in the following example. The problem is that the R-squared return by summary(myModel) seems to be overestimated. lm(), summary() and -1 are among the very classic function/functionality in R. Hence I am a bit surprised and I wonder if this is a bug or if there is any reason for this behaviour.

Here is an example:

x <- rnorm(1000, 3, 1)
mydf <- data.frame(x=x, y=1+x+rnorm(1000, 0, 1))
plot(y ~ x, mydf, xlim=c(-2, 10), ylim=c(-2, 10))

mylm1 <- lm(y ~ x, mydf)
mylm2 <- lm(y ~ x - 1, mydf)

abline(mylm1, col="blue") ; abline(mylm2, col="red")
abline(h=0, lty=2) ; abline(v=0, lty=2)

r2.1 <- 1 - var(residuals(mylm1))/var(mydf$y)
r2.2 <- 1 - var(residuals(mylm2))/var(mydf$y)
r2 <- c(paste0("Intercept - r2: ", format(summary(mylm1)$r.squared, digits=4)),
        paste0("Intercept - manual r2: ", format(r2.1, digits=4)),
        paste0("No intercept - r2: ", format(summary(mylm2)$r.squared, digits=4)),
        paste0("No intercept - manual r2: ", format(r2.2, digits=4)))
legend('bottomright', legend=r2, col=c(4,4,2,2), lty=1, cex=0.6)

enter image description here

Original Q&A

There are 2 answers

Tomas On 02 December 2013 at 17:09

Oh yeah, I fell into this trap too! Very good question!! It is because

enter image description here

and

enter image description here

in case of model with intercept (your mylm1), the y̅ is mean(y_i) - this is what you expect, this is the SS_tot you basicly want for proper R²
whereas in case of model without intercept, the y̅ is taken as 0 - so the SS_tot will be very high, so the R² will be very close to 1! SS_res will differ according to the worse fit (will be little higher without intercept), but not much.

Code:

attach(mylm1) # in general be careful with attach, here only for code clarity

y_fit <- mylm1$fitted.values
SSE <- sum((y_fit - y)^2)
SST <- sum((y - mean(y))^2)
1-SSE/SST  # R^2 with intercept

y_fit2 <- mylm2$fitted.values
SSE2 <- sum((y_fit2 - y)^2) # SSE2 only slightly higher than SSE..
SST2 <- sum((y - 0)^2)  # !!! the key difference is here !!!
1-SSE2/SST2 # R^2 without intercept

Note: It is not clear to me why in the model without intercept the y̅ is 0 and not mean(y_i), but that's how it is. I myself found out hard way by investigating and hacking with the above code..

**Dirk is no longer here** · Accepted Answer · 2013-12-02T17:00:38+00:00

Your formula is wrong. Here is what I do in fastLm.R in RcppArmadillo when computing the summary:

## cf src/library/stats/R/lm.R and case with no weights and an intercept 
f <- object$fitted.values    
r <- object$residuals         
mss <- if (object$intercept) sum((f - mean(f))^2) else sum(f^2)     
rss <- sum(r^2)   

r.squared <- mss/(mss + rss) 
df.int <- if (object$intercept) 1L else 0L    

n <- length(f)     
rdf <- object$df     
adj.r.squared <- 1 - (1 - r.squared) * ((n - df.int)/rdf)

There are two places where you need to keep track of whether there is an intercept or not.

TechQA.

Why does summary overestimate the R-squared with a "no-intercept" model formula

There are 2 answers

Related Questions in R

Related Questions in SUMMARY

Related Questions in INTERCEPT

Related Questions in LM

Popular Questions

Trending Questions