I've noticed that many R modelling functions accept a "weights" argument (e.g. cart, loess, gam, ...). Most of the help pages describe it as "prior weights" for the data, but what does that actually mean?
I have data with many repeated cases and a binary response. I was hoping I could use "weights" to encode how many times each combination of inputs and response occurs, but this doesn't seem to work. I've also tried making the response the proportion of successes and the weight the total number of trials for each combination of covariates, but this doesn't seem to work either (at least for gam). I'd like to do this for all of the model types listed above, but for starters: how can I do it for gam (mgcv package)?
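To make the two attempts concrete, here is a rough sketch with invented data; all object names below (x, trials, succ, dat1, dat2, ...) are made up for illustration and are not from my real data:

    library(mgcv)

    ## Invented example: 30 distinct covariate values, each observed several times
    set.seed(1)
    x      <- seq(0, 1, length.out = 30)
    trials <- sample(5:15, 30, replace = TRUE)             # repetitions per x value
    succ   <- rbinom(30, size = trials, prob = plogis(3 * x - 1.5))
    succ   <- pmin(pmax(succ, 1), trials - 1)              # keep both outcomes present

    ## Attempt 1: one row per distinct (x, y) combination, with the frequency
    ## of that combination passed as "weights"
    dat1 <- data.frame(x = rep(x, 2),
                       y = rep(c(1, 0), each = 30),
                       n = c(succ, trials - succ))
    fit1 <- gam(y ~ s(x), family = binomial, weights = dat1$n, data = dat1)

    ## Attempt 2: response = proportion of successes at each covariate value,
    ## with the total number of trials passed as "weights"
    dat2 <- data.frame(x = x, prop = succ / trials, trials = trials)
    fit2 <- gam(prop ~ s(x), family = binomial, weights = dat2$trials, data = dat2)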
I also used to think the weights were a convenient way of encoding sample sizes for repeated observations. But the following example shows that this is not the case for a simple linear model. I first define a contingency table with observed/invented shoe sizes and heights of people and fit a least squares regression specifying the frequencies as the weights:
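The code looked roughly like this. The particular numbers below are invented for illustration, so the estimates and p-values will not match my output exactly, but the structure is the same: 12 distinct combinations and 165 observations in total.

    ## Frequency-form contingency table: one row per distinct combination of
    ## height and shoe size, with Freq giving how often it was observed
    ## (these particular numbers are invented for illustration)
    tab <- data.frame(
      height = seq(170, 192, by = 2),
      shoe   = c(41, 43, 42, 41, 44, 42, 44, 43, 42, 45, 43, 44),
      Freq   = c( 4,  8, 11, 15, 19, 22, 24, 20, 17, 12,  8,  5)
    )
    sum(tab$Freq)   # 165 observations in total, spread over 12 rows

    ## Least-squares fit with the frequencies supplied as weights
    fit_weighted <- lm(shoe ~ height, data = tab, weights = Freq)
    summary(fit_weighted)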
Notice that the coefficient for the slope is non-significant and that the residual standard error is based on "10 degrees of freedom".
This changes when I convert the contingency table into the "raw" data, i.e. one row per observation, using the convenience function expand.dft:
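Continuing the sketch above (expand.dft is assumed here to be the version shipped in the vcdExtra package; essentially the same function also circulates as a stand-alone snippet on R-help):

    ## Expand the frequency table to case form: one row per observation
    library(vcdExtra)
    raw <- expand.dft(tab, freq = "Freq")
    nrow(raw)   # 165 rows

    ## Base-R equivalent, in case expand.dft is not available:
    ## raw <- tab[rep(seq_len(nrow(tab)), tab$Freq), c("height", "shoe")]

    ## Ordinary (unweighted) least-squares fit on the expanded data
    fit_raw <- lm(shoe ~ height, data = raw)
    summary(fit_raw)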
We obtain the identical coefficient, but this time it is highly significant, because the fit is now based on "163 degrees of freedom".