R: logistic regression using a frequency table, cannot find the correct Pearson chi-square statistic


I was implementing logistic regression on the following data frame and got reasonable results (the same as with Stata). But the Pearson chi-square statistic and degrees of freedom I get from R are very different from Stata's, which in turn gives me a very small p-value. I also cannot get the area under the ROC curve. Could anyone help me figure out why residuals() does not work as expected on a glm() fit with prior weights, and how to deal with the area under the ROC curve?

Following is my code and output.

1. Data

Here is my data frame test_data; y is the outcome, x1 and x2 are covariates:

 y     x1        x2     freq
 0     No        0      268
 0     No        1       14
 0    Yes        0      109
 0    Yes        1        1
 1     No        0       31
 1     No        1        6
 1    Yes        0       45
 1    Yes        1        6

I generated this data frame from the original data by counting the occurrences of each covariate pattern and storing the counts in a new variable freq.
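
For reproducibility, the same frequency table can be entered directly in R (a sketch assuming exactly the counts shown above):

# enter the frequency table directly, one row per (y, x1, x2) pattern
test_data = data.frame(
  y    = rep(c(0, 1), each = 4),
  x1   = rep(c("No", "No", "Yes", "Yes"), times = 2),
  x2   = rep(c(0, 1, 0, 1), times = 2),
  freq = c(268, 14, 109, 1, 31, 6, 45, 6)
)

If the raw one-row-per-observation data were available (say in a data frame raw, a hypothetical name), dplyr::count(raw, y, x1, x2, name = "freq") should produce the same table.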

2. GLM Model

Then I did the logistic regression as:

logit = glm(y ~ x1 + x2, data = test_data, family = binomial, weights = freq)

Output shows:

Deviance Residuals:
     1       2       3       4       5       6       7       8
-7.501  -3.536  -8.818  -1.521  11.957   3.501  10.409   2.129

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.2010     0.1892 -11.632  < 2e-16 ***
x1            1.3538     0.2516   5.381 7.39e-08 ***
x2            1.6261     0.4313   3.770 0.000163 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 457.35  on 7  degrees of freedom
Residual deviance: 416.96  on 5  degrees of freedom
AIC: 422.96

Number of Fisher Scoring iterations: 5

The coefficients are the same as Stata's.

3. Chi Square Statistics

When I tried to calculate the Pearson chi-square statistic:

pearson_chisq = sum(residuals(logit, type = "pearson", weights=test_data$freq)^2)

I got 488, instead of the 1.3 given by Stata. Also, the degrees of freedom from R, chisq_dof = df.residual(logit) = 5, differ from Stata's 1. So I got an extremely small p-value (~e^-100).
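
For what it's worth, dropping the weights argument gives the same 488, so residuals.glm() seems to ignore it (weights is not part of that function's signature):

# same value as above; the extra weights= argument is apparently ignored
sum(residuals(logit, type = "pearson")^2)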

4. Discrimination

Then I calculated the area under the ROC curve:

library(verification)
logit_mf = model.frame(logit)
roc.area(logit_mf$y, fitted(logit))$A

The output is:

[1] 0.5

Warning message:

In wilcox.test.default(pred[obs == 1], pred[obs == 0], alternative = "great") : cannot compute exact p-value with ties

Thanks!

1 Answer

Answered by J.Liu:

I eventually figured out how to solve this problem. The data set above should be summarised into covariate patterns; the Pearson chi-square can then be calculated from its definition. The R code follows:

# extract covariate patterns: m = observations per pattern, y = positive outcomes
library(dplyr)
temp = test_data %>%
  group_by(x1, x2) %>%
  summarise(m = sum(freq), y = sum(freq * y))
temp$pred = fitted(logit)[1:4]  # fitted values repeat, so rows 1-4 cover all 4 patterns

# calculate the Pearson residual of each covariate pattern:
# (y_j - m_j * p_j) / sqrt(m_j * p_j * (1 - p_j))
temp = mutate(temp, pearson = (y - m * pred) / sqrt(m * pred * (1 - pred)))
pearson_chi2 = with(temp, sum(pearson^2))
temp_dof = 4 - (2 + 1)  # dof = J - (p + 1): J = 4 patterns, p = 2 covariates

# calculate p-value
pchisq(pearson_chi2, temp_dof, lower.tail = FALSE)

The resulting p-value is 0.241941, the same as Stata's.
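
As a cross-check (a sketch, assuming the temp data frame built above), you can refit the model on the grouped data with a two-column successes/failures response; its Pearson residuals are exactly the per-pattern residuals used above, so R's built-in extractors should reproduce the statistic and its single degree of freedom directly:

# grouped-binomial refit: the response is cbind(successes, failures) per pattern
logit_grp = glm(cbind(y, m - y) ~ x1 + x2, data = temp, family = binomial)
sum(residuals(logit_grp, type = "pearson")^2)  # should match pearson_chi2
df.residual(logit_grp)                         # 4 - 3 = 1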

To calculate the AUC, we should first expand the covariate patterns back to the "original" data, then use the expanded data to get the AUC. Note that the frequency table contains 392 zeros and 88 ones. My code follows:

# expand observations: 392 zeros followed by 88 ones, as in the frequency table
y_expand = c(rep(0, 392), rep(1, 88))

# attach fitted probabilities and frequencies to the model frame
logit_mf = model.frame(logit)
logit_mf$pred = fitted(logit)
logit_mf$freq = test_data$freq

# expand predictions: repeat each pattern's fitted value by its frequency
yhat_expand = with(logit_mf, rep(pred, freq))

library(verification)
roc.area(y_expand, yhat_expand)$A

AUC = 0.6760, the same as Stata's.
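
Alternatively (a sketch reusing the pred and freq columns attached to logit_mf above), the AUC can be computed straight from the frequency table, without expansion, as the tie-adjusted probability that a randomly chosen "1" has a higher fitted value than a randomly chosen "0":

# AUC from the frequency table: P(pred_pos > pred_neg) + 0.5 * P(tie)
pos = logit_mf[logit_mf$y == 1, ]  # the 4 patterns with outcome 1
neg = logit_mf[logit_mf$y == 0, ]  # the 4 patterns with outcome 0
num = 0
for (i in seq_len(nrow(pos))) {
  gt  = sum(neg$freq * (pos$pred[i] > neg$pred))   # concordant pairs
  tie = sum(neg$freq * (pos$pred[i] == neg$pred))  # ties count half
  num = num + pos$freq[i] * (gt + 0.5 * tie)
}
num / (sum(pos$freq) * sum(neg$freq))  # should also give ~0.676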