I was implement logistic regression to the following data frame and got a reasonable (the same as using STATA) results. But the Pearson chi square and degree of freedom I got from R is very different from STATA, which in turn gave me an very small p-value. And I cannot get the area under ROC curve. Could anyone help me to find out why residual() does not work on glm() with priori weights, and how to deal with area under ROC curve?
Following is my code and output.
1. Data
Here is my data frame test_data
, y is outcome, x1 and x2 are covariates:
y x1 x2 freq 0 No 0 268 0 No 1 14 0 Yes 0 109 0 Yes 1 1 1 No 0 31 1 No 1 6 1 Yes 0 45 1 Yes 1 6
I generated this data frame from the original data by counting occurrence of each covariate pattern, and store the number in new variable freq
.
2. GLM Model
Then I did the logistic regression as:
logit=glm(y~x1+x2, data=test_data, family=binomial, weights=freq)
Output shows:
Deviance Residuals: 1 2 3 4 5 6 7 8
-7.501 -3.536 -8.818 -1.521 11.957 3.501 10.409 2.129Coefficients: Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.2010 0.1892 -11.632 < 2e-16 ***x1 1.3538 0.2516 5.381 7.39e-08 ***
x2 1.6261 0.4313 3.770 0.000163 ***
Signif. codes: 0 '' 0.001 '' 0.01 '' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 457.35 on 7 degrees of freedom
Residual deviance: 416.96 on 5 degrees of freedom AIC: 422.96
Number of Fisher Scoring iterations: 5
Coefficients are the same as STATA.
3. Chi Square Statistics
when I tried to calculate the Pearson chi square:
pearson_chisq = sum(residuals(logit, type = "pearson", weights=test_data$freq)^2)
I got 488, instead of 1.3 given by STATA. Also the DOF generated by R is chisq_dof = df.residuals(logit)
=5, instead of 1. So I got an extremely small p-value~e^-100.
4. Discrimination
Then I calculated the area under ROC curve as:
library(verification)
logit_mf = model.frame(logit)
roc.area(logit_mf $y, fitted(logit))$A
The output is:
[1] 0.5 Warning message:
In wilcox.test.default(pred[obs == 1], pred[obs == 0], alternative = "great") : cannot compute exact p-value with ties
Thanks!
I figured out how to solve this problem eventually. The data set I used above should be summarised to covariate patterns. Then use the definition of Pearson chi square to do calculation. I provide the R code as follows:
The result of p-value is 0.241941, which is same as STATA.
In order to calculate AUC, we should first expand the covariate pattern to the "original" data, then use the "expanded" data to get AUC. Noted we have 392 "0" and 88 "1" in the frequency table. My code follows:
AUC=0.6760, same as that of STATA.