I ran into two major problems when running a logistic regression.
My X variables include factor variables, such as immigrant status (immigrant, non-immigrant); my Y variable is a binary variable, low birth weight (non-lbw, lbw).
I ran the following R script (I am using the plsRglm package):
library(plsRglm)
model.plsrglm <- plsRglm(yair, xair, 3, modele="pls-glm-logistic")
1) If I do not drop all the NA values in y or x, R returns this:
summary(model.plsrglm)
Call
plsRglmmodel.default(dataY = yair, dataX = xair, nt = 6,
modele = "pls-glm-logistic")
> model.plsrglm
Number of required components:
NULL
Number of successfully computed components:
NULL
Coefficients:
NULL
Information criteria and Fit statistics:
NULL
2) If I do drop all the NA values before running the model, R gives an error:
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
So should I drop all NA values before fitting the model?
And should I convert the factor variables to numeric? If so, how should I do that, just by using as.numeric? But wouldn't that imply an ordering between non-immigrant and immigrant?
And for the Y variable, should I recode it as 0 and 1?
I have added a reproducible dataset below.
outcome c1 c2 c3 c4
1 lbw 120 yes <30 good
2 lbw 124 yes <30 good
3 lbw 125 yes <30 good
4 lbw 135 yes <30 good
5 lbw 112 yes <30 good
6 lbw 168 yes <30 good
7 lbw 147 yes 30-40 good
8 lbw 174 yes 30-40 fair
9 lbw 153 yes 30-40 fair
10 lbw 145 yes 30-40 fair
11 lbw 145 yes 30-40 fair
12 lbw 125 no >40 fair
13 lbw 125 no >40 poor
14 lbw 111 no >40 poor
15 non-lbw 80 no >40 poor
16 non-lbw 85 no >40 poor
17 non-lbw 78 yes >40 poor
18 non-lbw 67 no >40 poor
xair <- bc1997[,c("c1","c2","c3","c4")]
yair <- bc1997[,"outcome"]
model.plsrglm <- plsRglm(yair, xair, 2, modele="pls-glm-logistic")
summary(model.plsrglm)
But I got this error:
> model.plsrglm <- plsRglm(yair, xair, 2, modele="pls-glm-logistic")
____************************************************____
Family: binomial
Link function: logit
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Your 'x' terms must be numeric. Your variables "c2", "c3", and "c4" are not numeric (they are character or factor columns).
The default setting for scaleX is TRUE, so plsRglm() calls colMeans() to scale your predictors, which is not possible with factors. You can therefore either convert each column to numeric or specify scaleX=FALSE.
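One possible way to do the conversion without implying an ordering between levels is dummy (0/1) coding with base R's model.matrix(). The sketch below rebuilds the sample data from the question; the data frame name bc1997 and the choice of "lbw" as the event coded 1 are taken from, or assumed on the basis of, the question, so adjust them to your real data.

```r
# Sample data from the question (name bc1997 as used in the question).
bc1997 <- data.frame(
  outcome = c(rep("lbw", 14), rep("non-lbw", 4)),
  c1 = c(120, 124, 125, 135, 112, 168, 147, 174, 153,
         145, 145, 125, 125, 111, 80, 85, 78, 67),
  c2 = c(rep("yes", 11), rep("no", 5), "yes", "no"),
  c3 = c(rep("<30", 6), rep("30-40", 5), rep(">40", 7)),
  c4 = c(rep("good", 7), rep("fair", 5), rep("poor", 6)),
  stringsAsFactors = TRUE
)

# Drop incomplete rows before building the design matrix
# (there are none in this sample, but the real data has NAs).
bc1997 <- na.omit(bc1997)

# Dummy-code the factors: each factor with k levels becomes k - 1
# indicator columns, so no ordering between levels is implied.
# The first column (the intercept) is dropped.
xair <- model.matrix(~ c1 + c2 + c3 + c4, data = bc1997)[, -1]

# Recode the response as 0/1 (assumption: "lbw" is the event of interest).
yair <- as.integer(bc1997$outcome == "lbw")

str(xair)
table(yair)
```

The resulting xair is a fully numeric matrix and yair a 0/1 vector, so they can be passed to the original call, plsRglm(yair, xair, 2, modele="pls-glm-logistic"), without colMeans() failing on factor columns.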