I have a data set with a binary variable [Yes/No] and a continuous variable (X). I'm trying to build a model that classifies [Yes/No] from X.
In my data set, when X = 0.5, 48% of the observations are Yes. However, I know the true probability of Yes should be 50% when X = 0.5. When I fit a logistic regression, the fitted P[Yes] at X = 0.5 is not 0.5.
How can I correct this? I guess all probabilities will be slightly underestimated if the fitted curve does not pass through the correct point.
Is it valid to simply add a bunch of observations to my sample to adjust the proportion?
It does not have to be logistic regression; LDA, QDA, etc. are also of interest.
I have searched Stack Overflow, but only found topics regarding linear regression.
I believe that in R (assuming you're using `glm` from base R) you just need to recenter the covariate and drop the intercept in the model formula: the `I(x-0.5)` term recenters the covariate at 0.5, and the `-1` suppresses the intercept (intercept = 0 at `x=0.5` -> probability = 0.5 at `x=0.5`). For example:
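A minimal sketch of such a fit on simulated data, to show the constraint in action. The data, the true slope of 4, and the names `dat`, `fit`, and `p_half` are illustrative assumptions, not taken from the original post. With no intercept, the model is logit(p) = beta * (x - 0.5), so the linear predictor is 0 at `x = 0.5` regardless of the estimated slope.

```r
# Simulated data: every name and the true slope of 4 are illustrative assumptions.
set.seed(101)
n <- 1000
x <- runif(n)                            # continuous predictor on [0, 1]
p_true <- plogis(4 * (x - 0.5))          # true P(Yes) is exactly 0.5 at x = 0.5
y <- rbinom(n, size = 1, prob = p_true)  # 1 = Yes, 0 = No
dat <- data.frame(x = x, y = y)

# Recenter x at 0.5 and drop the intercept: logit(p) = beta * (x - 0.5)
fit <- glm(y ~ I(x - 0.5) - 1, family = binomial, data = dat)

# Only a slope is estimated; there is no intercept term
print(coef(fit))

# Fitted probability at x = 0.5 is plogis(0) = 0.5, whatever the slope is
p_half <- predict(fit, newdata = data.frame(x = 0.5), type = "response")
print(unname(p_half))
```

Because the constraint lives in the formula rather than in the data, this avoids padding the sample with artificial observations. Whether forcing the curve through (0.5, 0.5) is appropriate still depends on how sure you are of that prior knowledge.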