Possible to force logistic regression or other classifier through specific probability?


I have a data set with a binary variable [Yes/No] and a continuous variable (X). I'm trying to build a model that classifies [Yes/No] from X.

From my data set, when X = 0.5, 48% of the observations are Yes. However, I know the true probability of Yes should be 50% when X = 0.5. When I fit a logistic regression, the predicted P[Yes] at X = 0.5 is not 0.5.

How can I correct this? I guess all probabilities will be slightly underestimated if the curve does not pass through the correct point.

Is it correct just to add a bunch of observations in my sample to adjust the proportion?

It does not have to be logistic regression; LDA, QDA, etc. are also of interest.

I have searched Stack Overflow, but only found topics regarding linear regression.

There are 2 answers

Ben Bolker (best answer):

I believe that in R (assuming you're using glm from base R) you just need

glm(y ~ I(x - 0.5) - 1, data = your_data, family = binomial)

The I(x - 0.5) term recenters the covariate at 0.5, and the -1 suppresses the intercept (intercept = 0 at x = 0.5 -> probability = 0.5 at x = 0.5).

For example:

set.seed(101)
dd <- data.frame(x=runif(100,0.5,1),y=rbinom(100,size=1,prob=0.7))
m1 <- glm(y~I(x-0.5)-1,data=dd,family=binomial)
predict(m1,type="response",newdata=data.frame(x=0.5)) ## 0.5
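To see what the constraint buys you, one can compare this against an unconstrained fit on the same simulated data (a self-contained sketch restating the example above; the model names m_con and m_free are mine):

```r
set.seed(101)
dd <- data.frame(x = runif(100, 0.5, 1),
                 y = rbinom(100, size = 1, prob = 0.7))

# Constrained fit: recenter at x = 0.5 and suppress the intercept,
# which forces the fitted probability at x = 0.5 to be exactly 0.5
m_con <- glm(y ~ I(x - 0.5) - 1, data = dd, family = binomial)

# Unconstrained fit: the intercept is estimated freely
m_free <- glm(y ~ x, data = dd, family = binomial)

p_con  <- predict(m_con,  newdata = data.frame(x = 0.5), type = "response")
p_free <- predict(m_free, newdata = data.frame(x = 0.5), type = "response")

p_con   # exactly 0.5 by construction (linear predictor is 0 at x = 0.5)
p_free  # generally not 0.5
```

The constrained prediction is 0.5 regardless of the estimated slope, because at x = 0.5 the sole term I(x - 0.5) vanishes and there is no intercept to shift it.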
davechilders:

The OP wrote:

How can I correct this? I guess all probabilities will be slightly underestimated if the curve does not pass through the correct point.

This is not true. It is perfectly possible to underestimate some values (like the intercept) and overestimate others.

An example following your situation:

The true probabilities:

set.seed(444)

true_prob <- function(x) {

  # linear predictor on the logit scale
  lp <- x - 0.5

  # invert the logit to get the true probability
  p <- 1 / (1 + exp(-lp))
  p

}

true_prob(x = 0.5)
[1] 0.5

But if you simulate data and fit a model, the intercept could be underestimated and other values overestimated:

n <- 100
# simulated predictor
x <- runif(n, 0, 1)
probs <- true_prob(x)

# simulated binary response
y <- as.numeric(runif(n) < probs)

Now fit a model and compare the true probabilities with the fitted ones:

m <- glm(y ~ x, family = binomial)

> true_prob(0.5)
[1] 0.5
> predict(m, newdata = data.frame(x = 0.5), type = "response")
       1 
0.479328 
> true_prob(2)
[1] 0.8175745
> predict(m, newdata = data.frame(x = 2), type = "response")
        1 
0.8665702 

So in this example, the model underestimates at x = 0.5 and overestimates at x = 2.
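Tying the two answers together: if the constraint from the accepted answer is imposed on this same simulation, the fitted probability at x = 0.5 is forced back to exactly 0.5 (a self-contained sketch restating the simulation above; m_con and p_hat are my names):

```r
set.seed(444)

# True probabilities, as defined above
true_prob <- function(x) 1 / (1 + exp(-(x - 0.5)))

# Simulate predictor and binary response
n <- 100
x <- runif(n, 0, 1)
y <- as.numeric(runif(n) < true_prob(x))

# Constrained fit as in the accepted answer: recentered covariate, no intercept
m_con <- glm(y ~ I(x - 0.5) - 1, family = binomial)

p_hat <- predict(m_con, newdata = data.frame(x = 0.5), type = "response")
p_hat  # exactly 0.5 by construction
```

The estimate at x = 0.5 no longer depends on sampling noise in the intercept, though predictions elsewhere still carry the usual estimation error in the slope.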