tidymodels not treating first factor level as positive class


I am having an issue with tidymodels that I can't seem to figure out. I'm not sure whether this is intended behavior or a bug, but either way I would appreciate any help!

I am building a logistic regression prediction model with a two-level factor as the outcome, and per tidymodels convention have set the "positive class" as the first level.

The base R stats::glm() assumes exactly the opposite: that the "positive class" is the second level, and the "reference" is the first level.

With that in mind, I anticipated that fitting the same model with a tidymodels workflow and with stats::glm() would give estimated coefficients of similar magnitude but opposite sign. However, it seems that in reality tidymodels behaves like stats::glm() and treats the second level as the positive class.

library(tidymodels)

#build model to predict "manual" (am == 1)
#Positive class is first level of factors per tidymodels convention
df <- 
  mtcars %>% 
  as_tibble() %>% 
  mutate(am = factor(am, levels = c("1", "0")))

#tidymodels
recipe <- recipe(df) %>% 
  update_role("am", new_role = "outcome") %>% 
  update_role("mpg", new_role = "predictor")

glm_model <- 
  logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

glm_wf <- 
  workflow() %>% 
  add_recipe(recipe) %>% 
  add_model(glm_model)

glm_fit <-
  glm_wf %>%
  fit(df)

glm_fit %>%
  extract_fit_parsnip() %>%
  tidy(exponentiate = TRUE)

# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)  738.        2.35       2.81 0.00498
2 mpg            0.736     0.115     -2.67 0.00751

#base R
glm(am ~ mpg,
    family = "binomial",
    data = df) %>% 
  tidy(exponentiate = TRUE)

# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)  738.        2.35       2.81 0.00498
2 mpg            0.736     0.115     -2.67 0.00751

#base R (treats second level as positive class) and tidymodels (treats first level as positive class) have the same output!

Any ideas? This is causing a lot of havoc when I try to report ORs and then use yardstick for performance assessment (yardstick assumes positive class is first). Thanks so much for the help, loving tidymodels overall.
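
For reference, the glm() side of this can be checked in base R alone. The sketch below assumes, as documented in ?glm, that a factor response treats its first level as failure and the other level as success, so the fitted probabilities are for the second level ("0" here). The names p_second and or_first are illustrative only, and taking the reciprocal of the odds ratio is just one possible way to report the OR for the first level, not necessarily what tidymodels intends.

```r
# Base R only: check which factor level glm() models as the event.
# Same releveling as above: "1" (manual) is the first level.
df <- transform(mtcars, am = factor(am, levels = c("1", "0")))

fit <- glm(am ~ mpg, family = binomial, data = df)

# With a factor response, glm() models P(response is not the first level),
# so these are fitted probabilities of am == "0".
p_second <- predict(fit, type = "response")

# Sanity check: in a logistic model with an intercept, the mean fitted
# probability matches the observed share of the modeled level.
mean(p_second)
mean(df$am == "0")

# Odds ratios for the first level ("1") are the reciprocals of the
# fitted ones, i.e. exp(-beta) instead of exp(beta).
or_first <- exp(-coef(fit))
or_first
```

If the tidymodels fit simply passes the factor through to glm(), which the identical output above suggests, the same reciprocal would apply to its coefficients.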
