Is there a tidyverse/tidymodels (or base R) way to compute binary classification metrics by adjusting the threshold for a specific positive percentile?
The tidymodels guide suggests building a data frame of predicted probabilities that contains the positive-class probability (.pred_1) alongside the actual class (Day90):
> rf_fit %>% predict(test, type="prob") %>% bind_cols(test %>% select(Day90))
# A tibble: 31,586 × 3
.pred_1 .pred_0 Day90
<dbl> <dbl> <fct>
1 0.296 0.704 0
2 0.296 0.704 0
3 0.136 0.864 0
4 0.0690 0.931 0
5 0.0882 0.912 0
6 0.0948 0.905 0
7 0.157 0.843 0
8 0.0572 0.943 0
9 0.108 0.892 0
10 0.0466 0.953 0
# ℹ 31,576 more rows
# ℹ Use `print(n = ...)` to see more rows
type="quantile" looks promising, but it is not available for parsnip's rand_forest().
Ideally there would be a function that takes a target positive percentile, say 20%, and finds a probability threshold k such that about 20% of observations are predicted positive. I could sort the probabilities and run a linear or binary search for k, but I'm sure this is already implemented in a more robust way. dplyr::percent_rank() also seems promising.
Use the quantile function:
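A minimal base R sketch of that idea follows. It is not taken from the original answer: the preds data frame is simulated stand-in data, and target_rate and .pred_class are names chosen here for illustration. The threshold k is the (1 - target_rate) quantile of .pred_1, so roughly target_rate of the rows end up above it. With heavily tied probabilities, as random forests often produce, the realised share can drift from the target.

```r
# Simulated stand-in for the prediction tibble (hypothetical data, not the asker's).
set.seed(42)
preds <- data.frame(.pred_1 = runif(1000))
preds$Day90 <- factor(rbinom(1000, 1, preds$.pred_1), levels = c(1, 0))

target_rate <- 0.20  # desired share of predicted positives (illustrative name)

# Threshold k: the (1 - target_rate) quantile of the positive-class probabilities.
k <- quantile(preds$.pred_1, probs = 1 - target_rate, names = FALSE)

# Predict positive when the probability exceeds k; level "1" comes first.
preds$.pred_class <- factor(ifelse(preds$.pred_1 > k, 1, 0), levels = c(1, 0))

# Share predicted positive, which should be close to target_rate.
mean(preds$.pred_class == 1)

# Confusion table at the chosen threshold.
table(predicted = preds$.pred_class, actual = preds$Day90)
```

Because .pred_class is a factor with the positive level first, it should be usable as the estimate argument of yardstick metric functions such as yardstick::precision() or yardstick::recall(), with Day90 as truth, assuming Day90 has the same level order.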