Selecting only ordered factors in recipes for tidymodels


I need to create a recipe using the recipes package from tidymodels. In one of the steps, I need to convert ordered factors to their ordinal scores, but there seems to be no selector function that picks out all ordered factors.

I know that there is a function called all_nominal(), but it matches every factor column, whether ordered or unordered. I have also tried has_type("ordered"), but that does not work either.

Currently, I have to manually enter the column names. Is there an easier way to do this?

Below is an example of what I want to do:

library(recipes)
library(mlbench)
data("BreastCancer")

rec <- recipe(Class ~ ., BreastCancer) %>%
    # Here, I want to select all ordered nominals instead of
    # listing them by name. Should there be a function
    # all_ordinal in recipes? Or is there another way
    # to accomplish this?
    step_ordinalscore(Cl.thickness, 
                     Cell.size,
                     Cell.shape,
                     Marg.adhesion,
                     Epith.c.size)

Any help is welcome, thanks.

There is 1 answer

Answer by Julia Silge

There isn't a special selector function for ordered factors, but you can find them yourself and then pass that vector of names to step_ordinalscore().
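To see why a type check is enough to find these columns, here is a minimal base R sketch on a made-up data frame (toy and its columns are hypothetical, not from the question). is.factor() is TRUE for both kinds of factor, while is.ordered() is TRUE only for ordered ones. The last line shows the kind of score an ordered factor turns into, assuming the conversion is a plain as.numeric(), which I believe is the default for step_ordinalscore().

```r
# Hypothetical toy data: one ordered factor, one unordered factor, one numeric.
toy <- data.frame(
  size  = factor(c("small", "large", "medium"),
                 levels = c("small", "medium", "large"), ordered = TRUE),
  color = factor(c("red", "blue", "red")),
  score = c(1.5, 2.5, 3.5)
)

# is.factor() cannot tell the two kinds of factor apart; is.ordered() can.
is_fct <- vapply(toy, is.factor, logical(1))
is_ord <- vapply(toy, is.ordered, logical(1))

# Names of the ordered factor columns only.
ordered_cols <- names(toy)[is_ord]

# as.numeric() on an ordered factor returns each value's position
# among the levels (small = 1, medium = 2, large = 3).
size_scores <- as.numeric(toy$size)
```

The dplyr pipeline in the reprex below does the same lookup with select(where(is.ordered)).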

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
library(mlbench)
data("BreastCancer")

## find all the ordered factors
ordered_names <- BreastCancer %>% 
  select(where(is.ordered)) %>% 
  names()

ordered_names
#> [1] "Cl.thickness"  "Cell.size"     "Cell.shape"    "Marg.adhesion"
#> [5] "Epith.c.size"

## convert all the ordered factors to scores
rec <- recipe(Class ~ ., BreastCancer) %>%
  step_ordinalscore(all_of(ordered_names))

rec
#> Data Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor         10
#> 
#> Operations:
#> 
#> Scoring for all_of(ordered_names)

prep(rec) %>% bake(new_data = NULL)
#> # A tibble: 699 x 11
#>    Id    Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size
#>    <fct>        <dbl>     <dbl>      <dbl>         <dbl>        <dbl>
#>  1 1000…            5         1          1             1            2
#>  2 1002…            5         4          4             5            7
#>  3 1015…            3         1          1             1            2
#>  4 1016…            6         8          8             1            3
#>  5 1017…            4         1          1             3            2
#>  6 1017…            8        10         10             8            7
#>  7 1018…            1         1          1             1            2
#>  8 1018…            2         1          2             1            2
#>  9 1033…            2         1          1             1            2
#> 10 1033…            4         2          1             1            2
#> # … with 689 more rows, and 5 more variables: Bare.nuclei <fct>,
#> #   Bl.cromatin <fct>, Normal.nucleoli <fct>, Mitoses <fct>, Class <fct>

Created on 2020-10-19 by the reprex package (v0.3.0.9001)
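
Note that the reprex above dates from October 2020. More recent releases of recipes appear to ship type-specific selectors, including all_ordered_predictors(). If your installed version has it, the manual name lookup may not be needed. The following is only a sketch under that assumption, using a made-up data frame (toy, size, color and y are hypothetical names):

```r
library(recipes)

# Hypothetical toy data: an ordered predictor, an unordered predictor, an outcome.
toy <- data.frame(
  size  = factor(c("small", "large", "medium", "small"),
                 levels = c("small", "medium", "large"), ordered = TRUE),
  color = factor(c("red", "blue", "red", "green")),
  y     = c(1.2, 3.4, 2.2, 0.9)
)

# Assumption: all_ordered_predictors() exists in the installed recipes version.
rec <- recipe(y ~ ., data = toy) %>%
  step_ordinalscore(all_ordered_predictors())

# Only `size` should be converted to scores; `color` stays a factor.
scored <- bake(prep(rec), new_data = NULL)
```

If the selector is not available in your version, the select(where(is.ordered)) approach above should still work.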