Background
I have a computationally expensive (and SLOW) function which is called from within another function which is part of a dplyr pipeline:
    dat %>%
      mutate_at(.vars = vars(dplyr::intersect(starts_with("locale"), ends_with("last_name"))),
                .funs = funs(native_last_name_alpha(., Type))) %>%
      {.} -> dat
The function does one of two things, depending on whether the variable Type (a character string) matches "male" or "female". If there is a match, native_last_name_alpha runs the computationally expensive and slow function, which does some other work. If there is no match, native_last_name_alpha returns NA. Because this is vectorised, I'm currently using if_else and case_when to determine what should happen, for example:
    native_last_name_alpha <- function(locale, type) {
      case_when(
        type == "male" ~ stringi::stri_trans_toupper(
          fake_single_alpha(locale = locale,
                            request = "last_name_male",
                            provider = "faker.providers.person")
        ),
        type == "female" ~ stringi::stri_trans_toupper(
          fake_single_alpha(locale = locale,
                            request = "last_name_female",
                            provider = "faker.providers.person")
        ),
        TRUE ~ NA_character_
      )
    }
The problem is that the expensive function runs regardless of whether the condition evaluates to TRUE or FALSE, and this makes my script extremely slow to run.
Digging deeper into if_else, ifelse, and case_when
I understand that the vectorised if/else functions if_else and ifelse (and case_when) don't work like a traditional if...else statement: all parts of the statement are evaluated over the full vector, and then the condition is used to splice together the results to be returned. For example, this code produces the following output and warnings:
    v <- c(-100, -10, 10, 100)
    ifelse(v > 0, log10(v), log10(-v))

    [1] 2 1 1 2
    Warning messages:
    1: In ifelse(v > 0, log10(v), log10(-v)) : NaNs produced
    2: In ifelse(v > 0, log10(v), log10(-v)) : NaNs produced
Both the return value if the condition is true, and the return value if the condition is false, are evaluated, and the condition is used to splice together the vector of results.
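The same effect can be shown with a stand-in for the expensive function. This is only a minimal base R sketch: `slow_upper` and the `processed` counter are hypothetical names used for illustration, not part of my real code. The counter records how many elements the function was asked to handle.

```r
# Running total of elements passed to the stand-in "expensive" function.
processed <- 0

# Hypothetical stand-in for a slow function: it records how many
# elements it received and returns them in upper case.
slow_upper <- function(x) {
  processed <<- processed + length(x)
  toupper(x)
}

type <- c("male", "other", "other", "other")
name <- c("smith", "jones", "brown", "green")

# Only the first element matches, but the `yes` argument is evaluated
# on the whole vector before ifelse() picks out the values it needs.
result <- ifelse(type == "male", slow_upper(name), NA_character_)

result     # "SMITH" NA NA NA
processed  # 4, although only 1 element satisfied the condition
```

Three of the four results computed by the stand-in function are thrown away, which is exactly the wasted work described above.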
Consequently, my expensive and slow function is running many more times than actually required.
How can I avoid this?
What I'd like
I'm seeking alternative vectorised implementations of if_else and case_when that only evaluate the result-if-true for the elements where the condition is true.
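To make the requirement concrete, the semantics I mean look roughly like the sketch below. It is base R only and rests on two assumptions: the expensive function can be called on a subset of its input, and a single fill value is acceptable for the non-matching positions. The names `if_else_lazy`, `slow_upper` and `processed` are hypothetical, and this is not a drop-in replacement for if_else or case_when.

```r
# Hypothetical helper: call `fn` only on the elements of `x` where
# `condition` is TRUE, and fill every other position with `otherwise`.
if_else_lazy <- function(condition, x, fn, otherwise = NA_character_) {
  out <- rep(otherwise, length(x))
  hit <- which(condition)  # which() ignores NA conditions
  if (length(hit) > 0) {
    out[hit] <- fn(x[hit])
  }
  out
}

# Stand-in for the expensive function, counting the elements it handles.
processed <- 0
slow_upper <- function(x) {
  processed <<- processed + length(x)
  toupper(x)
}

type <- c("male", "other", NA, "male")
name <- c("smith", "jones", "brown", "green")

lazy_result <- if_else_lazy(type == "male", name, slow_upper)

lazy_result  # "SMITH" NA NA "GREEN"
processed    # 2: only the matching elements reached slow_upper()
```

What I can't see is how to get this behaviour while keeping the if_else/case_when style of interface, where the result-if-true is written as an ordinary expression rather than passed as a function.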
What I've tried so far
I've tried writing my own vectorised implementations of if_else/ifelse, but with no success. I've also experimented with non-standard evaluation, but I don't know enough to make this work. I'm guessing that part of the solution might be getting if_else to return an unevaluated expression that I then evaluate later at an appropriate time (kind of like a freeze-dried function call). But no joy so far.
Is there an easy way to do this that I've missed? Or can someone offer some hints on an implementation? Thanks!