The setup
I am using R and the Tidyverse, and I am writing pure functions, Tidyverse style.
I have a relational database where the IDs that link the tables are, unfortunately, written in different formats, for example "Fries, french" versus "French fries". To solve this problem, I want a function that standardizes the names, so that I can use it like this:
tibble_a_with_ids_written_well <-
  tibble_a_with_IDs_written_incorrectly |>
  mutate(id = id |> name_standardizer())
Afterwards I'd perform a join:
tibble_a_with_ids_written_well |>
  left_join(tibble_b, by = "id")
The name_standardizer function contains a hard-coded tibble with 2 columns that I seek to manually update every time I encounter a new way of writing something down. With that in mind, here's how I am writing the function:
name_standardizer <- function(incoming_name){
  hardcoded_dictionary <-
    tribble(~incorrect_name, ~correct_name,
            "Fries, french", "French fries",
            "Hamborgar", "Hamburger")
  hardcoded_dictionary |>
    filter(
      incoming_name == incorrect_name |
        incoming_name == correct_name) |>
    pull(correct_name)
}
The problem
This works well so far, but the instant I use this function with another database I get an error. Here's an example of a third tibble:
tibble_c_with_ids_written_incorrectly <-
  tribble(~id, ~health_rating,
          "Burger", 3.8)
I can then update the function's dictionary:
hardcoded_dictionary <-
  tribble(~incorrect_name, ~correct_name,
          "Fries, french", "French fries",
          "Hamborgar", "Hamburger",
          "Burger", NA) |> # NAs and fill mean I write down the correct name less.
  fill(correct_name) # This reduces human error.
Then, if I run tibble_c_with_ids_written_incorrectly |> mutate(id = id |> name_standardizer()), everything works well. However, with this new name_standardizer dictionary, I can no longer standardize tibble_a cleanly. Here's the warning I get:
> tibble_a_with_ids_written_well <-
+ tibble_a_with_IDs_written_incorrectly |>
+ mutate(id = id |> name_standardizer())
Warning message:
There was 1 warning in `mutate()`.
ℹ In argument: `id = name_standardizer(id)`.
Caused by warning:
! There were 2 warnings in `filter()`.
The first warning was:
ℹ In argument: `incoming_name == incorrect_name | incoming_name == correct_name`.
Caused by warning in `incoming_name == incorrect_name`:
! longer object length is not a multiple of shorter object length
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning.
My intuition is that I am not dealing successfully with vectorized functions. There is something that I am missing here.
What can I change in my function so that I can have a hardcoded dictionary and be able to use my function in a versatile way with many different tibbles?
Reproducible example
Here's my code so that you can have a reproducible example:
library(tidyverse)

tibble_a_with_IDs_written_incorrectly <-
  tribble(~id, ~price,
          "Fries, french", 2,
          "Hamborgar", 7)

name_standardizer <- function(incoming_name){
  hardcoded_dictionary <-
    tribble(~incorrect_name, ~correct_name,
            "Fries, french", "French fries",
            "Hamborgar", "Hamburger",
            "Burger", NA) |> # NAs and fill mean I write down the correct name less.
    fill(correct_name) # This reduces human error.
  hardcoded_dictionary |>
    filter(
      incoming_name == incorrect_name |
        incoming_name == correct_name) |>
    pull(correct_name)
}

tibble_a_with_ids_written_well <-
  tibble_a_with_IDs_written_incorrectly |>
  mutate(id = id |> name_standardizer())

tibble_b <-
  tribble(~id, ~amount_in_stock,
          "French fries", 4,
          "Hamburger", 3)

tibble_a_with_ids_written_well |>
  left_join(tibble_b, by = "id")

tibble_c_with_ids_written_incorrectly <-
  tribble(~id, ~health_rating,
          "Burger", 3.8)

tibble_c_with_ids_written_incorrectly |>
  mutate(id = id |> name_standardizer())
I have tried giving the function the whole tibble, instead of using mutate on a column. That also did not work.
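== does pairwise tests of equality: when two vectors are compared, it is internally comparing the first elements with each other, then the second elements, and so on.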
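As a minimal sketch of that pairwise behaviour (the vectors left and right are made up for illustration and do not come from the question):

```r
# Two made-up character vectors of the same length.
left  <- c("a", "b", "c")
right <- c("a", "x", "c")

# Vectorised comparison: one logical result per position.
vectorised <- left == right

# The same comparison spelled out element by element.
spelled_out <- c(left[1] == right[1],
                 left[2] == right[2],
                 left[3] == right[3])

vectorised                          # TRUE FALSE TRUE
identical(vectorised, spelled_out)  # TRUE
```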
This pairwise operation works perfectly well when both the LHS and RHS are either length-1 or the same length; that is, length-8 and length-1 works, as does the reverse, but length-2 and length-3 does not. (R's sloppy recycling rules allow even multiples, so 1:2 == 1:4 does not produce a warning, though in my mind it is a really bad idea to rely on that being interpreted correctly.)
So a length-n/length-1 example looks like:
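A small sketch of both cases, using a made-up ids vector (an assumption for illustration only):

```r
# Made-up vector of ids (length 3).
ids <- c("Fries, french", "Hamborgar", "Burger")

# length-n against length-1: the single value is compared with every element.
n_vs_one <- ids == "Burger"
n_vs_one            # FALSE FALSE TRUE

# The reverse order gives the same result.
one_vs_n <- "Burger" == ids

# Even multiple: 1:2 is silently recycled to 1, 2, 1, 2 before comparing.
even_multiple <- 1:2 == 1:4
even_multiple       # TRUE TRUE FALSE FALSE
```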
In your case, however, you're effectively testing this:
where the first is length-2, the second is length-3.
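That mismatch can be reproduced outside the function. This is only a sketch: the two vectors copy the id column of tibble_a and the incorrect_name column of the three-row dictionary from the question, and recycling_warning is a name I made up.

```r
incoming_name  <- c("Fries, french", "Hamborgar")            # length 2
incorrect_name <- c("Fries, french", "Hamborgar", "Burger")  # length 3

# Capture the warning text instead of letting it print, so it can be inspected.
recycling_warning <- tryCatch(
  incoming_name == incorrect_name,
  warning = function(w) conditionMessage(w)
)
recycling_warning

# With the warning silenced, the shorter vector is recycled to
# "Fries, french", "Hamborgar", "Fries, french" before the comparison.
recycled_result <- suppressWarnings(incoming_name == incorrect_name)
recycled_result   # TRUE TRUE FALSE
```

The result has three elements, one per dictionary row, rather than one per incoming name, which is why filter() only appeared to work while the dictionary happened to have the same number of rows as the tibble.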
So in your case, I think one way you can do it is with this:
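A sketch of a join-based version, assuming incoming_name is a character vector and that each incorrect_name appears only once in the dictionary (duplicates would multiply rows in the join). The coalesce() step is my assumption for keeping names that are already correct or not in the dictionary; tibble_a below is a shortened stand-in for the question's tibble.

```r
library(dplyr)   # left_join(), mutate(), coalesce(), pull()
library(tidyr)   # fill()
library(tibble)  # tibble(), tribble()

name_standardizer <- function(incoming_name) {
  hardcoded_dictionary <-
    tribble(~incorrect_name, ~correct_name,
            "Fries, french", "French fries",
            "Hamborgar",     "Hamburger",
            "Burger",        NA) |>
    fill(correct_name)

  # One row per incoming name, so the output always has the input's length
  # and order, regardless of how many rows the dictionary has.
  tibble(incorrect_name = incoming_name) |>
    left_join(hardcoded_dictionary, by = "incorrect_name") |>
    # Names without a dictionary entry come back as NA; keep the original.
    mutate(correct_name = coalesce(correct_name, incorrect_name)) |>
    pull(correct_name)
}

tibble_a <- tribble(~id, ~price,
                    "Fries, french", 2,
                    "Hamborgar", 7)

tibble_a |>
  mutate(id = name_standardizer(id))
```

Because the join is driven by the incoming names rather than by the dictionary rows, the same function works on a two-row tibble, a one-row tibble, or anything else.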
Other ways exist, such as using match instead of a full frame/join mindset (inside your function, replacing the tibble(.) |> left_join(.) ... expression). This does a lookup/replace using match, but since a name not in the hardcoded dictionary will not match, it will return NA. To keep the original incoming_name when it doesn't match, we coalesce the matched string with the original.
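A sketch of that match-based variant, again assuming incoming_name is a character vector; the local name matched is made up for illustration:

```r
library(dplyr)   # coalesce()
library(tidyr)   # fill()
library(tibble)  # tribble()

name_standardizer <- function(incoming_name) {
  hardcoded_dictionary <-
    tribble(~incorrect_name, ~correct_name,
            "Fries, french", "French fries",
            "Hamborgar",     "Hamburger",
            "Burger",        NA) |>
    fill(correct_name)

  # match() returns, for each incoming name, the row of the dictionary where
  # it is found (or NA), so the result has the same length as the input.
  matched <- hardcoded_dictionary$correct_name[
    match(incoming_name, hardcoded_dictionary$incorrect_name)
  ]

  # Fall back to the original name where the dictionary has no entry.
  coalesce(matched, incoming_name)
}

name_standardizer(c("Fries, french", "Hamborgar"))  # "French fries" "Hamburger"
name_standardizer("Burger")                         # "Hamburger"
name_standardizer("Salad")                          # "Salad"
```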