How to create a hardcoded grammar-correcting function such that you give it a char tibble column and it'll give you a corrected version?


The setup

I am using R and the Tidyverse, and I am writing pure functions, Tidyverse style.

I have a relational database where the IDs that link the tables are, unfortunately, written in different formats, for example "Fries, french" versus "French fries". To solve this problem, I want a function that standardizes the names, so that I can use it like this:

tibble_a_with_ids_written_well <-
  tibble_a_with_IDs_written_incorrectly |>
  mutate(id = id |> name_standardizer())

Afterwards I'd perform a join:

tibble_a_with_ids_written_well |>
  left_join(tibble_b, by = "id")

The name_standardizer function contains a hard-coded two-column tibble that I plan to update by hand every time I encounter a new way of writing a name. With that in mind, here's how I am writing the function:

name_standardizer <- function(incoming_name){
  hardcoded_dictionary <-
    tribble(~incorrect_name, ~correct_name,
            "Fries, french", "French fries",
            "Hamborgar", "Hamburger")
  
  hardcoded_dictionary |>
    filter(
      incoming_name == incorrect_name |
        incoming_name == correct_name) |>
    pull(correct_name)
}

The problem

This works well so far, but as soon as I extend the function to handle another tibble, I get a warning. Here's an example of a third tibble:

tibble_c_with_ids_written_incorrectly <-
  tribble(~id, ~health_rating,
          "Burger", 3.8)

I can then update the function's dictionary:

hardcoded_dictionary <-
  tribble(~incorrect_name, ~correct_name,
          "Fries, french", "French fries",
          "Hamborgar", "Hamburger",
          "Burger", NA) |> # NAs and fill mean I write down the correct name less.
  fill(correct_name) # This reduces human error.

Then, if I run tibble_c_with_ids_written_incorrectly |> mutate(id = id |> name_standardizer()), everything works well. However, with this new name_standardizer dictionary, I can no longer cleanly standardize tibble_a. Here's the warning message:

> tibble_a_with_ids_written_well <-
+   tibble_a_with_IDs_written_incorrectly |>
+   mutate(id = id |> name_standardizer())
Warning message:
There was 1 warning in `mutate()`.
ℹ In argument: `id = name_standardizer(id)`.
Caused by warning:
! There were 2 warnings in `filter()`.
The first warning was:
ℹ In argument: `incoming_name == incorrect_name | incoming_name == correct_name`.
Caused by warning in `incoming_name == incorrect_name`:
! longer object length is not a multiple of shorter object length
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning. 

My intuition is that I am not dealing successfully with vectorized functions. There is something that I am missing here.

What can I change in my function so that I can have a hardcoded dictionary and be able to use my function in a versatile way with many different tibbles?

Reproducible example

Here's my code so that you can have a reproducible example:


library(tidyverse)

tibble_a_with_IDs_written_incorrectly <-
  tribble(~id, ~price,
          "Fries, french", 2,
          "Hamborgar", 7)


name_standardizer <- function(incoming_name){
  hardcoded_dictionary <-
    tribble(~incorrect_name, ~correct_name,
            "Fries, french", "French fries",
            "Hamborgar", "Hamburger",
            "Burger", NA) |> # NAs and fill mean I write down the correct name less.
    fill(correct_name) # This reduces human error.
  
  hardcoded_dictionary |>
    filter(
      incoming_name == incorrect_name |
        incoming_name == correct_name) |>
    pull(correct_name)
}

tibble_a_with_ids_written_well <-
  tibble_a_with_IDs_written_incorrectly |>
  mutate(id = id |> name_standardizer())

tibble_b <-
  tribble(~id, ~amount_in_stock,
          "French fries", 4,
          "Hamburger", 3)

tibble_a_with_ids_written_well |>
  left_join(tibble_b, by = "id")

tibble_c_with_ids_written_incorrectly <-
  tribble(~id, ~health_rating,
          "Burger", 3.8)

tibble_c_with_ids_written_incorrectly |>
  mutate(id = id |> name_standardizer())

I have tried giving the function the whole tibble, instead of using mutate on a column. That also did not work.


There are 2 answers

Answer by r2evans

== does pairwise tests of equality, so in

1:3 == 5:7

R is effectively computing

c(1 == 5, 2 == 6, 3 == 7)

This pairwise operation works cleanly when the LHS and RHS are the same length or when one of them is length-1; that is, length-8 and length-1 works, as does the reverse, but length-2 and length-3 does not: R still returns a result, but it recycles the shorter vector and warns that the longer length is not a multiple of the shorter. (R's permissive recycling rules silently allow exact multiples, so 1:2 == 1:4 produces neither an error nor a warning, though in my mind it is a really bad idea to rely on that being interpreted correctly.)

So a length-n/length-1 example looks like:

1:3 == 2
c(1 == 2, 2 == 2, 3 == 2)

In your case, however, you're effectively testing this:

c("Fries, french", "Hamborgar") == c("Fries, french", "Hamborgar", "Burger")

where the first is length-2, the second is length-3.
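To see the mismatch concretely, here is a small base-R sketch. The vector names `incoming` and `dictionary` are made up for illustration; they stand in for the `id` column and the dictionary's `incorrect_name` column. It captures the recycling warning and contrasts `==` with `match()`, which looks up each incoming value on its own, whatever the two lengths are.

```r
incoming   <- c("Fries, french", "Hamborgar")            # hypothetical stand-in for the id column
dictionary <- c("Fries, french", "Hamborgar", "Burger")  # hypothetical stand-in for incorrect_name

# `==` recycles the shorter vector and warns, because 3 is not a multiple of 2.
# tryCatch() turns that warning into a string so it can be inspected.
recycling_warning <- tryCatch(
  incoming == dictionary,
  warning = function(w) conditionMessage(w)
)
recycling_warning

# match() returns, for each incoming value, its position in the dictionary
# (or NA when it is absent), so the result always has length(incoming) elements.
positions <- match(incoming, dictionary)
positions
```

The second form is the one that behaves well inside mutate(), because its output length always equals its input length.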

So in your case, I think one way you can do it is with this:

name_standardizer <- function(incoming_name){
  hardcoded_dictionary <-
    tribble(~incorrect_name, ~correct_name,
            "Fries, french", "French fries",
            "Hamborgar", "Hamburger",
            "Burger", NA) |> # NAs and fill mean I write down the correct name less.
    fill(correct_name) # This reduces human error.
  
  tibble(incorrect_name = incoming_name) |>
    left_join(hardcoded_dictionary, by = "incorrect_name") |> 
    mutate(correct_name = coalesce(correct_name, incorrect_name)) |> 
    pull(correct_name)
}
tibble_a_with_ids_written_well <-
  tibble_a_with_IDs_written_incorrectly |>
  mutate(id = id |> name_standardizer())
tibble_a_with_ids_written_well
# # A tibble: 2 × 2
#   id           price
#   <chr>        <dbl>
# 1 French fries     2
# 2 Hamburger        7

Other approaches exist, such as using match() instead of building a tibble and joining:

coalesce(
  hardcoded_dictionary$correct_name[ match(incoming_name, hardcoded_dictionary$incorrect_name)],
  incoming_name
)

(This goes inside your function, replacing the tibble(.) |> left_join(.) ... expression.) It does a lookup/replace using match. A name that is not in the incorrect_name column of the hardcoded dictionary does not match, so the lookup returns NA for it. To keep the original incoming_name in that case, we coalesce the looked-up string with the original.
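For completeness, here is a sketch of what the whole function might look like with the match-based lookup dropped in. The name `name_standardizer_match` is only used here to keep it apart from the join-based version above, and the packages are loaded individually rather than through library(tidyverse).

```r
library(tibble)  # tribble()
library(dplyr)   # coalesce()
library(tidyr)   # fill()

# Hypothetical name, chosen to distinguish this from the join-based version.
name_standardizer_match <- function(incoming_name) {
  hardcoded_dictionary <-
    tribble(~incorrect_name, ~correct_name,
            "Fries, french", "French fries",
            "Hamborgar", "Hamburger",
            "Burger", NA) |>
    fill(correct_name)  # "Burger" inherits "Hamburger" from the row above

  # Position of each incoming name in the dictionary, NA when it is not listed.
  idx <- match(incoming_name, hardcoded_dictionary$incorrect_name)

  # Use the looked-up name where there is one; otherwise keep the input as-is.
  coalesce(hardcoded_dictionary$correct_name[idx], incoming_name)
}

fixed_a   <- name_standardizer_match(c("Fries, french", "Hamborgar"))
fixed_c   <- name_standardizer_match("Burger")
untouched <- name_standardizer_match(c("French fries", "Pizza"))
```

Names that are already correct, or that the dictionary has never seen (such as "Pizza" here), pass through unchanged, so the function can be applied to any of the tibbles.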

Answer by Onyambu

Consider writing your function as follows:

name_standardizer <- function(incoming_name) {
  hardcoded_dictionary <-
    tribble(~incorrect_name, ~correct_name,
            "Fries, french", "French fries",
            "Hamborgar", "Hamburger",
            "Burger", NA) |> 
    fill(correct_name)
  recode(incoming_name, !!!deframe(hardcoded_dictionary))
}

Now run:

tibble_a_with_IDs_written_incorrectly %>%
   mutate(id = id |> name_standardizer())

# A tibble: 2 × 2
  id           price
  <chr>        <dbl>
1 French fries     2
2 Hamburger        7

tibble_b  %>%
  mutate(id = id |> name_standardizer())

# A tibble: 2 × 2
  id           amount_in_stock
  <chr>                  <dbl>
1 French fries               4
2 Hamburger                  3


tibble_c_with_ids_written_incorrectly %>%
   mutate(id = id |> name_standardizer())

# A tibble: 1 × 2
  id        health_rating
  <chr>             <dbl>
1 Hamburger           3.8
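In case the two helpers in that one-liner are unfamiliar, the sketch below unpacks them on a smaller dictionary. The object names `dictionary`, `lookup`, `x` and `recoded` are illustrative only.

```r
library(tibble)  # tribble(), deframe()
library(dplyr)   # recode()

dictionary <- tribble(~incorrect_name, ~correct_name,
                      "Fries, french", "French fries",
                      "Hamborgar", "Hamburger")

# deframe() turns a two-column tibble into a named vector:
# names come from the first column, values from the second.
lookup <- deframe(dictionary)
lookup

# `!!!lookup` splices the named vector into recode() as old = new pairs,
# so the call below is equivalent to
# recode(x, "Fries, french" = "French fries", "Hamborgar" = "Hamburger").
x <- c("Hamborgar", "Pizza", "Fries, french")
recoded <- recode(x, !!!lookup)
recoded
```

For character input, recode() leaves values without a matching name unchanged, which is why already-correct IDs such as those in tibble_b survive the call.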