How to most efficiently repeat the same merge on different columns in R (preferably data.table)

111 views Asked by At

I have program that needs to repeat a merge which uses the same column from Table A over and over, but changes the column from table B. Doing it with a loop that uses the data.table::merge command repeatedly is quite slow, so I'm wondering if there's a faster way to do it.

For an example:

Imagine a table "A" of fruits, with two columns, "fruit_name" and "price."

And another table "B" of baskets, with 3 columns, "fruit_1", "fruit_2", and "fruit_3"

I would like to get the sum price of the fruits in table B for each row. I could do 3 merges, all using "fruit_name" from the fruit table, and the first using "fruit_1", second "fruit_2", and 3rd "fruit_3" from the baskets table.

Merges take a long time, though. Is there anyway to do this more efficiently computationally? Here is the code laying out the example and getting the desired result, but doing it too slowly.

I generally use data.table and it's generally efficient, so I prefer a data.table solution but open to others if they are faster than 3 merges.

Also, I imagine I could get the data in long format and do one merge, ideally I'd avoid that,because my data makes a lot of sense in wide format, comes in wide, and needs to be exported wide. But if it's definitely the best practice then I guess that's good to know.

Thanks all for your time!

Edit: I chose the answer that I did because for my use case, at least, its limitations are not a problem. And, it is very, very fast.

The more upvoted answer takes a bit longer to do the initial melt step than the chosen one takes to do the whole task. So, for speed alone, I think the chosen answer is best, if it works for you.

If it doesn't, then the most upvoted answer is probably good!

library(data.table)


fruits <- data.table(fruit_name = c('orange', 'apple', 'pear', 'kiwi', 'blueberry')
                     , price = c(1, 1.531, 2.1, 2.25, 3.03)
                     )

baskets <- data.table(fruit_1 = c('orange', 'apple', 'apple', 'pear')
                      ,fruit_2 = c('apple', 'pear', 'kiwi', 'kiwi')
                      ,fruit_3 = c('pear', 'kiwi', 'blueberry', 'blueberry'))

result <- copy (baskets)


result <- merge(result, fruits, by.x = 'fruit_1', by.y = 'fruit_name')
setnames(result, 'price', 'price_1')

result <- merge(result, fruits, by.x = 'fruit_2', by.y = 'fruit_name')
setnames(result, 'price', 'price_2')

result <- merge(result, fruits, by.x = 'fruit_3', by.y = 'fruit_name')
setnames(result, 'price', 'price_3')

result[,price_total := price_1 + price_2 + price_3]
3

There are 3 answers

1
B. Christian Kamgang On BEST ANSWER

Using the function chmatch from data.table, you can solve your problem as follow. If it is still slow, replace chmatch with the function collapse::fmatch and check if there is some performance improvement.

baskets[, price_total := rowSums(sapply(.SD, \(x) fruits$price[chmatch(x, fruits$fruit_name)])), .SDcols=patterns("fruit")]

#    fruit_1 fruit_2   fruit_3 price_total
#     <char>  <char>    <char>       <num>
# 1:  orange   apple      pear       4.631
# 2:   apple    pear      kiwi       5.881
# 3:   apple    kiwi blueberry       6.811
# 4:    pear    kiwi blueberry       7.380
1
Ian Gow On

I am not very familiar with data.table. But the following might give you some ideas (I am using dtplyr so that dplyr works on data.table).

I think that pivoting to a long format and summarising should give you good performance. Packages like DuckDB and rpolars can beat even data.table for performance for this kind of thing (though data.table should be fast).

Note that I am inferring from the summing across rows that the rows represent "baskets".

Getting back to wide form is not difficult.

library(data.table)
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)
library(tidyr)

fruits <- data.table(fruit_name = c('orange', 'apple', 'pear', 'kiwi', 
                                    'blueberry'), 
                     price = c(1, 1.531, 2.1, 2.25, 3.03)
)

baskets <- data.table(fruit_1 = c('orange', 'apple', 'apple', 'pear')
                      ,fruit_2 = c('apple', 'pear', 'kiwi', 'kiwi')
                      ,fruit_3 = c('pear', 'kiwi', 'blueberry', 'blueberry'))

baskets_long <-
    baskets |>
    mutate(basket_id = row_number()) |>
    pivot_longer(cols = -basket_id, 
                 values_to = "fruit_name") |>
    separate_wider_delim(name, "_", 
                         names = c("category", "pos")) 

merged <-
    baskets_long |>
    inner_join(fruits, by = "fruit_name")

vals <-
    merged |> 
    group_by(basket_id) |>
    summarize(price_total = sum(price))

baskets_long |> 
    pivot_wider(names_from = c("category", "pos"), 
                values_from = "fruit_name", 
                names_glue = "{category}_{pos}") |>
    inner_join(vals)
#> Joining with `by = join_by(basket_id)`
#> # A tibble: 4 × 5
#>   basket_id fruit_1 fruit_2 fruit_3   price_total
#>       <int> <chr>   <chr>   <chr>           <dbl>
#> 1         1 orange  apple   pear             4.63
#> 2         2 apple   pear    kiwi             5.88
#> 3         3 apple   kiwi    blueberry        6.81
#> 4         4 pear    kiwi    blueberry        7.38

Created on 2024-01-11 with reprex v2.0.2

library(data.table)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

fruits <- data.table(fruit_name = c('orange', 'apple',
                                    'pear', 'kiwi', 'blueberry')
                     , price = c(1, 1.531, 2.1, 2.25, 3.03)
)

get_basket <- function(i) {
  tibble(id = i, 
         fruit = sample(fruits$fruit_name, basket_size, replace = TRUE))
}

basket_size <- 30L
num_baskets <- 100000L
baskets_long <- 
  bind_rows(lapply(1:num_baskets, get_basket)) |>
  group_by(id) |>
  mutate(item_no = row_number()) |>
  ungroup()
  

baskets_tbl <- pivot_wider(baskets_long, id_cols = "id", 
                           values_from = "fruit",
                           names_from = "item_no",
                           names_glue = "fruit_{item_no}")

baskets <- data.table(baskets_tbl)
2
Onyambu On

directly use data.table:

baskets[, melt(.SD, measure.vars = patterns('fruit'),value.name = 'fruit_name')
    ][fruits, on = 'fruit_name'
    ][, price := sum(price), by = rowid(variable)
    ][, dcast(.SD, rowid(variable)+price~variable, value.var = 'fruit_name')]

   variable price fruit_1 fruit_2   fruit_3
1:        1 4.631  orange   apple      pear
2:        2 5.881   apple    pear      kiwi
3:        3 6.811   apple    kiwi blueberry
4:        4 7.380    pear    kiwi blueberry