How to merge two dataframes on two variables - the first an exact match on factor variable, the second a fuzzy match for numeric variable

Question

How to merge two dataframes on two variables - the first an exact match on factor variable, the second a fuzzy match for numeric variable

61 views Asked by user22746157 At 15 October 2023 at 22:15

Here are the dataframes (do not edit this portion)

library(dplyr)
set.seed(123)
id <- rep(c("A", "B", "C"), each = 5)
score <- sample(1:50, 15)
label <- paste(sample(LETTERS, 15 * 5, replace = TRUE), collapse = "")
label <- substring(label, seq(1, 71, by = 5), seq(5, 75, by = 5))
df1 <- data.frame(id, score, label)
df1 <- df1[order(df1$id, df1$score), ]
row.names(df1) <- 1:nrow(df1)
df1

   id score label
1   A     3 JMGII
2   A    14 KGULO
3   A    15 ISDNQ
4   A    31 CHZGJ
5   A    42 JWUGU
6   B    25 OZPTF
7   B    26 KHVVG
8   B    37 LMRAY
9   B    43 FYBEH
10  B    48 YFUOI
11  C     5 BDMEV
12  C     9 PLYNC
13  C    27 PQVRQ
14  C    28 NYWCH
15  C    40 SYTVY

id <- rep(c("A", "B", "C"), each = 3)
score <- sample(1:50, 9)
df2 <- data.frame(id, score)
df2 <- df2[order(df2$id, df2$score), ]
df2 <- df2 %>% group_by(id) %>% mutate(order = row_number())
row.names(df2) <- 1:nrow(df2)
df2

# A tibble: 9 × 3
# Groups:   id [3]
  id    score order
* <chr> <int> <int>
1 A        14     1
2 A        29     2
3 A        32     3
4 B         3     1
5 B         7     2
6 B        23     3
7 C        15     1
8 C        21     2
9 C        37     3

I have two identically named columns in two dataframes (df1 and df2).

1st column is 'ID' a factor variable. 2nd column is 'score' a numeric variable.

I want to do an inner join on two conditions: 1) match on ID exactly, 2) for each score on df2 join all observations that have the closest score on df1 (to a difference limit of 5) for that ID.

For the 2nd condition, this means that for each row of df2 it should join with rows of df1 with the lowest difference in score to it, up till a difference of 5.

Example: For the first row of df2, it should only join with row 2 of df1 (difference of 0 as both are 14), and not join row 3 of df1 (difference of 1).

Notes:

I tried various packages like fuzzyjoin, but I can't seem to implement the fuzzy match portion simultaneously, with max_dist = 2, distance_col = "distance".

I've scoured through stackoverflow solutions for similar topics, but none worked so far - they didn't have the same mix of 1 numeric fuzzy and 1 factor exact either..

Original Q&A

There are 2 answers

**neilfws** · Answer 1 · 2023-10-15T23:41:24+00:00

EDITED

Thanks for providing example data and more details.

I still think that dplyr::join_by is useful here: we can specify that df2$score lies between df1$score +/- 5.

I don't know if the minimum difference in scores can be specified as part of the join, but it can be calculated after the join using a filter.

library(dplyr)


df1 %>% 
  mutate(score_lower = score - 5, 
         score_upper = score + 5) %>% 
  inner_join(df2, 
             by = join_by(id, between(y$score, x$score_lower, x$score_upper))) %>% 
group_by(id) %>% 
filter(abs(score.x - score.y) == min(abs(score.x - score.y))) %>% 
ungroup()

Result, retaining all columns to show that the join worked as expected. You could add a dplyr::select to the end of the code to retain and/or rename the desired columns.

# A tibble: 3 × 7
  id    score.x label score_lower score_upper score.y order
  <chr>   <int> <chr>       <dbl>       <dbl>   <int> <int>
1 A          14 KGULO           9          19      14     1
2 B          25 OZPTF          20          30      23     3
3 C          40 SYTVY          35          45      37     3

**jared_mamrot** · Answer 2 · 2023-10-17T05:53:44+00:00

I'm not sure if I've misunderstood your desired outcome, but here's a potential solution:

Example data:

library(dplyr)

set.seed(123)
id <- rep(c("A", "B", "C"), each = 5)
score <- sample(1:50, 15)
label <- paste(sample(LETTERS, 15 * 5, replace = TRUE), collapse = "")
label <- substring(label, seq(1, 71, by = 5), seq(5, 75, by = 5))
df1 <- data.frame(id, score, label)
df1 <- df1[order(df1$id, df1$score), ]
row.names(df1) <- 1:nrow(df1)
df1
#>    id score label
#> 1   A     3 JMGII
#> 2   A    14 KGULO
#> 3   A    15 ISDNQ
#> 4   A    31 CHZGJ
#> 5   A    42 JWUGU
#> 6   B    25 OZPTF
#> 7   B    26 KHVVG
#> 8   B    37 LMRAY
#> 9   B    43 FYBEH
#> 10  B    48 YFUOI
#> 11  C     5 BDMEV
#> 12  C     9 PLYNC
#> 13  C    27 PQVRQ
#> 14  C    28 NYWCH
#> 15  C    40 SYTVY

id <- rep(c("A", "B", "C"), each = 3)
score <- sample(1:50, 9)
df2 <- data.frame(id, score)
df2 <- df2[order(df2$id, df2$score), ]
df2 <- df2 %>% group_by(id) %>% mutate(order = row_number())
row.names(df2) <- 1:nrow(df2)
#> Warning: Setting row names on a tibble is deprecated.
df2
#> # A tibble: 9 × 3
#> # Groups:   id [3]
#>   id    score order
#> * <chr> <int> <int>
#> 1 A        14     1
#> 2 A        29     2
#> 3 A        32     3
#> 4 B         3     1
#> 5 B         7     2
#> 6 B        23     3
#> 7 C        15     1
#> 8 C        21     2
#> 9 C        37     3

Potential solution:

df1 %>%
  full_join(df2, by = join_by(id), relationship = "many-to-many") %>%
  group_by(id, score.y) %>%
  filter(abs(score.y - score.x) <= 5 & 
           abs(score.y - score.x) == min(abs(score.y - score.x)))
#> # A tibble: 5 × 5
#> # Groups:   id, score.y [5]
#>   id    score.x label score.y order
#>   <chr>   <int> <chr>   <int> <int>
#> 1 A          14 KGULO      14     1
#> 2 A          31 CHZGJ      29     2
#> 3 A          31 CHZGJ      32     3
#> 4 B          25 OZPTF      23     3
#> 5 C          40 SYTVY      37     3

In the example df2$score == 29 'matches' with df1$score == 31; add the value df1$score == 27 to the dataframe (show that values the same distance from df2$score are both retained):

df1[5,2] <- 27

df1 %>%
  full_join(df2, by = join_by(id), relationship = "many-to-many") %>%
  group_by(id, score.y) %>%
  filter(abs(score.y - score.x) <= 5 & 
           abs(score.y - score.x) == min(abs(score.y - score.x)))
#> # A tibble: 6 × 5
#> # Groups:   id, score.y [5]
#>   id    score.x label score.y order
#>   <chr>   <dbl> <chr>   <int> <int>
#> 1 A          14 KGULO      14     1
#> 2 A          31 CHZGJ      29     2
#> 3 A          31 CHZGJ      32     3
#> 4 A          27 JWUGU      29     2
#> 5 B          25 OZPTF      23     3
#> 6 C          40 SYTVY      37     3

^{Created on 2023-10-17 with reprex v2.0.2}

TechQA.

How to merge two dataframes on two variables - the first an exact match on factor variable, the second a fuzzy match for numeric variable

Here are the dataframes (do not edit this portion)

There are 2 answers

Related Questions in R

Related Questions in FUZZYJOIN

Popular Questions

Popular Tags

Trending Questions