Here are the dataframes (do not edit this portion)
library(dplyr)
set.seed(123)
id <- rep(c("A", "B", "C"), each = 5)
score <- sample(1:50, 15)
label <- paste(sample(LETTERS, 15 * 5, replace = TRUE), collapse = "")
label <- substring(label, seq(1, 71, by = 5), seq(5, 75, by = 5))
df1 <- data.frame(id, score, label)
df1 <- df1[order(df1$id, df1$score), ]
row.names(df1) <- 1:nrow(df1)
df1
id score label
1 A 3 JMGII
2 A 14 KGULO
3 A 15 ISDNQ
4 A 31 CHZGJ
5 A 42 JWUGU
6 B 25 OZPTF
7 B 26 KHVVG
8 B 37 LMRAY
9 B 43 FYBEH
10 B 48 YFUOI
11 C 5 BDMEV
12 C 9 PLYNC
13 C 27 PQVRQ
14 C 28 NYWCH
15 C 40 SYTVY
id <- rep(c("A", "B", "C"), each = 3)
score <- sample(1:50, 9)
df2 <- data.frame(id, score)
df2 <- df2[order(df2$id, df2$score), ]
df2 <- df2 %>% group_by(id) %>% mutate(order = row_number())
row.names(df2) <- 1:nrow(df2)
df2
# A tibble: 9 × 3
# Groups: id [3]
id score order
* <chr> <int> <int>
1 A 14 1
2 A 29 2
3 A 32 3
4 B 3 1
5 B 7 2
6 B 23 3
7 C 15 1
8 C 21 2
9 C 37 3
I have two identically named columns in two dataframes (df1 and df2).
1st column is 'ID' a factor variable. 2nd column is 'score' a numeric variable.
I want to do an inner join on two conditions: 1) match on ID exactly, 2) for each score on df2 join all observations that have the closest score on df1 (to a difference limit of 5) for that ID.
For the 2nd condition, this means that for each row of df2 it should join with rows of df1 with the lowest difference in score to it, up till a difference of 5.
Example: For the first row of df2, it should only join with row 2 of df1 (difference of 0 as both are 14), and not join row 3 of df1 (difference of 1).
Notes:
I tried various packages like fuzzyjoin, but I can't seem to implement the fuzzy match portion simultaneously, with max_dist = 2, distance_col = "distance".
I've scoured through stackoverflow solutions for similar topics, but none worked so far - they didn't have the same mix of 1 numeric fuzzy and 1 factor exact either..
EDITED
Thanks for providing example data and more details.
I still think that
dplyr::join_by
is useful here: we can specify thatdf2$score
lies betweendf1$score
+/- 5.I don't know if the minimum difference in scores can be specified as part of the join, but it can be calculated after the join using a
filter
.Result, retaining all columns to show that the join worked as expected. You could add a
dplyr::select
to the end of the code to retain and/or rename the desired columns.