I have two dataframes, dfa and dfb:
dfa <- data.frame(
gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
id = c(1:5)
)
dfb <- data.frame(
gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
id = c(6:10)
)
which look like this:
> dfa
gene_name id
1 MUC16 1
2 MUC2 2
3 MET 3
4 FAT1 4
5 TERT 5
> dfb
gene_name id
1 MUC1 6
2 MET; BLEP 7
3 MUC21 8
4 FAT 9
5 TERT 10
dfa is my genes of interest list: I want to keep the dfb rows where they appear, minding the digits (MUC1 is not MUC16). My new_df should look like this:
> new_df
gene_name id
1 MET; BLEP 7
2 TERT 10
My problem is that the regular dplyr::semi_join() does exact matches, which doesn't take into account the fact that dfb$gene_names can contain genes separated with "; ". Meaning that with this example, "MET" is not retained.
I tried to look into fuzzyjoin::regex_semi_join, but I can't make it do what I want...
A tidyverse solution would be welcome. (Maybe with stringr?!)
EDIT: Follow-up question...
How would I go about to do the reciprocal anti_join? Simply changing semi_join to anti_join in this method doesn't work because the row MET; BLEP is present when it shouldn't be...
Adding a filter(gene_name == new_col) after the anti_join works with the provided simple dataset, but if I twist it a little like this:
dfa <- data.frame(
gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
id = c(1:5)
)
dfb <- data.frame(
gene_name = c("MUC1", "MET; BLEP", "MUC21; BLOUB", "FAT", "TERT"),
id = c(6:10)
)
...then it doesn't anymore. Here and in my real-life dataset, dfa doesn't contain semicolons, it's only one column of individual gene names. But dfb contains a lot of information, and multiple combinations of semicolons...
I think I finally managed to make
fuzzyjoin::regex_joinsdo what I want. It was ridiculously simple, I just had to tweak mydfafilter list:One drawback though: it's quite slow...