How to semi_join two dataframes by string column with one being colon-separated

496 views Asked by At

I have two dataframes, dfa and dfb:

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = c(1:5)
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
  id = c(6:10)
)

which look like this:

> dfa
  gene_name id
1     MUC16  1
2      MUC2  2
3       MET  3
4      FAT1  4
5      TERT  5

> dfb
  gene_name id
1      MUC1  6
2 MET; BLEP  7
3     MUC21  8
4       FAT  9
5      TERT 10

dfa is my genes of interest list: I want to keep the dfb rows where they appear, minding the digits (MUC1 is not MUC16). My new_df should look like this:

> new_df
  gene_name id
1 MET; BLEP  7
2      TERT 10

My problem is that the regular dplyr::semi_join() does exact matches, which doesn't take into account the fact that dfb$gene_names can contain genes separated with "; ". Meaning that with this example, "MET" is not retained.

I tried to look into fuzzyjoin::regex_semi_join, but I can't make it do what I want...

A tidyverse solution would be welcome. (Maybe with stringr?!)

EDIT: Follow-up question...

How would I go about to do the reciprocal anti_join? Simply changing semi_join to anti_join in this method doesn't work because the row MET; BLEP is present when it shouldn't be...

Adding a filter(gene_name == new_col) after the anti_join works with the provided simple dataset, but if I twist it a little like this:

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = c(1:5)
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21; BLOUB", "FAT", "TERT"),
  id = c(6:10)
)

...then it doesn't anymore. Here and in my real-life dataset, dfa doesn't contain semicolons, it's only one column of individual gene names. But dfb contains a lot of information, and multiple combinations of semicolons...

3

There are 3 answers

0
MonkeyBack On BEST ANSWER

I think I finally managed to make fuzzyjoin::regex_joins do what I want. It was ridiculously simple, I just had to tweak my dfa filter list:

library(fuzzyjoin)

# add "\b" regex expression before/after each gene of the list to filtrate from
# (to search for whole words)
dfa$gene_name <- paste0("\\b", dfa$gene_name, "\\b")

# to keep genes from dfb that are present in the dfa filter list
dfb %>% 
  regex_semi_join(dfa, by = c(gene_name = "gene_name"))

# to exclude genes from dfb that are present in the dfa filter blacklist
dfb %>% 
  regex_anti_join(dfa, by = c(gene_name = "gene_name"))

One drawback though: it's quite slow...

6
Sada93 On

You can use seperate_rows() to split the dataframe before joining. Note that if BLEP existed in dfa, it would result in a duplicate, which is why distinct is used

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = c(1:5),
  stringsAsFactors = FALSE
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
  id = c(6:10),
  stringsAsFactors = FALSE
)


library(tidyverse)

dfb%>%
  mutate(new_col = gene_name)%>%
  separate_rows(new_col,sep = "; ")%>%
  semi_join(dfa,by = c("new_col" = "gene_name"))%>%
  select(gene_name,id)%>%
  distinct()


0
Ben G On

Here's a solution using stringr and purrr.

library(tidyverse)

dfb %>%
 mutate(gene_name_list = str_split(gene_name, "; ")) %>%
 mutate(gene_of_interest = map_lgl(gene_name_list, some, ~ . %in% dfa$gene_name)) %>%
 filter(gene_of_interest == TRUE) %>%
 select(gene_name, id)