I have two data frames, DF1, DF2, each with two columns (a, b). One column (a) is a unique identifier the other is a column (b) with elements that contain a list. The list contains label names. I would like to search DF2$b elements to see if they are contained inside DF1$b, if so I'd like to create a new column, DF2$c, which takes the identifier in DF1a. The tricky part, is that I'd only like to take the unique identifier if it's the smallest union that exists in the data frame. As some background, this data is from a phylogenetic tree. The DF2 is a subsample of DF1. All tips in DF2 are contained in DF1. I want to compare the nodes of DF2 to DF1 (the node names are different), but I can identify the nodes from the tips that are descendents from each.
It would be easier if I explain with an example:
df1 <- data.frame(a = c(1486, 1485, 1484, 1483, 1482, 1481, 1480, 1479))
df1$b = list(c("KC792204", "KF150733", "KC792205"), c("KC792204", "KF150733", "KC792205", "JX987740", "KX148108", "JX987724"), c("KC792204", "KF150733", "KC792205", "KC791848"), c("KJ201900", "KJ201899", "KF535207"), c("KJ201900", "KJ201899", "KF535207", "AB817119", "AB817100"), c("GU731662", "GU731661", "KP319229", "KY428876"), c("GU731662", "GU731661", "MT826960"), c("GU731662", "GU731661", "MT826960", "AM689535", "GU731663"))
df2 <- data.frame(a = c(8645, 1247, 5879, 1548, 2487, 1245, 1247, 3695))
df2$b = list(c("KC792204", "KF150733"), c("KC792204", "KC792205", "KC791848"), c("KJ201900", "KF535207"), c("KC792204", "JX987740", "KX148108", "JX987724"), c("GU731662", "GU731661", "MT826960", "GU731663"), c("KJ201900", "KJ201899", "AB817119", "AB817100"), c("GU731661", "KP319229", "KY428876"), c("GU731662", "MT826960"))
I'd like to create a new column in df2, df2$c, which identifies the smallest list (or node) in df1 that contains df2$b. This new column is made by df1$a (the unique identifier). In the example, df2$c (in order would be)
c("1486,1484,1483,1485,1479,1482,1481,1480")
To take the first two as an example:
df2$a is c("KC792204", "KF150733")
This can be found in df1$b[1], df1$b[2], df1$b[3], or 1486, 1485, or 1484. Since I am looking for the smallest length list, the result is 1486. 1486 is the smallest length list that contains all labels that are searched. The next list in df2$b is c("KC792204", "KF150733", "KC791848"). This result is 1484, since only list 1484 in df$1b contains those three labels.
I have tried:
df2$c <- ifelse(df2$b %in% df1$b, df1$a, 'other')
But I am instead comparing the lists as a whole rather than the elements inside each list. I also need to find the smallest of the lists that contain the searched labels.
Here is an approach, using
data.table, and a helper functionOutput (df2)
Explanation:
df1that indicates the length (l) of the vector inbf) that receives a character vector (k), checks to limit the rows indf1to those for which all elements ofkare found inb, and of these rows, returns theavalue for whichlis minimizedfto each value ofbindf2, assigning the result tocUpdate Aug 4 2022:
The OP has asked if the helper function above might be adjusted so that the
avalue is selected from among the rows indf1where more thanpelements ofdf2$bare found indf1$b.. Here is one possible adjustment to the helper function:Note that since
k %chin% ireturns a logical vector, the mean of this is simply the proportion of elements in k (i.e. df2$b) that are inii.e.df1%b..sapply()still returns a logical vector, because we then ask if that proportion equals or exceedsp. Note that since the default ofpis1, using this function without specifyingpwill return the same result as the original helper function.To select the
avalue from among the rows whether 95% of the elements match, call the modified helper function,f, as before, but setpto 0.95