Identifying list elements contained in another list that are both elements of a data frame

Question

63 views Asked by aholtz At 26 July 2022 at 16:40

I have two data frames, DF1 and DF2, each with two columns (a, b). Column a is a unique identifier; column b is a list column in which each element is a character vector of label names. For each element of DF2$b, I would like to check whether all of its labels are contained in an element of DF1$b and, if so, create a new column, DF2$c, that takes the corresponding identifier from DF1$a. The tricky part is that I only want the identifier of the smallest element of DF1$b that contains those labels. As some background, this data is from a phylogenetic tree. DF2 is a subsample of DF1, and all tips in DF2 are contained in DF1. I want to compare the nodes of DF2 to those of DF1 (the node names are different), but I can identify the nodes from the tips that are descendants of each.

It would be easier if I explain with an example:

df1 <- data.frame(a = c(1486, 1485, 1484, 1483, 1482, 1481, 1480, 1479))
df1$b = list(c("KC792204", "KF150733", "KC792205"), c("KC792204", "KF150733", "KC792205", "JX987740", "KX148108", "JX987724"), c("KC792204", "KF150733", "KC792205", "KC791848"), c("KJ201900", "KJ201899", "KF535207"), c("KJ201900", "KJ201899", "KF535207", "AB817119", "AB817100"), c("GU731662", "GU731661", "KP319229", "KY428876"), c("GU731662", "GU731661", "MT826960"), c("GU731662", "GU731661", "MT826960", "AM689535", "GU731663"))

df2 <- data.frame(a = c(8645, 1247, 5879, 1548, 2487, 1245, 1247, 3695))
df2$b = list(c("KC792204", "KF150733"), c("KC792204", "KC792205", "KC791848"), c("KJ201900", "KF535207"), c("KC792204", "JX987740", "KX148108", "JX987724"), c("GU731662", "GU731661", "MT826960", "GU731663"), c("KJ201900", "KJ201899", "AB817119", "AB817100"), c("GU731661", "KP319229", "KY428876"), c("GU731662", "MT826960"))

I'd like to create a new column in df2, df2$c, which identifies the smallest list (or node) in df1 that contains df2$b. This new column is filled from df1$a (the unique identifier). In the example, df2$c would be, in order:

c(1486, 1484, 1483, 1485, 1479, 1482, 1481, 1480)

To take the first two as an example:

df2$b[[1]] is c("KC792204", "KF150733")

This can be found in df1$b[[1]], df1$b[[2]], and df1$b[[3]], i.e. 1486, 1485, and 1484. Since I am looking for the smallest list, the result is 1486: it is the shortest list that contains all the labels searched for. The next list in df2$b is c("KC792204", "KC792205", "KC791848"). The result is 1484, since only list 1484 in df1$b contains those three labels.

I have tried:

df2$c <- ifelse(df2$b %in% df1$b, df1$a, 'other')

But I am instead comparing the lists as a whole rather than the elements inside each list. I also need to find the smallest of the lists that contain the searched labels.
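To illustrate the difference described above, here is a minimal sketch on made-up toy labels (x and y are hypothetical stand-ins for df2$b and df1$b, not data from the question). Applied to two lists, %in% in effect matches each element as a whole, whereas all(k %in% i) asks whether every label in k occurs in i.

```r
# Toy stand-ins (hypothetical data, not from the question)
x <- list(c("A", "B"), c("A", "C"))        # plays the role of df2$b
y <- list(c("A", "B", "C"), c("A", "B"))   # plays the role of df1$b

# Whole-element comparison: an element of x only matches an element of y
# that holds the very same vector, so subsets such as c("A", "C") are missed
whole <- x %in% y

# Element-wise containment: rows index y, columns index x
contained <- sapply(x, function(k) sapply(y, function(i) all(k %in% i)))
contained
#>      [,1]  [,2]
#> [1,] TRUE  TRUE
#> [2,] TRUE FALSE
```

Each TRUE in a column marks a candidate in y; picking the shortest of those candidates is the remaining step.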

There are 2 answers

jblood94 On 26 July 2022 at 17:45

Here is one option:

library(data.table) # for %chin%

df1 <- data.frame(a = c(1486, 1485, 1484, 1483, 1482, 1481, 1480, 1479))
df1$b = list(c("KC792204", "KF150733", "KC792205"), c("KC792204", "KF150733", "KC792205", "JX987740", "KX148108", "JX987724"), c("KC792204", "KF150733", "KC792205", "KC791848"), c("KJ201900", "KJ201899", "KF535207"), c("KJ201900", "KJ201899", "KF535207", "AB817119", "AB817100"), c("GU731662", "GU731661", "KP319229", "KY428876"), c("GU731662", "GU731661", "MT826960"), c("GU731662", "GU731661", "MT826960", "AM689535", "GU731663"))

df2 <- data.frame(a = c(8645, 1247, 5879, 1548, 2487, 1245, 1247, 3695))
df2$b = list(c("KC792204", "KF150733"), c("KC792204", "KC792205", "KC791848"), c("KJ201900", "KF535207"), c("KC792204", "JX987740", "KX148108", "JX987724"), c("GU731662", "GU731661", "MT826960", "GU731663"), c("KJ201900", "KJ201899", "AB817119", "AB817100"), c("GU731661", "KP319229", "KY428876"), c("GU731662", "MT826960"))

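# The outer() call builds a logical matrix: entry [i, j] is TRUE when every
# label in df2$b[[j]] is found in df1$b[[i]].
# Dividing by lengths(df1$b) scores each match as 1/length, so the shortest
# containing vector has the highest score in its column.
# Rfast::colMaxs() returns the row index of each column maximum by default
# (value = FALSE), which is then used to index df1$a.
df2$c <- df1$a[
  Rfast::colMaxs(
    outer(
      seq_along(df1$b),
      seq_along(df2$b),
      function(i, j) mapply(
        function(x, y) all(y %chin% x),
        df1$b[i],
        df2$b[j]
      )
    )/lengths(df1$b)
  )
]
df2$c
#> [1] 1486 1484 1483 1485 1479 1482 1481 1480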

If it's possible for a row to have no match, then the above should be modified:

m <- outer(
  seq_along(df1$b),
  seq_along(df2$b),
  function(i, j) mapply(
    function(x, y) all(y %chin% x),
    df1$b[i],
    df2$b[j]
  )
)
df2$c <- ifelse(colSums(m) == 0L, NA, df1$a[Rfast::colMaxs(m/lengths(df1$b))])
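For readers who do not have Rfast or data.table installed, the same matrix idea can be sketched in base R. This is an illustrative variant, not the answer's code: it assumes %in% and which.max() as stand-ins for %chin% and Rfast::colMaxs(), runs on a three-row subset of df1, and uses a made-up id (9999) and label ("ZZ000000") to show the no-match case.

```r
# Three-row subset of df1 from the question
df1 <- data.frame(a = c(1486, 1485, 1484))
df1$b <- list(
  c("KC792204", "KF150733", "KC792205"),
  c("KC792204", "KF150733", "KC792205", "JX987740", "KX148108", "JX987724"),
  c("KC792204", "KF150733", "KC792205", "KC791848")
)

df2 <- data.frame(a = c(8645, 1247, 9999))  # 9999 is a made-up id
df2$b <- list(
  c("KC792204", "KF150733"),
  c("KC792204", "KC792205", "KC791848"),
  c("KC792204", "ZZ000000")  # made-up label: no vector in df1$b contains it
)

# m[i, j] is TRUE when every label in df2$b[[j]] occurs in df1$b[[i]]
m <- sapply(df2$b, function(q) {
  sapply(df1$b, function(cand) all(q %in% cand))
})

# Score each match as 1/length so the shortest containing vector wins
score <- m / lengths(df1$b)
best <- apply(score, 2, which.max)

# Columns with no TRUE at all get NA instead of a spurious first row
df2$c <- ifelse(colSums(m) == 0, NA, df1$a[best])
df2$c
#> [1] 1486 1484   NA
```

Note that the nested sapply() only yields a matrix when df1 has more than one row; a single-row df1 would need the result reshaped explicitly.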

langtang (accepted answer) On 26 July 2022 at 17:56

Here is an approach using data.table and a helper function:

library(data.table)
setDT(df1)[, l:=sapply(b,length)]
f <- function(k) df1[sapply(df1$b,\(i) all(k %chin% i))][l==min(l),a]
setDT(df2)[, c:=sapply(b,f)]

Output (df2)

       a                                   b     c
   <num>                              <list> <num>
1:  8645                   KC792204,KF150733  1486
2:  1247          KC792204,KC792205,KC791848  1484
3:  5879                   KJ201900,KF535207  1483
4:  1548 KC792204,JX987740,KX148108,JX987724  1485
5:  2487 GU731662,GU731661,MT826960,GU731663  1479
6:  1245 KJ201900,KJ201899,AB817119,AB817100  1482
7:  1247          GU731661,KP319229,KY428876  1481
8:  3695                   GU731662,MT826960  1480

Explanation:

Line 1: loads the data.table library
Line 2: adds a column to df1 holding the length (l) of each vector in b
Line 3: defines a helper function (f) that receives a character vector (k), keeps only the rows of df1 for which all elements of k are found in b, and, of these rows, returns the a value for which l is smallest
Line 4: applies f to each element of b in df2, assigning the result to c
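The steps above can also be sketched without data.table. The following is a rough base-R counterpart under stated assumptions: the name f_base is made up, %in% replaces %chin%, and only three rows of each data frame from the question are used to keep it short.

```r
# Three rows of df1 and three rows of df2 from the question
df1 <- data.frame(a = c(1481, 1480, 1479))
df1$b <- list(
  c("GU731662", "GU731661", "KP319229", "KY428876"),
  c("GU731662", "GU731661", "MT826960"),
  c("GU731662", "GU731661", "MT826960", "AM689535", "GU731663")
)
df2 <- data.frame(a = c(2487, 1247, 3695))
df2$b <- list(
  c("GU731662", "GU731661", "MT826960", "GU731663"),
  c("GU731661", "KP319229", "KY428876"),
  c("GU731662", "MT826960")
)

# Counterpart of line 2: length of each vector in df1$b
df1$l <- lengths(df1$b)

# Counterpart of line 3: hypothetical helper (the name f_base is made up)
f_base <- function(k) {
  # TRUE for each row of df1 whose vector contains every label in k
  hit <- vapply(df1$b, function(i) all(k %in% i), logical(1))
  if (!any(hit)) return(NA_real_)          # no containing vector at all
  df1$a[hit][which.min(df1$l[hit])]        # first match with the smallest length
}

# Counterpart of line 4: apply the helper to every element of df2$b
df2$c <- vapply(df2$b, f_base, numeric(1))
df2$c
#> [1] 1479 1481 1480
```

One design difference: which.min() keeps only the first of several equally short matches, whereas the l==min(l) filter in the answer returns all of them, so the two can differ when lengths tie.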

Update Aug 4 2022:

The OP has asked if the helper function above might be adjusted so that the a value is selected from among the rows in df1 where at least a proportion p of the elements of df2$b are found in df1$b. Here is one possible adjustment to the helper function:

f <- function(k, p=1) {
  df1[sapply(df1$b,\(i) mean(k %chin% i)>=p)][l==min(l),a]
}

Note that since k %chin% i returns a logical vector, its mean is simply the proportion of elements in k (i.e. df2$b) that are in i (i.e. df1$b). sapply() still returns a logical vector, because we then ask whether that proportion equals or exceeds p. Since the default of p is 1, using this function without specifying p returns the same result as the original helper function.
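As a small illustration of the proportion logic, here is a sketch on made-up labels (k and i below are hypothetical stand-ins for one element of df2$b and one element of df1$b). It uses base R's %in%, which I assume behaves like %chin% for plain character vectors.

```r
k <- c("A", "B", "C", "D")   # made-up labels, one element of df2$b
i <- c("A", "B", "C", "X")   # made-up labels, one element of df1$b

k %in% i
#> [1]  TRUE  TRUE  TRUE FALSE

# TRUE counts as 1 and FALSE as 0, so the mean is the share of k found in i
prop <- mean(k %in% i)
prop
#> [1] 0.75

prop >= 1      # default p = 1 requires every label to match
#> [1] FALSE
prop >= 0.75   # a lower threshold accepts this partial match
#> [1] TRUE
```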

To select the a value from among the rows where at least 95% of the elements match, call the modified helper function, f, as before, but set p to 0.95:

setDT(df2)[, c:=sapply(b,f, p=0.95)]
