count values in one dataframe in another dataframe

91 views Asked by At

Assume I have a dataframe like this:

df <- data.frame(
  D = c('A', 'B', 'C'),
  Q = c('asd', 'reg', 'rt'),
  id = I(list(c(124, 532, 78), c(1, 3, 532), c(2, 3, 78, 124, 1)))
)

and another one like this:

clusters <- data.frame(
 g = I(list(c(124, 78, 1),c(2, 3, 89),c(532, 533)))
)

I want to count the times the elements in each id list are in the groups g1, g2, g3 (each row in clusters) when it's more than once. So basically count the intersection whenever the intersection > 1.

The output should be:

out <- data.frame(
  D = c('A', 'B', 'C'),
  Q = c('asd', 'reg', 'rt'),
  id = I(list(c(124, 532, 78), c(1, 3, 532), c(2,3,78, 124, 1))),
  count = c(2, 0, 5)
)

I cannot do it with for loop cause it will take forever as I have 70k rows.

A for loop would look like this:

for row in df:
  for group in clusters:
     if group intersect row$id > 1
        count=count+intersection
  count=0
3

There are 3 answers

6
ThomasIsCoding On BEST ANSWER

Probably you can try base R

transform(
    df,
    count = rowSums(outer(id, clusters$g, \(...) mapply(\(...) {
        l <- length(intersect(...))
        l * (l > 1)
    }, ...)))
)

or dplyr

df %>%
    left_join(df %>%
        unnest(id) %>%
        left_join(clusters %>%
            mutate(grp = row_number()) %>%
            unnest(g), by = join_by(id == g)) %>%
        summarise(count = {
            d <- table(grp)
            sum(d[d > 1])
        }, .by = c(D, Q)))

which gives

  D   Q           id count
1 A asd 124, 532, 78     2
2 B reg    1, 3, 532     0
3 C  rt 2, 3, 78....     5
0
r2evans On

There's no way around some bespoke code.

sapply(df$id, function(id1) {
  lens <- sapply(clusters$g, function(z) length(intersect(z, id1)))
  sum(replace(lens, lens == 1, 0))
})
# [1] 2 0 5

df <- structure(list(D = c("A", "B", "C"), Q = c("asd", "reg", "rt"), id = structure(list(c(124, 532, 78), c(1, 3, 532), c(2, 3, 78, 124, 1)), class = "AsIs")), class = "data.frame", row.names = c(NA, -3L))
0
M-- On

Here's another tidyverse solution, but you're probably better off using base solutions provided for performance.

library(tidyverse)

df %>% 
  unnest(id) %>% 
  left_join({clusters %>% 
      rownames_to_column("r") %>% 
      unnest(g)}, by = c("id" = "g")) %>% 
  add_count(D, Q, r) %>% 
  summarise(id = list(id), n = first(n * (n > 1)), .by = c(D, Q, r)) %>% 
  summarise(id = list(unlist(id)), count = sum(n) , .by = c(D, Q)) %>% 
  as.data.frame()

#>   D   Q               id count
#> 1 A asd     124, 78, 532     2
#> 2 B reg        1, 3, 532     0
#> 3 C  rt 2, 3, 78, 124, 1     5

Created on 2024-01-22 with reprex v2.0.2