Extract intersection list from upset object

5.9k views Asked by At

I'm making some comparisons with UpSetR, and I'd like to save the lists of elements that fall into each intersection. Is this possible? I can't find it anywhere...

It would be pretty tedious to do it manually (many lists), and since they're calculated anyway not being able to save them is frustrating

3

There are 3 answers

0
zx8754 On BEST ANSWER

There is no ready upSetR function for this (yet). But, it is possible to extract it:

library(UpSetR)

# Example input as list, expected output is 1 and 5:
listInput <- list(one = c(1, 2, 3, 5, 7, 8, 11, 12, 13), 
                  two = c(1, 2, 4, 5, 10),
                  three = c(1, 5, 6, 7, 8, 9, 10, 12, 13))

When assigned upset returns a value, which also includes the data:

x <- upset(fromList(listInput))
x$New_data
#    one two three
# 1    1   1     1
# 2    1   1     0
# 3    1   0     0
# 4    1   1     1
# 5    1   0     1
# 6    1   0     1
# 7    1   0     0
# 8    1   0     1
# 9    1   0     1
# 10   0   1     0
# 11   0   1     1
# 12   0   0     1
# 13   0   0     1

From here we can see it is 1st and the 4th rows are found in all three sets. The order of items are defined based on the order they appear in the list, see:

x1 <- unlist(listInput, use.names = FALSE)
x1 <- x1[ !duplicated(x1) ]
x1
# [1]  1  2  3  5  7  8 11 12 13  4 10  6  9

Now we know the rownumbers from "New_data" refer to in our list. So, as we have 3 columns, filter rows where sum is 3:

x1[ rowSums(x$New_data) == 3 ]
# [1] 1 5

Or we could just use Reduce:

Reduce(intersect, listInput)
# [1] 1 5
1
Javier Herrero On

Here is my take at extracting the different intersections together with the list of elements in them.

The main idea is to paste all the 0's and 1's from the binary table to create unique identifiers for each intersection and them use the dplyr::group_by function to extract the information

data <- data.frame(
  entry = paste0("Entry.", 1:10),
  "A" = c(0,0,0,0,1,0,1,1,0,0),
  "B" = c(1,0,0,0,1,1,1,1,1,0),
  "C" = c(1,1,1,1,0,0,1,0,1,1)
)

# NOT REQUIRED. Only to confirm that upset works with these data
upset(data)

You can then identify the intersections by pasting all the binary columns. I use the unite convenience function for this:

NB: you may have to change this depending on whether your data has row names or a column with names

data_with_intersection <- data %>%
  unite(col = "intersection", -c("entry"), sep = "")

From here, you can easily calculate the size of each intersection:

# Table of intersections and the number of entries
data_with_intersection %>%
  group_by(intersection) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

Or even extract the list of entries/elements in each intersection:

# List of intersections and their entries
data_with_intersection %>%
  group_by(intersection) %>%
  summarise(list = list(entry)) %>%
  mutate(list = setNames(list, intersection)) %>%
  pull(list)
1
edoardo pedrini On

Here is my solution. It is bio-related, but should be easy to translate to other fields. I start with a list of vectors. In my case a list of genes (list of characters) belonging to different signatures (different sets).

str(list_filter)
List of 9
 $ CellAge_Induces         : chr [1:153] "AAK1" "ABI3" "ADCK5" "AGT" ...
 $ CellAge_Inhibits        : chr [1:121] "ACLY" "AKR1B1" "ASPH" "ATF7IP" ...
 $ CLASSICAL_SASP          : chr [1:38] "BGN" "CCL2" "CCL20" "COL1A1" ...
 $ FRIDMAN_SENESCENCE_UP   : chr [1:77] "ALDH1A3" "CCND1" "CD44" "CDKN1A" ...
 $ ISM_SCORE               : chr [1:128] "HSH2D" "OTOF" "TRIM69" "PSME1" ...
 $ MOSERLE_IFNA_RESPONSE   : chr [1:31] "CD274" "CMPK2" "CXCL10" "DDX58" ...
 $ REACTOME_SENESCENCE_SASP: chr [1:110] "ANAPC1" "ANAPC10" "ANAPC11" "ANAPC15" ...
 $ SAEPHIA_CURATED_SASP    : chr [1:38] "IL1A" "IL1B" "CXCL10" "CXCL1" ...
 $ senmayo                 : chr [1:125] "ACVR1B" "ANG" "ANGPT1" "ANGPTL4" ...

From this list, I generate two tables: One with the unique gene names

df2 <- data.frame(gene=unique(unlist(list_filter)))

head(df2)
     gene
1    AAK1
2    ABI3
3   ADCK5
4     AGT
5    AKT1
6 ALOX15B

dim(df2)
[1] 671   1

One is simply a "dataframe" version of the list. With every gene in the signature and the name of every signature (set).

df1 <- lapply(list_filter,function(x){
  data.frame(gene = x)
}) %>% 
  bind_rows(.id = "path")

head(df1)
             path    gene
1 CellAge_Induces    AAK1
2 CellAge_Induces    ABI3
3 CellAge_Induces   ADCK5
4 CellAge_Induces     AGT
5 CellAge_Induces    AKT1
6 CellAge_Induces ALOX15B

dim(df1)
[1] 821   2

now I iterate the search of each unique gene name and save the identity of the signatures in a column.

df_int <- lapply(df2$gene,function(x){
  # pull the name of the intersections
  intersection <- df1 %>% 
    dplyr::filter(gene==x) %>% 
    arrange(path) %>% 
    pull("path") %>% 
    paste0(collapse = "|")
  
  # build the dataframe
  data.frame(gene = x,int = intersection)
}) %>% 
  bind_rows()

head(df_int,n=20)
      gene                                                            int
1     AAK1                                                CellAge_Induces
2     ABI3                                                CellAge_Induces
3    ADCK5                                                CellAge_Induces
4      AGT                                                CellAge_Induces
5     AKT1                                                CellAge_Induces
6  ALOX15B                                                CellAge_Induces
7       AR                                                CellAge_Induces
8   ARPC1B                                                CellAge_Induces
9    ASF1A                                                CellAge_Induces
10     AXL                                        CellAge_Induces|senmayo
11 BHLHE40                                                CellAge_Induces
12     BLK                                                CellAge_Induces
13    BRAF                                                CellAge_Induces
14    BRD7                                                CellAge_Induces
15    CAV1                                                CellAge_Induces
16   CCND1                          CellAge_Induces|FRIDMAN_SENESCENCE_UP
17   CDK18                                                CellAge_Induces
18  CDKN1A CellAge_Induces|FRIDMAN_SENESCENCE_UP|REACTOME_SENESCENCE_SASP
19  CDKN1C                          CellAge_Induces|FRIDMAN_SENESCENCE_UP
20  CDKN1B                       CellAge_Induces|REACTOME_SENESCENCE_SASP

dim(df_int)
[1] 671   2

the dataframe can be summarised and compared to the output provided by calling

df_int %>% 
  group_by(int) %>% 
  summarise(n=n()) %>% 
  arrange(desc(n))
# A tibble: 47 × 2
   int                                 n
   <chr>                           <int>
 1 CellAge_Induces                   126
 2 CellAge_Inhibits                  110
 3 REACTOME_SENESCENCE_SASP           95
 4 ISM_SCORE                          93
 5 senmayo                            77
 6 FRIDMAN_SENESCENCE_UP              44
 7 ISM_SCORE|MOSERLE_IFNA_RESPONSE    27
 8 CLASSICAL_SASP|senmayo             12
 9 CLASSICAL_SASP                      8
10 SAEPHIA_CURATED_SASP                8
# … with 37 more rows
# ℹ Use `print(n = ...)` to see more rows


upset(fromList(list_filter),nsets = 10) 

enter image description here