R calculate summary dataframe from dataframe with multiple columns of information

171 views Asked by At

I have a dataframe with multiple columns of information for example:

df <- data.frame(chr=c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr2", "chr2"), Gene=c("Happy", "Happy", "Happy", "Happy", "Happy", "Happy", "Happy", "Happy", "Sad", "Sad"), site = c(100, 120, 130, 300, 2000, 2300, 2342, 2451, 120, 123), value=c(20, 25, 21, 30, -80, 31, -79, -90, 10, 13))

> df
    chr  Gene site value
1  chr1 Happy  100    20
2  chr1 Happy  120    25
3  chr1 Happy  130    21
4  chr1 Happy  300    30
5  chr1 Happy 2000   -80
6  chr1 Happy 2300    31
7  chr1 Happy 2342   -79
8  chr1 Happy 2451   -90
9  chr2   Sad  120    10
10 chr2   Sad  123    13

I would like to create a summary dataframe that calculates for each Gene how many clustered regions there are. I consider a cluster any number of rows where the difference in the site number is no greater than 1,000 (my data is sorted by chr and sites). To start I created a new column to calculate the distance between sites in successive rows using:

df$Distance <- c(1001, diff(df$site, lag=1, differences=1))

> df
    chr  Gene site value Distance
1  chr1 Happy  100    20     1001
2  chr1 Happy  120    25       20
3  chr1 Happy  130    21       10
4  chr1 Happy  300    30      170
5  chr1 Happy 2000   -80     1700
6  chr1 Happy 2300    31      300
7  chr1 Happy 2342   -79       42
8  chr1 Happy 2451   -90      109
9  chr2   Sad  120    10    -2331
10 chr2   Sad  123    13        3

I would like to create a summary table with a row for each gene that summarizes how many clusters are found within each gene where the average value is either positive or negative. In the above example the table would look like:

   Gene PositiveClusters NegativeClusters
1 Happy                1                1
2   Sad                1                0
1

There are 1 answers

0
jlhoward On

Here's a data.table solution - but I have a feeling there's a more efficient way...

library(data.table)
setDT(df)[,cluster:=c(0,cumsum(diff(site)>1000)),by=Gene]
df[,mean:=mean(value),by=list(Gene,cluster)]
df[,list(pos=length(unique(cluster[mean>=0])),
         neg=length(unique(cluster[mean<0]))),by=Gene]
#     Gene pos neg
# 1: Happy   1   1
# 2:   Sad   1   0

So this converts df to a data.table and adds a column cluster based on the cumsum(diff(site)>1000), grouped by Gene. This is a very typical pattern for generating grouping variables.

Then we add a column mean which is mean(value) grouped by both Gene and cluster.

Then we create a new data.table that has the counts of each cluster type for mean either positive (>= 0) or negative (< 0), grouped by Gene.