I have a dataframe with multiple columns of information for example:
df <- data.frame(chr=c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr2", "chr2"), Gene=c("Happy", "Happy", "Happy", "Happy", "Happy", "Happy", "Happy", "Happy", "Sad", "Sad"), site = c(100, 120, 130, 300, 2000, 2300, 2342, 2451, 120, 123), value=c(20, 25, 21, 30, -80, 31, -79, -90, 10, 13))
> df
chr Gene site value
1 chr1 Happy 100 20
2 chr1 Happy 120 25
3 chr1 Happy 130 21
4 chr1 Happy 300 30
5 chr1 Happy 2000 -80
6 chr1 Happy 2300 31
7 chr1 Happy 2342 -79
8 chr1 Happy 2451 -90
9 chr2 Sad 120 10
10 chr2 Sad 123 13
I would like to create a summary dataframe that calculates for each Gene how many clustered regions there are. I consider a cluster any number of rows where the difference in the site number is no greater than 1,000 (my data is sorted by chr and sites). To start I created a new column to calculate the distance between sites in successive rows using:
df$Distance <- c(1001, diff(df$site, lag=1, differences=1))
> df
chr Gene site value Distance
1 chr1 Happy 100 20 1001
2 chr1 Happy 120 25 20
3 chr1 Happy 130 21 10
4 chr1 Happy 300 30 170
5 chr1 Happy 2000 -80 1700
6 chr1 Happy 2300 31 300
7 chr1 Happy 2342 -79 42
8 chr1 Happy 2451 -90 109
9 chr2 Sad 120 10 -2331
10 chr2 Sad 123 13 3
I would like to create a summary table with a row for each gene that summarizes how many clusters are found within each gene where the average value is either positive or negative. In the above example the table would look like:
Gene PositiveClusters NegativeClusters
1 Happy 1 1
2 Sad 1 0
Here's a data.table solution - but I have a feeling there's a more efficient way...
So this converts
df
to a data.table and adds a columncluster
based on thecumsum(diff(site)>1000)
, grouped byGene
. This is a very typical pattern for generating grouping variables.Then we add a column
mean
which ismean(value)
grouped by bothGene
andcluster
.Then we create a new data.table that has the counts of each cluster type for mean either positive (>= 0) or negative (< 0), grouped by
Gene
.