using tapply in tapply

Question

using tapply in tapply

114 views Asked by T.G. At 02 January 2017 at 08:56

I have a big data.frame with genomic data. The data looks like this- colnames(df)=c("id","chr","start","end","log2") where id is the sample name, chr is the number of the chromosome, start and end give me the location on the chromosome, and log2 is how high/low was the read in that position.

Because there is a lot of data, and it's hard to understand what's going on, I'm trying to go over each sample (id), and for each chromosome (chr) I want to calculate the median of log2 in segments, let say all the reads that are between 1 to 10^7, 1+10^7 to 2*10^7 and so on.

the result should be a new data.frame, for each sample and each chromosome I should have several rows, with start and end indicating what segment I am in, and the last value will be the median of that segment.

I think I need to use tapply() and go over samples, and in it tapply() and go over the chromosomes, and then in each chromosome, a loop over "start" position? (let say I care only if the start coordinate is in the range) Not sure exactly how to approach this.

Any hints, tips, directions will be very appreciated.

Reproducible example-

# fabricated data, 4 samples
# 24 chromosomes in each sample
# 61 ranges in each chromosome

df <- data.frame(id = rep(c('F1','F2','M1','M2'), each = 24*61), 
                 chr = rep(rep(c(1:22,'x','y'), each = 61),4),
                 start = rep(seq(1,25*10^6 - 99, length.out = 61),times = 24*4),
                 end = rep(seq(100,25*10^6, length.out = 61),times = 24*4),
                 log2 = rnorm(4*24*61))

# output should look something like this-
id      chr     start    end       median_log_2
"F1"    "1"     1        8000000   0.002
"F1"    "1"     8000001  16000000  0.00089
"F1"    "1"     16000001 24000000  -0.0011
"F1"    "1"     24000000 25000000  0.108
"F1"    "2"     1        8000000   -0.0012
"F1"    "2"     8000001  16000000  0.0089
"F1"    "2"     16000001 24000000  0.00311
"F1"    "2"     24000000 25000000  0.0128
...
...

Original Q&A

There are 1 answers

**T.G.** · Accepted Answer · 2017-01-02T11:56:51+00:00

median_data <- tapply(df$log2, 
                      list(df$id, 
                           df$chr, 
                           cut(df$start, c(0,8*10^6,1.6*10^7,2.4*10^7,3.2*10^7,4*10^8))),
                      median)
median_data <- as.data.frame.table(median_data)

Did the job. (The output is not in the right format, but it's close enogh for me)

In tapply(), you can subset by more than one argument, using list().

TechQA.

using tapply in tapply

There are 1 answers

Related Questions in R

Related Questions in TAPPLY

Popular Questions

Popular Tags

Trending Questions