using tapply in tapply

112 views Asked by At

I have a big data.frame with genomic data. The data looks like this- colnames(df)=c("id","chr","start","end","log2") where id is the sample name, chr is the number of the chromosome, start and end give me the location on the chromosome, and log2 is how high/low was the read in that position.

Because there is a lot of data, and it's hard to understand what's going on, I'm trying to go over each sample (id), and for each chromosome (chr) I want to calculate the median of log2 in segments, let say all the reads that are between 1 to 10^7, 1+10^7 to 2*10^7 and so on.

the result should be a new data.frame, for each sample and each chromosome I should have several rows, with start and end indicating what segment I am in, and the last value will be the median of that segment.

I think I need to use tapply() and go over samples, and in it tapply() and go over the chromosomes, and then in each chromosome, a loop over "start" position? (let say I care only if the start coordinate is in the range) Not sure exactly how to approach this.

Any hints, tips, directions will be very appreciated.

Reproducible example-

# fabricated data, 4 samples
# 24 chromosomes in each sample
# 61 ranges in each chromosome

df <- data.frame(id = rep(c('F1','F2','M1','M2'), each = 24*61), 
                 chr = rep(rep(c(1:22,'x','y'), each = 61),4),
                 start = rep(seq(1,25*10^6 - 99, length.out = 61),times = 24*4),
                 end = rep(seq(100,25*10^6, length.out = 61),times = 24*4),
                 log2 = rnorm(4*24*61))

# output should look something like this-
id      chr     start    end       median_log_2
"F1"    "1"     1        8000000   0.002
"F1"    "1"     8000001  16000000  0.00089
"F1"    "1"     16000001 24000000  -0.0011
"F1"    "1"     24000000 25000000  0.108
"F1"    "2"     1        8000000   -0.0012
"F1"    "2"     8000001  16000000  0.0089
"F1"    "2"     16000001 24000000  0.00311
"F1"    "2"     24000000 25000000  0.0128
...
...
1

There are 1 answers

0
T.G. On BEST ANSWER
median_data <- tapply(df$log2, 
                      list(df$id, 
                           df$chr, 
                           cut(df$start, c(0,8*10^6,1.6*10^7,2.4*10^7,3.2*10^7,4*10^8))),
                      median)
median_data <- as.data.frame.table(median_data)

Did the job. (The output is not in the right format, but it's close enogh for me)

In tapply(), you can subset by more than one argument, using list().