I have a big data.frame
with genomic data.
The data looks like this-
colnames(df)=c("id","chr","start","end","log2")
where id is the sample name, chr is the number of the chromosome, start and end give me the location on the chromosome, and log2 is how high/low was the read in that position.
Because there is a lot of data, and it's hard to understand what's going on, I'm trying to go over each sample (id), and for each chromosome (chr) I want to calculate the median of log2 in segments, let say all the reads that are between 1 to 10^7, 1+10^7 to 2*10^7 and so on.
the result should be a new data.frame
, for each sample and each chromosome I should have several rows, with start and end indicating what segment I am in, and the last value will be the median of that segment.
I think I need to use tapply()
and go over samples, and in it tapply()
and go over the chromosomes, and then in each chromosome, a loop over "start" position? (let say I care only if the start coordinate is in the range)
Not sure exactly how to approach this.
Any hints, tips, directions will be very appreciated.
Reproducible example-
# fabricated data, 4 samples
# 24 chromosomes in each sample
# 61 ranges in each chromosome
df <- data.frame(id = rep(c('F1','F2','M1','M2'), each = 24*61),
chr = rep(rep(c(1:22,'x','y'), each = 61),4),
start = rep(seq(1,25*10^6 - 99, length.out = 61),times = 24*4),
end = rep(seq(100,25*10^6, length.out = 61),times = 24*4),
log2 = rnorm(4*24*61))
# output should look something like this-
id chr start end median_log_2
"F1" "1" 1 8000000 0.002
"F1" "1" 8000001 16000000 0.00089
"F1" "1" 16000001 24000000 -0.0011
"F1" "1" 24000000 25000000 0.108
"F1" "2" 1 8000000 -0.0012
"F1" "2" 8000001 16000000 0.0089
"F1" "2" 16000001 24000000 0.00311
"F1" "2" 24000000 25000000 0.0128
...
...
Did the job. (The output is not in the right format, but it's close enogh for me)
In
tapply()
, you can subset by more than one argument, usinglist()
.