I have sequencing run data (for the four nucleotides) that I extracted from an .ab1 file. I want to fit four multi-peak Gaussian distributions to the data, one per nucleotide. The data is a CSV file with five columns: an index column and four columns holding the reads for the four nucleotides A, T, G and C.
# read.csv already returns a data frame with columns index, A, C, G, T
x=read.csv(file.choose())
# Smooth each nucleotide trace with a Gaussian kernel
smooth1=ksmooth(x$index,x$A,kernel="normal",bandwidth=2)
smooth2=ksmooth(x$index,x$C,kernel="normal",bandwidth=2)
smooth3=ksmooth(x$index,x$G,kernel="normal",bandwidth=2)
smooth4=ksmooth(x$index,x$T,kernel="normal",bandwidth=2)
# First differences of the smoothed traces
dsmooth1=diff(smooth1$y)
dsmooth2=diff(smooth2$y)
dsmooth3=diff(smooth3$y)
dsmooth4=diff(smooth4$y)
# A local maximum is where the slope changes from positive to negative
locmax1<-sign(c(0,dsmooth1))>0 & sign(c(dsmooth1,0))<0
locmax2<-sign(c(0,dsmooth2))>0 & sign(c(dsmooth2,0))<0
locmax3<-sign(c(0,dsmooth3))>0 & sign(c(dsmooth3,0))<0
locmax4<-sign(c(0,dsmooth4))>0 & sign(c(dsmooth4,0))<0
# Plot the raw A trace, overlay the four smoothed traces, and mark the maxima
plot(x$index,x$A,xlim=c(900,950))
lines(smooth1)
lines(smooth2,col="green")
lines(smooth3,col="blue")
lines(smooth4,col="red")
points(smooth1$x[locmax1],smooth1$y[locmax1],cex=3,col=2)
points(smooth2$x[locmax2],smooth2$y[locmax2],cex=3,col=2)
points(smooth3$x[locmax3],smooth3$y[locmax3],cex=3,col=2)
points(smooth4$x[locmax4],smooth4$y[locmax4],cex=3,col=2)
To locate the peaks in the raw data, I also used the following function:
peaks=function(x) {
  # Return the indices of strict local maxima (larger than both neighbours)
  modes=integer(0)
  if ( length(x) < 3 ) {
    return(modes)
  }
  for ( i in 2:(length(x)-1) ){
    if ( (x[i] > x[i-1]) && (x[i] > x[i+1]) ) {
      modes=c(modes,i)
    }
  }
  # An empty result means no interior peak; it can still be used as an index
  return(modes)
}
x$A[peaks(x$A)] #similarly, for T,G and C
Some positions have a peak in more than one trace, and I need to write code that finds the positions where more than one of the Gaussian distributions peaks (i.e. where there is signal from more than one nucleotide). Is there a way to do this in R?
You are essentially fitting mixture models to your data: mixtures of four Gaussians. I suggest you read up on those. There are more sophisticated ways of dealing with these than smoothing and detecting peaks (which may depend heavily on your kernel width - so if you do smooth, you should do some sensitivity analysis and check how your results change with different kernels and kernel widths).
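To illustrate the sensitivity check mentioned above, here is a minimal sketch in base R. The trace is synthetic (three made-up Gaussian peaks plus noise), and `count_peaks` and the list of bandwidths are names and values chosen for illustration, not anything from your data. It reuses the same sign-change test as in the question and simply counts how many maxima survive at each bandwidth.

```r
set.seed(42)

# Synthetic single-channel trace: three Gaussian peaks plus noise (assumed values)
index  <- 1:200
signal <- 100 * dnorm(index, 50, 4) +
           80 * dnorm(index, 100, 4) +
          120 * dnorm(index, 150, 4)
noisy  <- signal + rnorm(length(index), sd = 0.3)

# Count local maxima: slope changes from positive to negative
count_peaks <- function(y) {
  d <- diff(y)
  sum(sign(c(0, d)) > 0 & sign(c(d, 0)) < 0)
}

# Repeat the smoothing for several bandwidths and compare the peak counts
bandwidths <- c(1, 2, 5, 10)
n_peaks <- sapply(bandwidths, function(bw) {
  sm <- ksmooth(index, noisy, kernel = "normal", bandwidth = bw,
                x.points = index)
  count_peaks(sm$y)
})

print(data.frame(bandwidth = bandwidths, n_peaks = n_peaks))
```

If the number (or location) of detected peaks changes a lot across this range, the peak calls are being driven by the smoother rather than by the data.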
The mixtools package for R should be useful.
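As a rough sketch of what that could look like: `normalmixEM` from mixtools fits a Gaussian mixture to a sample of observations, whereas a trace is intensity as a function of position. One workaround, which is an assumption on my part and not the only option, is to treat the rounded intensity at each index as a frequency and expand the trace into a pseudo-sample. The data below is synthetic, and the starting values for `mu` and `sigma` are guesses that in practice could come from your peak detection.

```r
# Synthetic single-channel trace with two Gaussian peaks (made-up values)
index <- 1:60
trace <- 100 * dnorm(index, mean = 20, sd = 2.5) +
          60 * dnorm(index, mean = 40, sd = 2.5)

# Assumption: treat scaled, rounded intensities as counts at each position
counts <- round(trace * 50)
pseudo <- rep(index, times = counts)

# Fit a two-component Gaussian mixture if mixtools is installed
if (requireNamespace("mixtools", quietly = TRUE)) {
  fit <- mixtools::normalmixEM(pseudo, k = 2,
                               mu = c(15, 45), sigma = c(3, 3))
  print(sort(fit$mu))   # estimated peak positions
  print(fit$sigma)      # estimated peak widths
  print(fit$lambda)     # estimated mixing proportions
}
```

The number of components `k` has to be chosen per window of the trace, so for a full run you would probably fit short windows separately rather than the whole channel at once.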