Efficient code to map genotype matrix in R

Question

Efficient code to map genotype matrix in R

160 views Asked by user36302 At 01 September 2017 at 18:34

Hi I want to convert a matrix of genotypes, encoded as triples to a matrix encoded as 0, 1, 2, i.e.

c(0,0,1) <-> 0; c(0,1,0) <-> 1; c(0,0,1) <-> 2

First here is some code to generate the matrix that needs to be reduced.

# generate genotypes
expand.G = function(n,p){
  probs = runif(n = p)
  G012.rows = matrix(rbinom(2,prob = probs,n=n*p),nrow = p)
  colnames(G012.rows) = paste('s',1:n,sep = '')
  rownames(G012.rows) = paste('g',1:p, sep = '')
  G012.cols = t(G012.rows)

  expand.geno = function(g){
    if(g == 0){return(c(1,0,0))}
    if(g == 1){return(c(0,1,0))}
    if(g == 2){return(c(0,0,1))}
  }

  gtype = c()
  for(i in 1:length(c(G012.cols))){
    gtype = c(
      gtype,
      expand.geno(c(G012.cols)[i])
    )
  }

  length(gtype)

  G = matrix(gtype,byrow = T, nrow = p)
  colnames(G) = paste('s',rep(1:n,each = 3),c('1','2','3'),sep = '')
  rownames(G) = paste('g',1:p, sep = '')
  print(G[1:10,1:15])
  print(G012.rows[1:10,1:5])

  return(G)
}

The output has 3n columns and p rows, where n is sample size and p is number of genotypes. Now we can reduce the matrix back to 0,1,2 coding with the following functions

reduce012 = function(x){
  if(identical(x, c(1,0,0))){
    return(0)
  } else if(identical(x, c(0,1,0))){
    return(1)
  } else if(identical(x,  c(0,0,1))){
    return(2)
  } else { 
    return(NA)
  }
}

reduce.G = function(G.gen){
  G.vec = 
    mapply(function(i,j) reduce012(as.numeric(G.gen[i,(3*j-2):(3*j)])), 
           i=expand.grid(1:(ncol(G.gen)/3),1:nrow(G.gen))[,2], 
           j=expand.grid(1:(ncol(G.gen)/3),1:nrow(G.gen))[,1]
    )

  G = matrix(G.vec, nrow = ncol(G.gen)/3, ncol = nrow(G.gen))
  colnames(G) = rownames(G.gen)
  return(G)
}

reduce.G.loop = function(G.gen){
  G = matrix(NA,nrow = ncol(G.gen)/3, ncol = nrow(G.gen))
  for(i in 1:nrow(G.gen)){
    for(j in 1:(ncol(G.gen)/3)){
      G[j,i] = reduce012(as.numeric(G.gen[i,(3*j-2):(3*j)]))
    }
  }
  colnames(G) = rownames(G.gen)
  return(G)
}

The output is n rows by p columns. It is incidental, but intentional, that the matrix encoded as 0,1,2 is the transpose of the matrix encoded as triples.

The code is not particularly fast. What is bothering me is that the the timing goes with n^2. Can you explain or supply more efficient code?

G = expand.G(1000,20)
system.time(reduce.G(G))
system.time(reduce.G.loop(G))

G = expand.G(2000,20)
system.time(reduce.G(G))
system.time(reduce.G.loop(G))

G = expand.G(4000,20)
system.time(reduce.G(G))
system.time(reduce.G.loop(G))

Original Q&A

There are 2 answers

Christoph Wolk On 01 September 2017 at 19:38

This seems to be a about three times faster than your version (renamed reduce.G.orig):

reduce.G <- function(G) {
  varmap = c("100"=0, "010"=1, "001"=2)
  result <- do.call(rbind, lapply(1:(ncol(G)/3)-1, function(val) 
    varmap[paste(G[,3*val+1], G[,3*val+2], G[,3*val+3], sep="")]))
  colnames(result) <- rownames(G)
  result
}

system.time(reduce.G(G))
#   user  system elapsed 
#  0.156   0.000   0.155 

system.time(reduce.G.orig(G))
#   user  system elapsed 
#  0.444   0.000   0.441 

identical(reduce.G(G), reduce.G.orig(G))
# [1] TRUE

**F. Privé** · Accepted Answer · 2017-09-01T20:46:25+00:00

You can simply make an accessor lookup table:

decode <- array(dim = c(3, 3, 3))
decode[cbind(1, 0, 0) + 1] <- 0
decode[cbind(0, 1, 0) + 1] <- 1
decode[cbind(0, 0, 1) + 1] <- 2

And then, just do:

matrix(decode[matrix(t(G + 1), ncol = 3, byrow = TRUE)], ncol = nrow(G))

This full vectorized R version will give you the same matrix, without dimnames and super fast.

Yet, if you have much larger matrices, you should really use Rcpp for both memory and timing issues.

TechQA.

Efficient code to map genotype matrix in R

There are 2 answers

Related Questions in R

Related Questions in PERFORMANCE

Related Questions in MATRIX

Related Questions in GENOME

Popular Questions

Popular Tags

Trending Questions