Complement a DNA sequence

11.6k views Asked by At

Suppose I have a DNA sequence. I want to get the complement of it. I used the following code but I am not getting it. What am I doing wrong ?

s=readline()
ATCTCGGCGCGCATCGCGTACGCTACTAGC
p=unlist(strsplit(s,""))
h=rep("N",nchar(s))
unlist(lapply(p,function(d){
for b in (1:nchar(s)) {    
    if (p[b]=="A") h[b]="T"
    if (p[b]=="T") h[b]="A"
    if (p[b]=="G") h[b]="C"
    if (p[b]=="C") h[b]="G"
}
7

There are 7 answers

5
IRTFM On
sapply(p, switch,  "A"="T", "T"="A","G"="C","C"="G")
  A   T   C   T   C   G   G   C   G   C   G   C   A   T   C   G   C   G   T 
"T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" "A" 
  A   C   G   C   T   A   C   T   A   G   C 
"T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G" 

If you do not want the complementary names, you can always strip them with unname.

unname(sapply(p, switch,  "A"="T", "T"="A","G"="C","C"="G") )
 [1] "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C"
[19] "A" "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"
> 
0
Martin Morgan On

The Bioconductor package Biostrings has many useful functions for this sort of operation. Install once:

source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")

then use

library(Biostrings)
dna = DNAStringSet(c("ATCTCGGCGCGCATCGCGTACGCTACTAGC", "ACCGCTA"))
complement(dna)
0
Spacedman On

Use chartr which is built for this purpose:

> s
[1] "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
> chartr("ATGC","TACG",s)
[1] "TAGAGCCGCGCGTAGCGCATGCGATGATCG"

Just give it two equal-length character strings and your string. Also vectorised over the argument for translation:

> chartr("ATGC","TACG",c("AAAACG","TTTTT"))
[1] "TTTTGC" "AAAAA" 

Note I'm doing the replacement on the string representation of the DNA rather than the vector. To convert the vector I'd create a lookup-map as a named vector and index that:

> p
 [1] "A" "T" "C" "T" "C" "G" "G" "C" "G" "C" "G" "C" "A" "T" "C" "G" "C" "G" "T"
[20] "A" "C" "G" "C" "T" "A" "C" "T" "A" "G" "C"
> map=c("A"="T", "T"="A","G"="C","C"="G")
> unname(map[p])
 [1] "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" "A"
[20] "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"
0
pedrosaurio On

Here a answer using base r. Written with a horrible formatting to make things clear and to keep it as a one-liner. It supports upper and lower cases.

revc = function(s){
       paste0(
           rev(
            unlist(
             strsplit(
                chartr("ATGCatgc","TACGtacg",s)
                      , "")                        # from strsplit
                   )                               # from unlist
               )                                   # from rev
             , collapse='')                        # from paste0
       }
4
JeremyS On

There is also a package seqinr

library(seqinr)
comp(seq) # gives complement
rev(comp(seq)) # gives the reverse complement

Biostrings has a much smaller memory profile, but seqinr is nice also because you can choose the case of the bases (including mixed) and change them to anything you want, for example if you want a mix of T and U in the same sequence. Biostrings forces you to have either T or U.

0
Tom Kelly On

I've generalised the solution rev(comp(seq)) with the seqinr package:

install.packages("devtools")
devtools::install_github("TomKellyGenetics/tktools")
tktools::revcomp(seq)

This version is compatible with string inputs and is vectorised to handle list or vector input of multiple strings. The output class should match the input, including cases and types. This also support inputs containing "U" for RNA and RNA output sequences.

> seq <- "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
> revcomp(seq)
[1] "GCTAGTAGCGTACGCGATGCGCGCCGAGAT"

> seq <- c("TATAAT", "TTTCGC", "atgcat")
> revcomp(seq)
  TATAAT   TTTCGC   atgcat 
 "ATTATA" "GCGAAA" "atgcat" 

See the manual or the TomKellyGenetics/tktools github package repository.

0
Megatron On

To complement, in both upper and lower case, you can use chartr():

n <- "ACCTGccatGCATC"
chartr("acgtACGT", "tgcaTGCA", n)
# [1] "TGGACggtaCGTAG"

To take it a step further and reverse complement the nucleotide sequence, you can use the following function:

library(stringi)

rc <- function(nucSeq)
  return(stri_reverse(chartr("acgtACGT", "tgcaTGCA", nucSeq)))

rc("AcACGTgtT")
# [1] "AacACGTgT"