Protein Sequence Alignment from Protein Databank to Cosmic or Uniprot

788 views Asked by At

I would like to match up PDB files from the Protein Databank to canonical AA sequences for the protein as displayed in Cosmic or Uniprot. Specifically, what I need to do is pull from the pdb file, the carbon alpha atoms in the backbone and their xyz positions. I also need to pull their actual order in the proteins sequence. For structure 3GFT (Kras - Uniprot Accession Number P01116), this is easy, I can just take the ResSeq number. However, for some other proteins, I can't figure out how this is possible.

For example, for structure (2ZHQ) (protein F2 - Uniprot Accession Number P00734), the Seqres has the ResSeq numbers repeated for numbers "1" and "14" and only differs in the Icode entry. Further the icode entries are not in lexographic order so it's hard to tell what order to extract.

It get's even worse if you consider structure 3V5Q (Uniprot Accession Number Q16288). For most of the protein, the ResSeq number matches the actual amino acid from a source like COSMIC or UNIPROT. Howver after Position 711, it jumps to position 730. When looking at REMARK 465 (the missing atoms), it shows that for chain A , 726-729 are missing. However after matching it up to the protein, those AA actually are 712-715.

I've attached code that works fro the simple 3GFT example but if someone is an expert in pdb files and can help me get the rest of it figured out, I would be much obliged.

library(gdata)

#answer<- get.positions("http://www.pdb.org/pdb/files/2ZHQ.pdb","L")
answer<- get.positions("http://www.pdb.org/pdb/files/3GFT.pdb","A")


#This function reads a pdb file and returns the appropriate data structure
get.positions <- function(sourcefile, chain_required = "A"){
N <- 10^5
AACount <- 0
positions = data.frame(Residue=rep(NA, N),AtomCount=rep(0, N),SideChain=rep(NA, N),XCoord=rep(0, N),YCoord=rep(0, N),ZCoord=rep(0, N),stringsAsFactors=FALSE)     

visited = list()
filedata <- readLines(sourcefile, n= -1)
for(i in 1: length(filedata)){
input = filedata[i]
id = substr(input,1,4)
if(id == "ATOM"){
  type = substr(input,14,15)
  if(type == "CA"){
    resSerial = strtoi(substr(input, 7,11))
    residue = substr(input,18,20)
    type_of_chain = substr(input,22,22)
    resSeq = strtoi(substr(input, 23,26))
    altLoc = substr(input,17,17)

    if(resSeq >=1){ #does not include negative residues
      if(type_of_chain == chain_required && !(resSerial %in% visited)  && (altLoc == " " || altLoc == "A") ){
        visited <- c(visited, resSerial)
        AACount <- AACount + 1
        position_string =list()
        position_string[[1]]= as.numeric(substr(input,31,38))
        position_string[[2]] =as.numeric(substr(input,39,46))
        position_string[[3]] =as.numeric(substr(input,47,54))
        #print(input)
        positions[AACount,]<- c(residue, resSeq, type_of_chain, position_string[[1]], position_string[[2]], position_string[[3]])
      }
    }
  }
}

  } 
  positions<-positions[1:AACount,]
  positions[,2]<- as.numeric(positions[,2])
  positions[,4]<- as.numeric(positions[,4])
  positions[,5]<- as.numeric(positions[,5])
  positions[,6]<- as.numeric(positions[,6])
  return (positions)
}
2

There are 2 answers

1
Jerven On

You might want to move this question to www.biostars.org and write to [email protected] (you do know that these sequences are already linked at a database level right?) In any case when writing to [email protected] ask for Jules Jacobsen as he is the resident UniProt expert on linking PDB structures to uniprot.org canonical sequences.

1
baz On

Here is one way. It requires the bio3d R package ( http://thegrantlab.org/bio3d/ ) and the muscle alignment executable be installed. I followed the instructions for 'Additional utilities' here http://thegrantlab.org/bio3d/tutorials/installing-bio3d

The example code below appears to work for the three cases you listed.

library(bio3d) 

## Download PDB file with given 'id' (can also read from online)
id <-  "3GFT" #"3V5Q"
file.name <- get.pdb(id)
pdb <- read.pdb(file.name)
pdb.seq <- pdbseq(pdb, atom.select(pdb, chain="A", elety="CA"))

## Get UniProt identifier and sequence
pdb.ano <- pdb.annotate(id, "db_id")
uni.seq <- get.seq(pdb.ano)

## Align sequences to define corespondences
aln <- seqaln( seqbind( pdb.seq, uni.seq), id=c(file.name, unlist(pdb.ano)) )

## Read aligned coordinate data (all the info you want is in here)
pdbs <- read.fasta.pdb(aln)

answer2 <- cbind( 1:ncol(pdbs$ali), t(pdbs$ali), 
            pdbs$resno[1,], matrix(pdbs$xyz[1,], ncol=3, byrow=T) ) 

head(answer2)

[1,] "1" "M"        "M"    "1" "62.935" "97.579" "30.223"
[2,] "2" "T"        "T"    "2" "63.155" "95.525" "27.079"
[3,] "3" "E"        "E"    "3" "65.289" "96.895" "24.308"
[4,] "4" "Y"        "Y"    "4" "64.899" "96.22"  "20.615"
[5,] "5" "K"        "K"    "5" "67.593" "96.715" "18.023"
[6,] "6" "L"        "L"    "6" "65.898" "97.863" "14.816"

There is a aa321() function in bio3d if you want your amino acids listed in 3 letter code.