I would like to match PDB files from the Protein Data Bank to the canonical amino acid sequences for the protein as displayed in COSMIC or UniProt. Specifically, I need to pull from the PDB file the alpha carbon (CA) atoms in the backbone and their xyz positions. I also need their actual position in the protein's sequence. For structure 3GFT (KRAS, UniProt accession number P01116), this is easy: I can just take the ResSeq number. However, for some other proteins, I can't figure out how to do this.
For example, for structure 2ZHQ (protein F2, UniProt accession number P00734), the ATOM records repeat the ResSeq numbers "1" and "14", and the repeated entries differ only in the iCode (insertion code) field. Furthermore, the iCode entries are not in lexicographic order, so it's hard to tell in what order to extract the residues.
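To make the insertion-code issue concrete, here is a minimal sketch of reading ResSeq (columns 23-26) together with iCode (column 27) and keeping the records in file order instead of sorting them. It assumes that the order of ATOM records in the file follows the order of residues along the chain, which is a convention rather than something checked here. The sample records are made up (hypothetical residues and coordinates, not copied from 2ZHQ), and `make_atom` and `parse_ca` are hypothetical helper names.

```r
# Build made-up ATOM records with the fixed-column PDB layout
# (serial 7-11, atom name 13-16, altLoc 17, resName 18-20, chain 22,
#  resSeq 23-26, iCode 27, x 31-38, y 39-46, z 47-54).
make_atom <- function(serial, res_name, chain, res_seq, i_code, x, y, z) {
  sprintf("ATOM  %5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f",
          serial, " CA", " ", res_name, chain, res_seq, i_code, x, y, z)
}

# Hypothetical data: three residues share ResSeq 1 and differ only in iCode.
sample_lines <- c(
  make_atom(1L, "GLY", "L", 1L, "B", 1.0, 2.0, 3.0),
  make_atom(2L, "GLU", "L", 1L, "A", 2.0, 3.0, 4.0),
  make_atom(3L, "ALA", "L", 1L, " ", 3.0, 4.0, 5.0),
  make_atom(4L, "ASP", "L", 2L, " ", 4.0, 5.0, 6.0)
)

# Extract CA atoms of one chain, keeping the first alternate location only.
parse_ca <- function(lines, chain = "A") {
  is_ca <- substr(lines, 1, 6) == "ATOM  " &
    substr(lines, 13, 16) == " CA " &
    substr(lines, 22, 22) == chain &
    substr(lines, 17, 17) %in% c(" ", "A")
  keep <- lines[is_ca]
  data.frame(
    Residue = substr(keep, 18, 20),
    ResSeq  = as.integer(substr(keep, 23, 26)),
    ICode   = substr(keep, 27, 27),
    X       = as.numeric(substr(keep, 31, 38)),
    Y       = as.numeric(substr(keep, 39, 46)),
    Z       = as.numeric(substr(keep, 47, 54)),
    stringsAsFactors = FALSE
  )
}

ca <- parse_ca(sample_lines, chain = "L")
# A residue is identified by ResSeq plus iCode, not by ResSeq alone.
ca$Label <- paste0(ca$ResSeq, trimws(ca$ICode))
# Position within the observed chain, taken from file order (assumption).
ca$Index <- seq_len(nrow(ca))
print(ca[, c("Index", "Label", "Residue")])
```

`Index` here is only the position among the residues present in the file; it is not the UniProt position, since unresolved residues are not counted.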
It gets even worse if you consider structure 3V5Q (UniProt accession number Q16288). For most of the protein, the ResSeq number matches the actual amino acid position from a source like COSMIC or UniProt. However, after position 711 it jumps to position 730. REMARK 465 (the missing residues) shows that for chain A, residues 726-729 are missing. However, after matching the structure up to the protein sequence, those amino acids are actually 712-715.
I've attached code that works for the simple 3GFT example, but if someone is an expert in PDB files and can help me figure out the rest, I would be much obliged.
library(gdata)

# Read a PDB file and return one row per alpha carbon (CA) in the requested chain.
# Note: the AtomCount column holds the ResSeq number and SideChain holds the chain ID.
get.positions <- function(sourcefile, chain_required = "A"){
    N <- 10^5  # preallocated upper bound on the number of residues
    AACount <- 0
    positions <- data.frame(Residue=rep(NA, N), AtomCount=rep(0, N), SideChain=rep(NA, N), XCoord=rep(0, N), YCoord=rep(0, N), ZCoord=rep(0, N), stringsAsFactors=FALSE)
    visited <- integer(0)  # atom serial numbers already recorded
    filedata <- readLines(sourcefile, n = -1)
    for(i in seq_along(filedata)){
        input <- filedata[i]
        id <- substr(input, 1, 4)
        if(id == "ATOM"){
            type <- substr(input, 14, 15)
            if(type == "CA"){
                resSerial <- strtoi(substr(input, 7, 11))
                residue <- substr(input, 18, 20)
                type_of_chain <- substr(input, 22, 22)
                resSeq <- strtoi(substr(input, 23, 26))
                altLoc <- substr(input, 17, 17)
                # Skip residues numbered below 1 (and any unparsable ResSeq)
                if(!is.na(resSeq) && resSeq >= 1){
                    # Keep the requested chain, and only the first alternate location
                    if(type_of_chain == chain_required && !(resSerial %in% visited) && (altLoc == " " || altLoc == "A")){
                        visited <- c(visited, resSerial)
                        AACount <- AACount + 1
                        x <- as.numeric(substr(input, 31, 38))
                        y <- as.numeric(substr(input, 39, 46))
                        z <- as.numeric(substr(input, 47, 54))
                        # c() coerces everything to character; numeric columns are restored below
                        positions[AACount,] <- c(residue, resSeq, type_of_chain, x, y, z)
                    }
                }
            }
        }
    }
    # seq_len() also handles the case where no CA atoms were found
    positions <- positions[seq_len(AACount),]
    positions[,2] <- as.numeric(positions[,2])
    positions[,4] <- as.numeric(positions[,4])
    positions[,5] <- as.numeric(positions[,5])
    positions[,6] <- as.numeric(positions[,6])
    return(positions)
}

# The function must be defined before it is called
#answer <- get.positions("http://www.pdb.org/pdb/files/2ZHQ.pdb", "L")
answer <- get.positions("http://www.pdb.org/pdb/files/3GFT.pdb", "A")
You might want to move this question to www.biostars.org and write to [email protected] (you do know that these sequences are already linked at the database level, right?). In any case, when writing to [email protected], ask for Jules Jacobsen, as he is the resident UniProt expert on linking PDB structures to uniprot.org canonical sequences.
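One place where the database-level link shows up inside the PDB file itself is the DBREF record, which gives, per chain, the first and last residue number in both the PDB numbering and the reference database numbering. Below is a rough sketch of reading it in R, assuming the fixed-column layout of the PDB format. The sample record is made up for illustration (the ID `1ABC`, accession `P00000` and name `DEMO_HUMAN` are placeholders, not real entries), and `parse_dbref` is a hypothetical helper name.

```r
# Made-up DBREF record in the fixed-column layout:
# chain 13, seqBegin 15-18, seqEnd 21-24, database 27-32,
# dbAccession 34-41, dbseqBegin 56-60, dbseqEnd 63-67.
sample_dbref <- sprintf(
  "DBREF  %4s %1s %4d%1s %4d%1s %-6s %-8s %-12s %5d%1s %5d%1s",
  "1ABC", "A", 10L, " ", 50L, " ", "UNP", "P00000", "DEMO_HUMAN",
  110L, " ", 150L, " "
)

parse_dbref <- function(lines) {
  # "DBREF " with a trailing space excludes DBREF1/DBREF2 continuation records
  d <- lines[substr(lines, 1, 6) == "DBREF "]
  data.frame(
    Chain     = substr(d, 13, 13),
    PdbBegin  = as.integer(substr(d, 15, 18)),
    PdbEnd    = as.integer(substr(d, 21, 24)),
    Database  = trimws(substr(d, 27, 32)),
    Accession = trimws(substr(d, 34, 41)),
    DbBegin   = as.integer(substr(d, 56, 60)),
    DbEnd     = as.integer(substr(d, 63, 67)),
    stringsAsFactors = FALSE
  )
}

ref <- parse_dbref(sample_dbref)
# Offset between the two numbering schemes at the start of the range.
# Only meaningful if the range has no insertion codes or numbering jumps.
ref$Offset <- ref$DbBegin - ref$PdbBegin
print(ref)
```

DBREF only gives the endpoints of each range, so on its own it will not resolve cases like insertion codes or numbering jumps inside a chain. For those, a residue-level mapping such as the one provided by SIFTS is probably the better source.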