How to parse a UniProt DAT file to retrieve GO terms in Python?


I have tried Biopython's SeqIO and other parsers, but I couldn't find a good tool for parsing DAT files.

https://omics.pnl.gov/software/uniprot-dat-file-parser

I tried this one, but it doesn't provide any gene annotations.

http://biopython.org/wiki/SeqIO

The documentation mostly covers reading FASTA input, not DAT files:

from Bio import SeqIO

# Print the ID and sequence length of every record in a FASTA file
for record in SeqIO.parse("Fasta/f002", "fasta"):
    print("%s %i" % (record.id, len(record)))

There are 2 answers

Answer by Christian Ebeling

Dear Muhammad Zeeshan,

You can use the query functions of the Python library pyuniprot to get the sequence (or many other things).

Install it (with pip or git clone) and update its database. Find out which taxonomy identifiers match your organisms; the example here uses human, mouse, and rat. Don't run a full update for all organisms, as it takes a very long time.

pyuniprot.update(taxids=[9606, 10090, 10116])

Then use the following Python code, assuming 1433E_HUMAN and A4_HUMAN are the identifiers of interest:

import pyuniprot
query = pyuniprot.query() 
entries = query.entry(name=('1433E_HUMAN', 'A4_HUMAN'))  
seqs = [x.sequence.sequence for x in entries]
Answer by Peter Cock

Those look like what Biopython calls "swiss" format, the plain text format used at SwissProt prior to it being called UniProt. Try:

from Bio import SeqIO

# "swiss" is Biopython's name for the UniProt/SwissProt plain text format
for record in SeqIO.parse("example.dat", "swiss"):
    print("%s %i" % (record.id, len(record)))

See also the table for formats at http://biopython.org/wiki/SeqIO
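
If the GO annotations are all you need, they can also be read straight from the flat file, since each one sits on its own "DR   GO; ..." cross-reference line and every entry ends with "//". The following is only a minimal sketch using the standard library, not part of Biopython or pyuniprot: the function name go_terms_by_entry and the DEMO1_HUMAN sample entry are made up for illustration, and it assumes the usual four-field layout of GO lines (database, GO ID, aspect:term, evidence:source).

```python
import io
from collections import defaultdict


def go_terms_by_entry(handle):
    """Map each entry name (from the ID line) to its GO cross-references.

    Hypothetical helper: returns {entry_name: [(go_id, term, evidence), ...]}.
    """
    result = defaultdict(list)
    entry_name = None
    for line in handle:
        # Flat-file lines start with a two-letter code; data begins at column 6.
        code, payload = line[:2], line[5:].strip()
        if code == "ID":
            entry_name = payload.split()[0]
        elif code == "DR" and payload.startswith("GO;") and entry_name:
            # e.g. "GO; GO:0005737; C:cytoplasm; IDA:UniProtKB."
            fields = payload.rstrip(".").split("; ")
            if len(fields) >= 4:
                result[entry_name].append((fields[1], fields[2], fields[3]))
        elif code == "//":
            # End of the current entry
            entry_name = None
    return dict(result)


# Made-up excerpt in UniProt flat-file layout, for illustration only
SAMPLE = """\
ID   DEMO1_HUMAN             Reviewed;         255 AA.
AC   X00001;
DR   EMBL; U00000; AAA00000.1; -; mRNA.
DR   GO; GO:0005737; C:cytoplasm; IDA:UniProtKB.
DR   GO; GO:0005515; F:protein binding; IPI:IntAct.
//
"""

go_terms = go_terms_by_entry(io.StringIO(SAMPLE))
```

For a real file, pass an open handle instead of the sample, for example go_terms_by_entry(open("example.dat")). The leading letter of each term (C, F or P) is the GO aspect: cellular component, molecular function or biological process.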