Bcbio-gff File creation issue


When I create a file using GFF.write(), I get an extra line with "annotation" as the source and "remark" as the type, followed by a percent-encoded (URL-encoded) dump of the sequence regions:

##gff-version 3
##sequence-region NC_011594.1 1 16779
NC_011594.1 annotation  remark  1   16779   .   .   .   gff-version=3;sequence-region=%28%27NC_011594.1%27%2C 0%2C 16971%29,%28%27NC_042493.1%27%2C 0%2C 132544852%29, (continues on and on)
NC_011594.1 RefSeq  gene    1   1531    .   +   .   Dbxref=GeneID:7055888;ID=gene-COX1;Name=COX1;gbkey=Gene;gene=COX1;gene_biotype=protein_coding

Any idea why it's there, what it's for, and how I could avoid it? I fear it might become a problem when the file is used in third-party software.

I imported only the bcbio-gff package, but I believe it's part of Biopython, link: https://biopython.org/wiki/GFF_Parsing

Best answer, by Marek Schwarz

To your first question - "Why is it there?"

  • I can only presume that the package author wanted to export as much information as possible by default.
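The attribute value on the remark line is percent-encoded text. A minimal sketch, using only the standard library and a fragment copied from the question's output, shows that it decodes to what looks like a Python tuple of (sequence ID, start, end). That suggests, though it does not prove, that the writer is serializing the record's annotations dictionary:

```python
from urllib.parse import unquote

# Fragment copied from the "annotation remark" line in the question.
encoded = "%28%27NC_011594.1%27%2C 0%2C 16971%29"

# Percent-decoding turns %28 %27 %2C %29 back into ( ' , )
decoded = unquote(encoded)
print(decoded)  # ('NC_011594.1', 0, 16971)
```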

To your next question - "How can I avoid it?"

  • Unfortunately, there is no off switch. The solution for me was to remove all annotations from the exported sequences, i.e. set the annotations attribute to an empty dictionary before calling GFF.write().

Example:

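from Bio import SeqIO
from BCBio import GFF

# Read a single GenBank record
g = SeqIO.read('NC_003888.3.gb', 'gb')

# Drop all record-level annotations so that GFF.write() has nothing to put in an "annotation remark" line
g.annotations = {}

with open('t2.gff', 'w') as f:
    GFF.write([g], f)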

Head of the output file, with no "annotation remark" line:

head t2.gff 
##gff-version 3
##sequence-region NC_003888.3 1 8667507
NC_003888.3 feature source  1   8667507 ... removed for clarity ....
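If a file containing remark lines has already been written, filtering it afterwards is another option. This is only a sketch, assuming the unwanted lines are exactly those whose second and third tab-separated columns are "annotation" and "remark", as in the question's output. The helper name strip_annotation_remarks is hypothetical and not part of bcbio-gff, and the sample lines are shortened from the question:

```python
def strip_annotation_remarks(lines):
    """Yield GFF lines, skipping feature lines with source 'annotation' and type 'remark'."""
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Directives (##...) and comments have fewer than three columns and pass through.
        if len(cols) >= 3 and cols[1] == "annotation" and cols[2] == "remark":
            continue
        yield line


# Shortened sample based on the output shown in the question.
sample = [
    "##gff-version 3\n",
    "NC_011594.1\tannotation\tremark\t1\t16779\t.\t.\t.\tgff-version=3\n",
    "NC_011594.1\tRefSeq\tgene\t1\t1531\t.\t+\t.\tID=gene-COX1\n",
]
kept = list(strip_annotation_remarks(sample))
print("".join(kept), end="")
```

For a real file, the same generator can be fed an open file handle and its output passed to writelines() on the destination file.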