Python: Extract DNA sequence from FASTA file using Bed file

Question

Python: Extract DNA sequence from FASTA file using Bed file

4.3k views Asked by Allyson At 06 January 2025 at 13:01

May I know how can I extract dna sequence from fasta file? I tried bedtools and samtools. Bedtools getfasta did well but for some of my file return "warning: chromosome was not found in fasta file" but the fact is the chromosome name in bed file and fasta are exactly the same. I'm looking for other alternative that python can do this task for me.

Bed file:
chr1:117223140-117223856 3 7
chr1:117223140-117223856 5 9

Fasta file:
>chr1:117223140-117223856
CGCGTGGGCTAGGGGCTAGCCCC

Desired output:
>chr1:117223140-117223856
CGTGG
>chr1:117223140-117223856
TGGGC

Original Q&A

There are 3 answers

Emanuel Langit On 05 February 2016 at 01:20

Your bed file needs to be tab-delimited for bedtools to use it. Replace your colons, dashes, and spaces with a tab.

The BedTools doc page says "bedtools requires that all BED input files (and input received from stdin) are tab-delimited." BedTools.

Jose Ricardo Bustos M. On 28 May 2015 at 10:58

try, with:

from Bio import SeqIO

#I use RAM, and to store fasta in dictionary
parser = SeqIO.parse(open("input.fasta")
dict_fasta = dict([(seq.id, seq) for seq in parser, "fasta")])

output = open("output.fasta", "w")
for line in open("input.bed"):
  id, begin, end = line.split()
  if id in dict_fasta:
    #[int(begin)-1:int(end)] if the first base in a chromosome is numbered 1
    #[int(begin):int(end)+1] if the first base in a chromosome is numbered 0
    output.write(dict_fasta[id][int(begin)-1:int(end)].format("fasta"))
  else:
    print id + " don't found"

output.close()

you get, first base in a chromosome is numbered 1:

>chr1:117223140-117223856
CGTGG
>chr1:117223140-117223856
TGGGC

you get, first base in a chromosome is numbered 0:

>chr1:117223140-117223856
GTGGG
>chr1:117223140-117223856
GGGCT

**BioGeek** · Accepted Answer · 2015-05-28T10:56:49+00:00

BioPython is what you want to use:

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from collections import defaultdict

# read names and postions from bed file
positions = defaultdict(list)
with open('positions.bed') as f:
    for line in f:
        name, start, stop = line.split()
        positions[name].append((int(start), int(stop)))

# parse faste file and turn into dictionary
records = SeqIO.to_dict(SeqIO.parse(open('sequences.fasta'), 'fasta'))

# search for short sequences
short_seq_records = []
for name in positions:
    for (start, stop) in positions[name]:
        long_seq_record = records[name]
        long_seq = long_seq_record.seq
        alphabet = long_seq.alphabet
        short_seq = str(long_seq)[start-1:stop]
        short_seq_record = SeqRecord(Seq(short_seq, alphabet), id=name, description='')
        short_seq_records.append(short_seq_record)

# write to file
with open('output.fasta', 'w') as f:
    SeqIO.write(short_seq_records, f, 'fasta')

TechQA.

Python: Extract DNA sequence from FASTA file using Bed file

There are 3 answers

Related Questions in PYTHON

Related Questions in BIOINFORMATICS

Related Questions in BIOPYTHON

Related Questions in FASTA

Related Questions in BED

Popular Questions

Popular Tags

Trending Questions