Python: Extract DNA sequence from FASTA file using Bed file

4.3k views Asked by At

May I know how can I extract dna sequence from fasta file? I tried bedtools and samtools. Bedtools getfasta did well but for some of my file return "warning: chromosome was not found in fasta file" but the fact is the chromosome name in bed file and fasta are exactly the same. I'm looking for other alternative that python can do this task for me.

Bed file:
chr1:117223140-117223856 3 7
chr1:117223140-117223856 5 9

Fasta file:
>chr1:117223140-117223856
CGCGTGGGCTAGGGGCTAGCCCC

Desired output:
>chr1:117223140-117223856
CGTGG
>chr1:117223140-117223856
TGGGC

3

There are 3 answers

1
BioGeek On BEST ANSWER

BioPython is what you want to use:

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from collections import defaultdict

# read names and postions from bed file
positions = defaultdict(list)
with open('positions.bed') as f:
    for line in f:
        name, start, stop = line.split()
        positions[name].append((int(start), int(stop)))

# parse faste file and turn into dictionary
records = SeqIO.to_dict(SeqIO.parse(open('sequences.fasta'), 'fasta'))

# search for short sequences
short_seq_records = []
for name in positions:
    for (start, stop) in positions[name]:
        long_seq_record = records[name]
        long_seq = long_seq_record.seq
        alphabet = long_seq.alphabet
        short_seq = str(long_seq)[start-1:stop]
        short_seq_record = SeqRecord(Seq(short_seq, alphabet), id=name, description='')
        short_seq_records.append(short_seq_record)

# write to file
with open('output.fasta', 'w') as f:
    SeqIO.write(short_seq_records, f, 'fasta')
0
Emanuel Langit On

Your bed file needs to be tab-delimited for bedtools to use it. Replace your colons, dashes, and spaces with a tab.

The BedTools doc page says "bedtools requires that all BED input files (and input received from stdin) are tab-delimited." BedTools.

0
Jose Ricardo Bustos M. On

try, with:

from Bio import SeqIO

#I use RAM, and to store fasta in dictionary
parser = SeqIO.parse(open("input.fasta")
dict_fasta = dict([(seq.id, seq) for seq in parser, "fasta")])

output = open("output.fasta", "w")
for line in open("input.bed"):
  id, begin, end = line.split()
  if id in dict_fasta:
    #[int(begin)-1:int(end)] if the first base in a chromosome is numbered 1
    #[int(begin):int(end)+1] if the first base in a chromosome is numbered 0
    output.write(dict_fasta[id][int(begin)-1:int(end)].format("fasta"))
  else:
    print id + " don't found"

output.close()

you get, first base in a chromosome is numbered 1:

>chr1:117223140-117223856
CGTGG
>chr1:117223140-117223856
TGGGC

you get, first base in a chromosome is numbered 0:

>chr1:117223140-117223856
GTGGG
>chr1:117223140-117223856
GGGCT