May I know how can I extract dna sequence from fasta file? I tried bedtools and samtools. Bedtools getfasta did well but for some of my file return "warning: chromosome was not found in fasta file" but the fact is the chromosome name in bed file and fasta are exactly the same. I'm looking for other alternative that python can do this task for me.
Bed file:
chr1:117223140-117223856 3 7
chr1:117223140-117223856 5 9
Fasta file:
>chr1:117223140-117223856
CGCGTGGGCTAGGGGCTAGCCCC
Desired output:
>chr1:117223140-117223856
CGTGG
>chr1:117223140-117223856
TGGGC
Python: Extract DNA sequence from FASTA file using Bed file
4.3k views Asked by Allyson At
3
There are 3 answers
0
On
Your bed file needs to be tab-delimited for bedtools to use it. Replace your colons, dashes, and spaces with a tab.
The BedTools doc page says "bedtools requires that all BED input files (and input received from stdin) are tab-delimited." BedTools.
0
On
try, with:
from Bio import SeqIO
#I use RAM, and to store fasta in dictionary
parser = SeqIO.parse(open("input.fasta")
dict_fasta = dict([(seq.id, seq) for seq in parser, "fasta")])
output = open("output.fasta", "w")
for line in open("input.bed"):
id, begin, end = line.split()
if id in dict_fasta:
#[int(begin)-1:int(end)] if the first base in a chromosome is numbered 1
#[int(begin):int(end)+1] if the first base in a chromosome is numbered 0
output.write(dict_fasta[id][int(begin)-1:int(end)].format("fasta"))
else:
print id + " don't found"
output.close()
you get, first base in a chromosome is numbered 1:
>chr1:117223140-117223856 CGTGG >chr1:117223140-117223856 TGGGC
you get, first base in a chromosome is numbered 0:
>chr1:117223140-117223856 GTGGG >chr1:117223140-117223856 GGGCT
BioPython
is what you want to use: