I have a long string of DNA sequence, and I need to find regions consist of two palindromic sequences flanking a spacer sequence.
The input is:
cgtacacgagtagtcgtagctgtcagtcgatcgtacgtacgtagctgctgtagcactatcgaccccacacgtgtgtacacgatgcacagtcgtctatcacatgctagcgctgcccgtacgGATGGCCAAGGCCATCcgatcgctagctagcgccgcgcgtagcccgatcgagacatgctagcagttgtgctgatgtcgagatagctgtgatgcgatgctagcgccgcctagccgcctcgtgtaggctggatgcgatcgatcgatgctagcggcgcgatcgatgcactagccgtagcgctagctgatcgatcgtaGATGGCCAAGGCCATCcgcgtagatacgacatgccgggggtatataa
This is my code:
use strict;
use warnings;
my $input= $ARGV[0];
chomp $input;
open (my $fh_in, "<", $input) or die "Cannot open file $input $!";
my $dna= <$fh_in>;
chomp $dna;
#######################################################################################
if ($dna=~ /[^ACGT]/gi) {
print "This is not a valid DNA sequence, it has unknown base(s)\n";
}
$dna=~ tr/[acgt]/[ACGT]/;
######################################################################################
print "Minimum length of palindromic sequence?\n";
my $min= <STDIN>;
chomp $min;
print "Maximum length of palindromic sequence?\n";
my $max= <STDIN>;
chomp $max;
print "Minimum length of spacer region?\n";
my $min_spacer= <STDIN>;
chomp $min_spacer;
print "Maximum length of spacer region?\n";
my $max_spacer= <STDIN>;
chomp $max_spacer;
######################################################################################
my $dna_length= length($dna);
my ($length , $offset , $string_1 , $string_2);
for ($offset= 0 ; $offset <= $dna_length-$max-$max-$max_spacer ; $offset++) {
for ($length= $min ; $length <= $max ; $length++) {
$string_1= substr ($dna, $offset, $length);
$string_2= reverse $string_1;
$string_2=~ tr/[ACGT]/[TGCA]/;
if ($dna=~ /(($string_1)([ACGT]{$min_spacer,$max_spacer})($string_2))/) {
print "IR starts at $offset => $2***$3***$4\n$1\n\n";
}
}
}
With parameters: $min = 6 , $max = 12 , $min_spacer = 4 , $max_spacer = 12 The output I get is:
IR starts at 26 => TCGATCG***ATGCTAGCGGCG***CGATCGA
TCGATCGATGCTAGCGGCGCGATCGA
IR starts at 27 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG
IR starts at 118 => CGGATG***GCCAAGGC***CATCCG
CGGATGGCCAAGGCCATCCG
IR starts at 118 => CGGATGG***CCAAGG***CCATCCG
CGGATGGCCAAGGCCATCCG
IR starts at 118 => CGGATGGC***CAAG***GCCATCCG
CGGATGGCCAAGGCCATCCG
IR starts at 119 => GGATGG***CCAAGG***CCATCC
GGATGGCCAAGGCCATCC
IR starts at 119 => GGATGGC***CAAG***GCCATCC
GGATGGCCAAGGCCATCC
IR starts at 120 => GATGGC***CAAG***GCCATC
GATGGCCAAGGCCATC
IR starts at 136 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG
IR starts at 164 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG
IR starts at 252 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG
IR starts at 254 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT
IR starts at 254 => ATCGATCG***ATGCTAGCGGCG***CGATCGAT
ATCGATCGATGCTAGCGGCGCGATCGAT
IR starts at 255 => TCGATCG***ATGCTAGCGGCG***CGATCGA
TCGATCGATGCTAGCGGCGCGATCGA
IR starts at 256 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG
IR starts at 258 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT
IR starts at 274 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG
IR starts at 276 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT
IR starts at 304 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT
IR starts at 304 => ATCGATCG***ATGCTAGCGGCG***CGATCGAT
ATCGATCGATGCTAGCGGCGCGATCGAT
IR starts at 305 => TCGATCG***ATGCTAGCGGCG***CGATCGA
TCGATCGATGCTAGCGGCGCGATCGA
IR starts at 306 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG
IR starts at 314 => GATGGC***CAAG***GCCATC
GATGGCCAAGGCCATC
However when I check the region of my first hit (highlighted in bold in input), the offset of this hit doesn't seem to be at position 26. Could anyone enlighten me what's wrong with my code? Thanks.
Your problem is your regex is looking for a palindromes anywhere in the sequence, not just at the location of the offset. "ATCGATCG" occurs at offset 26, so it matches. You need to add some positional information to your regex. Try something like