Finding inverted repeats in DNA sequence

1k views Asked by At

I have a long string of DNA sequence, and I need to find regions consist of two palindromic sequences flanking a spacer sequence.

The input is:

cgtacacgagtagtcgtagctgtcagtcgatcgtacgtacgtagctgctgtagcactatcgaccccacacgtgtgtacacgatgcacagtcgtctatcacatgctagcgctgcccgtacgGATGGCCAAGGCCATCcgatcgctagctagcgccgcgcgtagcccgatcgagacatgctagcagttgtgctgatgtcgagatagctgtgatgcgatgctagcgccgcctagccgcctcgtgtaggctggatgcgatcgatcgatgctagcggcgcgatcgatgcactagccgtagcgctagctgatcgatcgtaGATGGCCAAGGCCATCcgcgtagatacgacatgccgggggtatataa

This is my code:

use strict;
use warnings;

my $input= $ARGV[0];
chomp $input;
open (my $fh_in, "<", $input) or die "Cannot open file $input $!";
my $dna= <$fh_in>;
chomp $dna;

#######################################################################################

if ($dna=~ /[^ACGT]/gi) {
        print "This is not a valid DNA sequence, it has unknown base(s)\n";
}

$dna=~ tr/[acgt]/[ACGT]/;


######################################################################################

print "Minimum length of palindromic sequence?\n";
my $min= <STDIN>;
chomp $min;

print "Maximum length of palindromic sequence?\n";
my $max= <STDIN>;
chomp $max;

print "Minimum length of spacer region?\n";
my $min_spacer= <STDIN>;
chomp $min_spacer;

print "Maximum length of spacer region?\n";
my $max_spacer= <STDIN>;
chomp $max_spacer;

###################################################################################### 

my $dna_length= length($dna);
my ($length , $offset , $string_1 , $string_2);

for ($offset= 0 ; $offset <= $dna_length-$max-$max-$max_spacer ; $offset++) {
        for ($length= $min ; $length <= $max ; $length++) {
                $string_1= substr ($dna, $offset, $length);
                $string_2= reverse $string_1;
                $string_2=~ tr/[ACGT]/[TGCA]/;
                if ($dna=~ /(($string_1)([ACGT]{$min_spacer,$max_spacer})($string_2))/) {
                        print "IR starts at $offset => $2***$3***$4\n$1\n\n";
                }
        }
}

With parameters: $min = 6 , $max = 12 , $min_spacer = 4 , $max_spacer = 12 The output I get is:

IR starts at 26 => TCGATCG***ATGCTAGCGGCG***CGATCGA
TCGATCGATGCTAGCGGCGCGATCGA

IR starts at 27 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 118 => CGGATG***GCCAAGGC***CATCCG
CGGATGGCCAAGGCCATCCG

IR starts at 118 => CGGATGG***CCAAGG***CCATCCG
CGGATGGCCAAGGCCATCCG

IR starts at 118 => CGGATGGC***CAAG***GCCATCCG
CGGATGGCCAAGGCCATCCG

IR starts at 119 => GGATGG***CCAAGG***CCATCC
GGATGGCCAAGGCCATCC

IR starts at 119 => GGATGGC***CAAG***GCCATCC
GGATGGCCAAGGCCATCC

IR starts at 120 => GATGGC***CAAG***GCCATC
GATGGCCAAGGCCATC

IR starts at 136 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 164 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 252 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 254 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT

IR starts at 254 => ATCGATCG***ATGCTAGCGGCG***CGATCGAT
ATCGATCGATGCTAGCGGCGCGATCGAT

IR starts at 255 => TCGATCG***ATGCTAGCGGCG***CGATCGA
TCGATCGATGCTAGCGGCGCGATCGA

IR starts at 256 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 258 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT

IR starts at 274 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 276 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT

IR starts at 304 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT

IR starts at 304 => ATCGATCG***ATGCTAGCGGCG***CGATCGAT
ATCGATCGATGCTAGCGGCGCGATCGAT

IR starts at 305 => TCGATCG***ATGCTAGCGGCG***CGATCGA
TCGATCGATGCTAGCGGCGCGATCGA

IR starts at 306 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 314 => GATGGC***CAAG***GCCATC
GATGGCCAAGGCCATC

However when I check the region of my first hit (highlighted in bold in input), the offset of this hit doesn't seem to be at position 26. Could anyone enlighten me what's wrong with my code? Thanks.

2

There are 2 answers

1
heathobrien On BEST ANSWER

Your problem is your regex is looking for a palindromes anywhere in the sequence, not just at the location of the offset. "ATCGATCG" occurs at offset 26, so it matches. You need to add some positional information to your regex. Try something like

/^[ACGT]{$offset}(($string_1)([ACGT]{$min_spacer,$max_spacer})($string_2))/
0
Patrick J. S. On

Here is one solution, it uses the experimental (??{}) feature, which is said to change for a long time, but hasn't yet.

How it works: it calls the subroutine convert from within the regular expression and converts the first matching group into a desired regular expression of the outputstring. the rest (backtracking etc) is handled by the regex engine. Sadly interpolating variables as delimiting lengths doesn't sit well with the regex parsing, so I had to use a string to do that. Refrain from that if possible.

use warnings;
use strict;
use 5.01;
use re 'eval'; # needed, because of (??{})

my %c=
  (min_pali   =>    (shift) // 6,
   max_pali   =>    (shift) // 12,
   min_spacer =>    (shift) // 4,
   max_spacer =>    (shift) // 12,
  );

my $re1 = "(.{$c{min_pali},$c{max_pali}})(.{$c{min_spacer},$c{max_spacer}})(??{convert})";
while(<DATA>){
  chomp;
  $_ = uc $_;
  my $converted;
  sub convert {
    my $var = reverse $1;
    $var =~ tr{ACGT}{TGCA};
    $converted =  $var;
  }
  while (/$re1/g) {
    printf "%3d => %s**%s**%s\n", $-[0],$1,$2,$converted;
    pos = $-[0] + 1; # start next match one character after the last match start
  }
}

__DATA__
cgtacacgagtagtcgtagctgtcagtcgatcgtacgtacgtagctgctgtagcactatcgaccccacacgtgtgtacacgatgcacagtcgtctatcacatgctagcgctgcccgtacgGATGGCCAAGGCCATCcgatcgctagctagcgccgcgcgtagcccgatcgagacatgctagcagttgtgctgatgtcgagatagctgtgatgcgatgctagcgccgcctagccgcctcgtgtaggctggatgcgatcgatcgatgctagcggcgcgatcgatgcactagccgtagcgctagctgatcgatcgtaGATGGCCAAGGCCATCcgcgtagatacgacatgccgggggtatataa

OUTPUT:

118 => CGGATG**GCCAAGGC**CATCCG
119 => GGATGG**CCAAGG**CCATCC
120 => GATGGC**CAAG**GCCATC
254 => ATCGATCG**ATGCTAGCGGCG**CGATCGAT
255 => TCGATCG**ATGCTAGCGGCG**CGATCGA
256 => CGATCG**ATGCTAGCGGCG**CGATCG
258 => ATCGAT**GCTAGCGGCGCG**ATCGAT
314 => GATGGC**CAAG**GCCATC

Also I'm not sure if that is a problem, but you could produce longer palidrome sequences that just get shifted into the spacer with this solution:

 Assuming length 2 – 4, spacer= 2 – 4 (X's are unintresting bits)
 ACACAXXTGTGT => ACAC**AXXT**GTGT