I have a UniProt flat file (https://www.uniprot.org/uniprot/P30988.txt is the one I'm using, but the script is supposed to work with any UniProt flat file) and I want to take the transmembrane parts of the sequence and print them out. Here is what I have done so far:
#!usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
if($_=~ m/^ID/) {
print $_ ; }
if($_=~ m/^AC/){
print $_ ; }
#if ($_=~ m/^SQ/){
#print $_; }
if ($_=~ m/^\s+(.+)/) {
$seq.=$1; # Here I match ONLY the sequence
$seq=~s/\s//g; print $seq; # Here I removed spaces from the sequence
}
if($_=~ m/^FT\s+TRANSMEM\s+(\d+)\.\.(\d+)/){
$s=$1; $e=$2;} # Now I know that
# this is the way to match the lines that mention the
# transmembrane parts and the parentheses are the number of
# the amino acids on the sequence BUT as it has multiple
# transmembrane domains I am not very sure how to handle it.
$tr=substr ($seq, $s-1, $e-$s+1); # here I try to use the numbers
# given in the file to get the sequences of each part
push (@transmembrane, $tr);
print @transmembrane; # Now this is something I tried and while
# it printed a sequence it was not the same length as the total
# amount of amino acids that are in the transmembrane parts. Not
# sure whats up with that.
The issue I have is that I'm not sure how to deal with the fact that $1 and $2 are not one value but as many as there are transmembrane parts.
If any of you know how to make this work I'd appreciate it!
[ As with @choroba I have no idea what a transmembrane is, so I'm guessing in the code below ]
First thing: your script doesn't compile. You are missing the closing } for the while block.
Next observation is in this block of code (I've removed your comments and reformatted to make the code clearer)
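The part of the loop body in question can be tried in isolation. The following is a minimal sketch reconstructed from the question, not a verbatim copy: the toy sequence and the three toy input lines are made up so that it runs on its own, the variables are declared with my, and it assumes the missing } closes the loop at the very end.

```perl
use strict;
use warnings;

# Toy data (made up for illustration): one TRANSMEM line between two other lines.
my @lines = (
    "ID   TOY_ENTRY\n",
    "FT   TRANSMEM        3..5\n",
    "FT   TOPO_DOM        6..8\n",
);
my $seq = 'ABCDEFGH';    # toy sequence, already filled in here
my ($s, $e) = (1, 1);    # placeholder offsets until the first match
my @transmembrane;

for (@lines) {
    if ($_ =~ m/^FT\s+TRANSMEM\s+(\d+)\.\.(\d+)/) {
        $s = $1;
        $e = $2;
    }
    # These three statements sit outside the if, so they run for every
    # line, whether or not it was a TRANSMEM line.
    my $tr = substr($seq, $s - 1, $e - $s + 1);
    push(@transmembrane, $tr);
    print "@transmembrane\n";
}
```

With this toy input the array ends up with three entries (A, CDE, CDE) although only one line matched, and it is printed three times, once per line read.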
I'm guessing that you want the $tr variable updated and added to @transmembrane every time the regex matches. Unfortunately, these two lines are not in the scope of the if statement, so they do not run once per match.
Next, the statement print @transmembrane; is in the scope of the while block, so it will be output after every line is read from P30988.txt. It needs to be moved out of the while block scope.
The next, very fundamental issue is that you are running the substr statement against the $seq variable before it has been populated. Looking at P30988.txt, I see that the FT TRANSMEM lines come before the sequence data, so $seq will still be empty when the substr statement runs.
That means you need to store the FT TRANSMEM offsets and run the substr commands once the while loop terminates.
Here is a rewrite that does that (note I've commented out the printing of $seq)
I get this when I run the modified code
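For reference, the approach described above (collect the offsets while reading, run substr only after the loop) might be sketched as follows. This is a minimal sketch and not necessarily identical to the rewrite mentioned here: the sub name extract_transmembrane and the toy record are made up for illustration, and it assumes, as the question's regex does, that only sequence lines start with whitespace. Printing the ID and AC lines is left out to keep it short.

```perl
use strict;
use warnings;

# Hypothetical helper: read flat-file text from a filehandle and return
# the full sequence followed by one string per FT TRANSMEM range.
sub extract_transmembrane {
    my ($fh) = @_;
    my $seq = '';
    my @ranges;    # [start, end] pairs, 1-based as given in the file

    while (my $line = <$fh>) {
        if ($line =~ /^FT\s+TRANSMEM\s+(\d+)\.\.(\d+)/) {
            push @ranges, [$1, $2];          # remember the offsets for later
        }
        elsif ($line =~ /^\s+(.+)/) {
            (my $chunk = $1) =~ s/\s//g;     # strip spaces inside the sequence line
            $seq .= $chunk;
        }
    }

    # Only now is $seq complete, so the substr calls are safe.
    my @segments = map { substr($seq, $_->[0] - 1, $_->[1] - $_->[0] + 1) } @ranges;
    return ($seq, @segments);
}

# Toy record (made up) read through an in-memory filehandle.
my $toy = <<'END';
ID   TOY_HUMAN   Reviewed;   12 AA.
AC   X00000;
FT   TRANSMEM        2..4
FT   TRANSMEM        7..9
SQ   SEQUENCE   12 AA;
     ABCDE FGHIJ KL
//
END

open my $fh, '<', \$toy or die "Cannot open in-memory file: $!";
my ($seq, @segments) = extract_transmembrane($fh);
close $fh;

print "$_\n" for @segments;    # prints BCD, then GHI
```

To run it on the real file, open P30988.txt instead of the in-memory string, for example open my $fh, '<', 'P30988.txt' or die "Cannot open P30988.txt: $!";. Storing the ranges as pairs in an array is what handles the "many values of $1 and $2" problem from the question: each match adds one pair, and the map at the end turns each pair into one substring.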