How should I go about implementing conditional string replacements in a fasta file?


I have a large FASTA file in which each sequence header holds a bacterial taxonomy string. It looks something like this:

file.fasta

>Bacteria;Actinobacteria;Actinobacteria;Streptomyces;Streptomycetaceae;Streptomyces;Streptomyces_sp._AA4;
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Bacteria;Actinobacteria;Actinobacteria;Pseudonocardiales;Pseudonocardiaceae;Amycolatopsis;Amycolatopsis_niigatensis;
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG

What I would like to do is search each header for a single genus, Streptomyces, and replace the entire header with just "Streptomyces" if it is listed, or otherwise replace the entire header with "Not Streptomyces":

new_file.fasta

>Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Not Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG

My first instinct is to use something like awk or sed to do this replacement, but I run into trouble figuring out how to replace the entire string.

How should I go about this?


There are 4 answers

dawg On BEST ANSWER

In any awk you can do:

awk '/^>/{
        s="Not Streptomyces"
        n=split($0,fields,";")
        for(i=1;i<=n;i++) if (fields[i]=="Streptomyces") s="Streptomyces"
        $0=">" s
} 1
' file

Or with GNU awk for the word boundary regex:

gawk '/^>/ {
    if ($0 ~ /\<Streptomyces\>/)
        $0 = ">Streptomyces"
    else
        $0 = ">Not Streptomyces"
}
1
' file

Or more tersely:

gawk '/^>/ { $0=">" ($0~/\<Streptomyces\>/ ? "" : "Not ") "Streptomyces" }1' file

Or, if you can rely on every header starting with >Bacteria; and ending in ; (as in your example), so that the name is always enclosed in semicolons, you can do this in any awk:

awk '/^>/ { $0=">" ($0~/;Streptomyces;/ ? "" : "Not ") "Streptomyces"  } 1' file

Ruby:

ruby -lpe 'if /^>/ then $_ = /\bStreptomyces\b/ ? ">Streptomyces" : ">Not Streptomyces" end' file 

Any of those prints:

>Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Not Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
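
The whole-field comparison in the first awk command (and the word boundaries in the gawk and Ruby versions) matter when the name appears only as part of a longer field. The following is a minimal sketch of that difference, not part of the original answer. The header is hypothetical and made up for illustration: its genus is not Streptomyces, but one field begins with that text.

```shell
# Hypothetical header: no field equals "Streptomyces", but one field
# (Streptomyces_sp._AA4) starts with that text.
header='>Bacteria;Actinobacteria;Amycolatopsis;Streptomyces_sp._AA4;'

# Unanchored regex: any occurrence of the text counts as a hit.
substring_label=$(printf '%s\n' "$header" |
    awk '/^>/ { $0 = ">" ($0 ~ /Streptomyces/ ? "" : "Not ") "Streptomyces" } 1')

# Whole-field comparison: some ;-separated field must equal the name exactly.
field_label=$(printf '%s\n' "$header" |
    awk -F';' '/^>/ {
        s = "Not Streptomyces"
        for (i = 1; i <= NF; i++) if ($i == "Streptomyces") s = "Streptomyces"
        $0 = ">" s
    } 1')

printf 'substring: %s\nfield:     %s\n' "$substring_label" "$field_label"
```

For this header the unanchored match yields >Streptomyces, while the whole-field comparison yields >Not Streptomyces.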
markp-fuso On

Assumptions:

  • the species name is always enclosed by a pair of semicolons (;)

One awk idea:

awk '
/^>/ { if ($0 ~ /;Streptomyces;/)          # if header line contains ;Streptomyces; then ...
          $0 = ">Streptomyces"             # redefine current line
       else                                # else ...
          $0 = ">Not Streptomyces"         # redefine current line
     }
1                                          # print current line
' fasta.dat

Another awk idea that uses a shell variable to dynamically define the species to search for:

spec='Streptomyces'                        # shell variable assignment

awk -v species="${spec}" '                 # set awk variable "species" to value of shell variable "spec"
/^>/  { if ($0 ~ (";" species ";"))        # if header contains our species then ...
           $0 = ">" species
        else
           $0 = ">Not " species
      }
1
' fasta.dat

These both generate:

>Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Not Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
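
If the variable-driven version is needed for several names, it could be wrapped in a shell function. This is only a sketch: the function name relabel_headers is hypothetical, and it keeps the same assumption that the name is enclosed in semicolons. It uses index() rather than ~ so that the name is compared literally; with ~, a "." in a name such as Streptomyces_sp._AA4 would act as a regex wildcard.

```shell
# relabel_headers NAME [FILE...]   (hypothetical helper, not from the answer)
# Reads the named files, or standard input when no file is given.
relabel_headers() {
    species_arg=$1
    shift
    awk -v species="$species_arg" '
        /^>/ {
            # index() returns 0 when ";NAME;" does not occur in the header;
            # it searches for literal text, so no regex metacharacters apply.
            $0 = ">" (index($0, ";" species ";") ? "" : "Not ") species
        }
        1   # print every line, modified or not
    ' "$@"
}
```

With the file names from the question, a call would look like: relabel_headers Streptomyces file.fasta > new_file.fasta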
potong On

This might work for you (GNU sed):

sed -E 's/^>.*\b(Streptomyces)\b.*/>\1/I;t;s/^>.*/>Not Streptomyces/' file

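If a line begins with > and contains the word Streptomyces, replace the whole line with > plus the matched word (>Streptomyces here). The I flag makes the match case-insensitive, and t then skips the second substitution for that line.

Otherwise, if a line begins with >, replace it with >Not Streptomyces.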
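
The command above relies on GNU extensions (\b and the I flag). Where GNU sed is not available, the same two-step logic could be sketched with POSIX features only. This variant is an assumption-laden alternative, not the answerer's command: it requires the name to be enclosed in semicolons, as in the question's headers, it is case-sensitive, and the function name relabel_sed is hypothetical.

```shell
# relabel_sed [FILE...]   (hypothetical helper; reads stdin when no file is given)
relabel_sed() {
    # 1. Non-header lines: branch to the end of the script and print unchanged.
    # 2. Headers containing ;Streptomyces; become >Streptomyces.
    # 3. t branches to the end if step 2 substituted something.
    # 4. Every remaining header becomes >Not Streptomyces.
    sed -e '/^>/!b' \
        -e 's/^>.*;Streptomyces;.*/>Streptomyces/' \
        -e 't' \
        -e 's/.*/>Not Streptomyces/' "$@"
}
```

With the file names from the question, a call would look like: relabel_sed file.fasta > new_file.fasta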

ufopilot On
$ awk -F';' -v spec=Streptomyces '/^>/{print($0~spec ? ">"spec : ">Not "spec); next}1' file
>Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Not Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG