I have a large FASTA file with bacterial species names in each of the sequence headers. It looks something like this:
file.fasta
>Bacteria;Actinobacteria;Actinobacteria;Streptomyces;Streptomycetaceae;Streptomyces;Streptomyces_sp._AA4;
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Bacteria;Actinobacteria;Actinobacteria;Pseudonocardiales;Pseudonocardiaceae;Amycolatopsis;Amycolatopsis_niigatensis;
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
What I would like to do is search each header for a single species, Streptomyces, and replace the entire header with just "Streptomyces" if it is listed; otherwise, replace the entire header with "Not Streptomyces":
new_file.fasta
>Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Not Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
My first instinct is to use something like awk or sed to do this replacement, but I run into trouble figuring out how to replace the entire string.
How should I go about this?
In any awk you can do:
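A minimal sketch of a portable (POSIX awk) version, not necessarily the only way to write it. It assumes the genus appears as a complete ;-delimited field in the header and that the input is named file.fasta as in the question; the two sample records are recreated here only so the snippet runs on its own:

```shell
# Recreate the two sample records from the question.
cat > file.fasta <<'EOF'
>Bacteria;Actinobacteria;Actinobacteria;Streptomyces;Streptomycetaceae;Streptomyces;Streptomyces_sp._AA4;
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Bacteria;Actinobacteria;Actinobacteria;Pseudonocardiales;Pseudonocardiaceae;Amycolatopsis;Amycolatopsis_niigatensis;
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
EOF

awk '
/^>/ {                                 # header line
    if ($0 ~ /[>;]Streptomyces;/)      # genus as a whole ;-delimited field
        print ">Streptomyces"
    else
        print ">Not Streptomyces"
    next                               # header handled, skip the rule below
}
{ print }                              # sequence lines pass through unchanged
' file.fasta > new_file.fasta

cat new_file.fasta
```

Because the pattern requires a trailing ";", a field such as Streptomycetaceae or Streptomyces_sp._AA4 on its own would not count as a match. Writing to new_file.fasta leaves the original file untouched.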
Or with GNU awk for the word boundary regex:
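A sketch of what the word-boundary form might look like, assuming GNU awk is installed as gawk. The \< and \> operators are a gawk extension, so other awks (mawk, BSD awk) will not treat them as word boundaries. The sample file is again recreated only to keep the snippet self-contained:

```shell
# Recreate the two sample records from the question.
cat > file.fasta <<'EOF'
>Bacteria;Actinobacteria;Actinobacteria;Streptomyces;Streptomycetaceae;Streptomyces;Streptomyces_sp._AA4;
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Bacteria;Actinobacteria;Actinobacteria;Pseudonocardiales;Pseudonocardiaceae;Amycolatopsis;Amycolatopsis_niigatensis;
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
EOF

if command -v gawk >/dev/null 2>&1; then
    # \<...\> matches Streptomyces only as a whole word (GNU awk extension).
    gawk '/^>/ { print ($0 ~ /\<Streptomyces\>/ ? ">Streptomyces" : ">Not Streptomyces"); next } 1' \
        file.fasta > new_file.fasta
    cat new_file.fasta
else
    echo "gawk not found; this variant needs GNU awk" >&2
fi
```

Note that gawk counts the underscore as a word character, so a header containing only Streptomyces_sp._AA4 (without a separate Streptomyces field) would be labelled "Not Streptomyces" by this pattern.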
Or more tersely:
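One terse form, as a sketch that should work in any POSIX awk: overwrite $0 on header lines and let the trailing 1 print every line. The ;-delimited match is the same assumption as before, and the sample file is recreated so the snippet stands alone:

```shell
# Recreate the two sample records from the question.
cat > file.fasta <<'EOF'
>Bacteria;Actinobacteria;Actinobacteria;Streptomyces;Streptomycetaceae;Streptomyces;Streptomyces_sp._AA4;
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Bacteria;Actinobacteria;Actinobacteria;Pseudonocardiales;Pseudonocardiaceae;Amycolatopsis;Amycolatopsis_niigatensis;
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
EOF

# On header lines, replace the whole record; the bare 1 prints every line.
awk '/^>/ { $0 = ($0 ~ /[>;]Streptomyces;/ ? ">Streptomyces" : ">Not Streptomyces") } 1' \
    file.fasta > new_file.fasta

cat new_file.fasta
```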
Or, if you can rely on every header starting with ">Bacteria;" and ending in ";" (as in your example), then you can do (in any awk):
Ruby:
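Any of those prints, for the sample input:
>Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Not Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG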