Replace the header lines of several sequences in a FASTA file with the species names stored in a list (.txt)

Question

617 views · Asked by shawn_smith_kaviedes on 23 April 2022 at 05:10

I have a FASTA file with several sequences, but the header line of every sequence is the same string (>ABI). I want to replace each header with the name of a species stored in a separate text file.

My FASTA file looks like this:

>ABI
AGCTAGTCCCGGGTTTATCGGCTATAC
>ABI
ACCCCTTGACTGACATGGTACGATGAC
>ABI
ATTTCGACTGGTGTCGATAGGCAGCAT
>ABI
ACGTGGCTGACATGTATGTAGCGATGA

The species list looks like this:

Alsophila cuspidata
Bunchosia argentea
Miconia cf.gracilis
Meliosma frondosa

How can I replace the ABI headers of my sequences with the names of my species, in that exact order?

Required output:

>Alsophila cuspidata
AGCTAGTCCCGGGTTTATCGGCTATAC
>Bunchosia argentea
ACCCCTTGACTGACATGGTACGATGAC
>Miconia cf.gracilis
ATTTCGACTGGTGTCGATAGGCAGCAT
>Meliosma frondosa
ACGTGGCTGACATGTATGTAGCGATGA

I was using something like:

awk '
FNR==NR{
  a[$1]=$2
  next
}
($2 in a) && /^>/{
  print ">"a[$2]
  next
}
1
' spp_list.txt FS="[> ]"  all_spp.fasta

This is not working. Could someone guide me, please?

**Bguess** · Accepted Answer · 2022-04-23T06:24:34+00:00

Hello, I'm not a developer, so please be kind.

Hope this will help you:

I created a file fasta.txt that contains:

>ABI
AGCTAGTCCCGGGTTTATCGGCTATAC
>ABI
ACCCCTTGACTGACATGGTACGATGAC
>ABI
ATTTCGACTGGTGTCGATAGGCAGCAT
>ABI
ACGTGGCTGACATGTATGTAGCGATGA

I also created a file spplist.txt that contains:

Alsophila cuspidata
Bunchosia argentea
Miconia cf.gracilis
Meliosma frondosa

I then created a Python script named fasta.py. Here it is:

#!/usr/bin/env python3

# Read the species names, one per line, skipping blank lines
with open("spplist.txt", "r") as spplist:
    names = [line.strip() for line in spplist if line.strip()]

# Copy fasta.txt to output.txt, swapping each ">ABI" header for the next name
with open("fasta.txt", "r") as fasta, open("output.txt", "w") as output_file:
    for line in fasta:
        if line.startswith(">ABI"):
            # Keep the leading ">" so the line is still a FASTA header
            output_file.write(">" + names.pop(0) + "\n")
        else:
            output_file.write(line.rstrip() + "\n")

# Notify the user at the end of the work
print("job done")

(These three files need to be in the same directory for the script to work as it is.)

Here is my directory tree:

❯ tree
.
├── fasta.py
├── fasta.txt
└── spplist.txt

To execute the script, open a shell, cd into the directory and type:

❯ python3 fasta.py
job done

You will see a new file named output.txt in the directory:

❯ tree
.
├── fasta.py
├── fasta.txt
├── output.txt
└── spplist.txt

And here is its content:

>Alsophila cuspidata
AGCTAGTCCCGGGTTTATCGGCTATAC
>Bunchosia argentea
ACCCCTTGACTGACATGGTACGATGAC
>Miconia cf.gracilis
ATTTCGACTGGTGTCGATAGGCAGCAT
>Meliosma frondosa
ACGTGGCTGACATGTATGTAGCGATGA

Hope this can help you out. bguess.
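
If the number of headers and the number of species names might not match, a slightly more defensive variant could look like the sketch below. It is only an illustration, not part of the original script: the function name rename_headers is made up for this example, it treats every line starting with ">" as a header (not only ">ABI"), and it drops blank lines. It raises an error instead of silently producing a misaligned file.

```python
import io


def rename_headers(fasta_lines, names):
    """Yield FASTA lines with each header replaced by the next name, in order.

    Hypothetical helper: any line starting with ">" counts as a header.
    """
    remaining = iter(names)
    for raw in fasta_lines:
        line = raw.rstrip()
        if line.startswith(">"):
            name = next(remaining, None)
            if name is None:
                raise ValueError("more FASTA headers than species names")
            yield ">" + name
        elif line:
            # Sequence line: pass through unchanged (blank lines are dropped)
            yield line
    # Any name still left over means the two files do not line up
    if next(remaining, None) is not None:
        raise ValueError("more species names than FASTA headers")


# Small in-memory demo using the first two records from the question
fasta = io.StringIO(
    ">ABI\nAGCTAGTCCCGGGTTTATCGGCTATAC\n>ABI\nACCCCTTGACTGACATGGTACGATGAC\n"
)
names = ["Alsophila cuspidata", "Bunchosia argentea"]
result = list(rename_headers(fasta, names))
print("\n".join(result))
```

To use it on real files, the open file object for the FASTA file can be passed as fasta_lines and the stripped lines of the species list as names; the yielded lines can then be written to the output file one per line.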
