Add filename to fasta headers of multiple fasta files inside loop


I have 10 fasta files, one per sample, each containing 20 gene sequences. I would like to create 20 files, one per gene, each holding that gene's sequence from all 10 samples. For a single file, I extract a gene and add the file name to its header as follows:

pyfasta extract --header --fasta test.fasta gene_name1 | awk '/^>/ {$0=$0 "_file1"}1' > gene_name1.fasta

I can also build the per-gene fasta files across samples without the file name in the header (these lines are part of a loop over $sample):

pyfasta extract --header --fasta $sample.fasta gene_name1 >> gene_name1.fasta 
pyfasta extract --header --fasta $sample.fasta gene_name2 >> gene_name2.fasta

However, I am unable to add the file name to the headers inside the loop (it works for a single file, as shown at the beginning).

Overall, my aim is to extract the genes with the same name from all the (multi-line) fasta files and write one fasta file per gene. Each header should contain both the gene name and the file name, so that I know which file each sequence came from, and the sequences from every sample should be appended to the file named after that gene. Here are sample input and output files:

Input files:
#file1.fasta

>gene1
ATGC..............................max upto 120 characters per line
TTTG..............................................................
>gene2
ATGA
>gene3
ATGTTT

#file2.fasta

>gene1
ATGG
>gene2
ATGC
>gene3
ATGTT

Expected output files:

#gene1.fasta
>gene1_file1
ATGC...........................................................
TTTG...........................................................
>gene1_file2
ATGG

#gene2.fasta
>gene2_file1
ATGA
>gene2_file2
ATGC

Kindly guide. Thanks.

There is 1 answer

Answer by Ed Morton:

Your question isn't clear but it sounds like all you need is:

... | awk -v fname="$sample" '/^>/ {$0=$0 "_" fname}1'
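
To show how that snippet might fit into the loop from the question, here is a minimal sketch. It is not part of the original answer. The function names `tag_headers` and `extract_all` and the sample and gene lists are made up for illustration, and the `pyfasta` call is copied from the question as-is. The `-v` option matters because a shell variable such as `$sample` is not expanded inside the single-quoted awk program, so it has to be passed in as an awk variable.

```shell
# Append "_<tag>" to every FASTA header line (lines starting with ">") read
# from stdin; sequence lines pass through unchanged.
# tag_headers is a hypothetical helper name, not from the original answer.
tag_headers() {
    awk -v fname="$1" '/^>/ {$0 = $0 "_" fname} 1'
}

# Sketch of the full loop. The sample and gene lists are placeholders;
# the pyfasta invocation is the one used in the question.
extract_all() {
    for sample in file1 file2; do
        for gene in gene1 gene2 gene3; do
            pyfasta extract --header --fasta "$sample.fasta" "$gene" |
                tag_headers "$sample" >> "$gene.fasta"
        done
    done
}
```

Because the output is appended with `>>`, any `gene*.fasta` files left over from an earlier run should probably be removed before calling `extract_all`, otherwise the sequences would be duplicated.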