renaming fasta file using a mapping txt file

79 views Asked by At

Edit solution: https://tejashree1modak.github.io/bioblogs/fasta_rename/

I am looking to rename my scaffolds using a scaffold_mapping.txt file in unix where the .txt file looks like:

$ head scaffold_mapping.txt 
>#ID_covAvg_fold_lengthLength
>scaffold_1_c1_cov61.3780_length417825
>scaffold_3_c1_cov45.0025_length77714
>scaffold_4_c1_cov84.2432_length70007
>scaffold_5_c2_cov57.6219_length67890
>scaffold_6_c1_cov331.1665_length65908
>scaffold_7_c1_cov138.5574_length64984
>scaffold_9_c1_cov77.1170_length59223
>scaffold_2_c2_cov51.1554_length55365
>scaffold_11_c1_cov44.1476_length53538

Each scaffold in the fasta file is currently named like this:

> scaffold_1_c1

And I would like their names to match the scaffold_mapping.txt file, so that the previous example would then be:

> scaffold_1_c1_cov61.3780_length417825

I hoped it'd be easy with sed but the '>' is complicating matters

$ sed -f scaffold_mapping1.txt assembly.contigs.fasta > output1.fasta
sed: file scaffold_mapping1.txt line 1: unknown command: `>'
2

There are 2 answers

2
Ed Morton On

It's not that "the '>' is complicating matters", you're just telling sed to interpret a file that doesn't contain a sed script.

The question isn't clear but best I can tell this is what the OP wants, using any POSIX awk:

awk -F'[> ]+' '
    NR==FNR {
        match($2,/([^_]+_){3}/)
        map[substr($2,1,RLENGTH-1)] = $2
        next
    }
    /^>/ && ($2 in map) {
        $0 = "> " map[$2]
    }
    { print }
' scaffold_mapping.txt the_fasta_file

which will output the posted expected output:

> scaffold_1_c1_cov61.3780_length417825

from the posted sample input.

0
Timur Shtatland On

Create a mapping file first. Use any scripting language, such as Perl. Then use the mapping file to replace the FASTA headers:

tail -n +2 scaffold_mapping.txt | perl -lpe 's{>((scaffold_\d+_c\d+).+)}{${2}\t${1}};' > map.tsv

perl -lpe 'BEGIN { %new = map { chomp; split; } `cat map.tsv`; } s{^>(\S+)}{>$new{$1}}; ' assembly.contigs.fasta > out.fasta

Contents of out.fasta:

>scaffold_1_c1_cov61.3780_length417825
ACTG
>scaffold_1_c1_cov61.3780_length417825
ACTGACTG
>scaffold_4_c1_cov84.2432_length70007
ACTGACTGACTG
231031_1055 m-apd-shtatland test% 

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.

BEGIN { ... } : Execute the code before running the rest of the code, here, before parsing the fasta file.
%new = map { chomp; split; } `cat map.tsv`; : slurp the entire mapping file, store the results in %new hash.
s{^>(\S+)}{>$new{$1}}; : Change the fasta headers (= lines starting with > from old to new usage using the %new hash. $1 stores the sequence id, that is whatever was captured inside the parentheses.

See also: