Substitute one pattern with another from two lists using while read line and sed

I am trying to replace some patterns with other patterns in several files. For example, my input file looks like this:

>Genus_species_SRR13259292|ENSG00000000457_ENST00000367772
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGCTGGGCCTGCGAGACACCAGCACCCCCATCGTGGCCATCACCCTGCACAGCCTCGCCGTGCTGGTCTCCCTGCTCGGACCAGAGGTGGTTGTGGGCGGAGAAAGAACCAAGATCTTCAAACGCACTGCCCCCAGCTTTACAAAAACCACTGACCTCTCCCCAGAAGAC

and I want output:

>Genus_species_Something_something|ENSG00000000457_ENST00000367772
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGCTGGGCCTGCGAGACACCAGCACCCCCATCGTGGCCATCACCCTGCACAGCCTCGCCGTGCTGGTCTCCCTGCTCGGACCAGAGGTGGTTGTGGGCGGAGAAAGAACCAAGATCTTCAAACGCACTGCCCCCAGCTTTACAAAAACCACTGACCTCTCCCCAGAAGAC

I have two list files, one with the old patterns:

Genus_species_SRR13259292

and one with the new patterns:

Genus_species_Something_something

I tried to do this with sed. Here is my command:

while IFS= read -r line1 && IFS= read -r line2 <&3; do
    for f in *.fasta; do
        sed -e "s/${line1}/${line2}/g" "$f" > "${f%.fasta}_NewName.fasta"
    done
done < "List_oldpattern.txt" 3<"List_newpatterns.txt"

But this doesn't work. Maybe it is because the pattern is delimited by > and |?

If sed can't do this, would it be possible with awk?

There are 3 answers

markp-fuso On BEST ANSWER

Since the question is tagged with awk, I propose replacing all of OP's current code with a single awk script.

My sample .fasta files:

$ head f?.fasta
==> f1.fasta <==
>Genus_species_SRR13259292|ENSG00000000457_ENST00000367772           # change
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....

>Genus_buckets_ABC13259292|ENSG00000000457_ENST00000367772           # do not change
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....

==> f2.fasta <==
>Genus_species_SRR13259292|ENSG00000000457_ENST00000367772           # change
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....

>Genus_buckets_ABC13259292|ENSG00000000457_ENST00000367772           # do not change
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....

NOTE: the actual files do not contain the trailing # comments

We'll use the paste command to join each of OP's old/new pattern pairs onto a single line, with | as the delimiter:

$ paste -d'|' List_oldpattern.txt List_newpatterns.txt
Genus_species_SRR13259292|Genus_species_Something_something
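Because paste pairs the two lists purely by line position, lists of different lengths or with stray Windows line endings would give wrong or non-matching pairs without any error. A minimal sanity check might look like the sketch below; the function name check_lists is made up for illustration and is not part of the original answer:

```shell
# check_lists OLD NEW: hypothetical helper; returns non-zero if the two
# list files cannot safely be paired line by line.
check_lists() {
    list_old=$1 list_new=$2
    n_old=$(wc -l < "$list_old")
    n_new=$(wc -l < "$list_new")
    if [ "$n_old" -ne "$n_new" ]; then
        echo "line counts differ: $n_old vs $n_new" >&2
        return 1
    fi
    cr=$(printf '\r')                      # a literal carriage return
    if grep -q "$cr" "$list_old" "$list_new"; then
        echo "carriage returns found; convert the lists to Unix line endings first" >&2
        return 1
    fi
    return 0
}
```

It could be run as check_lists List_oldpattern.txt List_newpatterns.txt before calling paste.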

Now the awk script:

awk '
BEGIN     { FS = OFS = "|" }                    # input/output field delimiter
FNR==NR   { map[">" $1] = ">" $2; next }        # 1st file (paste output): populate our map[] array; $1==old $2==new; then skip to next input line
FNR==1    { close(outf)                         # 2nd-nth files: 1st record; close previous output file
            outf = FILENAME                     # make copy of input FILENAME
            sub(/\.fasta$/,"",outf)             # strip trailing ".fasta"
            outf = outf "_NewName.fasta"        # append new suffix to our output filename
          }
$1 in map { $1 = map[$1] }                      # if 1st field (">some_string") is an index in the map[] array then replace 1st field with array contents
          { print > outf }                      # print current line to output file

' <(paste -d'|' List_oldpattern.txt List_newpatterns.txt) *.fasta

NOTE: assuming OP has more than one old/new pattern pair, this script has the added benefit of scanning each *.fasta file only once. OP's current while/read/for/sed loop scans each .fasta file N times, where N is the number of old/new pattern pairs.

This generates:

$ head *_NewName.fasta
==> f1_NewName.fasta <==
>Genus_species_Something_something|ENSG00000000457_ENST00000367772   # changed
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....

>Genus_buckets_ABC13259292|ENSG00000000457_ENST00000367772           # not changed
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....

==> f2_NewName.fasta <==
>Genus_species_Something_something|ENSG00000000457_ENST00000367772   # changed
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....

>Genus_buckets_ABC13259292|ENSG00000000457_ENST00000367772           # not changed
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
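For comparison, a sed-only variant is also possible. In the loop from the question, every old/new pair rereads the original file and overwrites the same output file, so with several pairs only the last substitution would survive. A rough sketch that instead collects all substitutions into one sed script and applies it once per file follows. The helper name rename_headers is invented here, and the sketch assumes every header has a | right after the name and that the names contain no #, & or regex metacharacters:

```shell
# rename_headers OLDLIST NEWLIST FILE...: hypothetical helper.
# Assumes names are plain text (no "#", "&", "\" or regex metacharacters).
rename_headers() {
    oldlist=$1 newlist=$2
    shift 2
    script=$(mktemp) || return 1
    # Write one substitution per old/new pair, anchored on the leading ">"
    # and the "|" that follows the name ("|" is literal in basic regex).
    while IFS= read -r old && IFS= read -r new <&3; do
        printf 's#^>%s|#>%s|#\n' "$old" "$new"
    done < "$oldlist" 3< "$newlist" > "$script"
    for f in "$@"; do
        case $f in *_NewName.fasta) continue ;; esac   # skip outputs of earlier runs
        sed -f "$script" "$f" > "${f%.fasta}_NewName.fasta"
    done
    rm -f "$script"
}
```

A call such as rename_headers List_oldpattern.txt List_newpatterns.txt *.fasta would then read each input file once.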
jhnc On
awk -F'|' '
    FNR==1 { n++ }
    n==1 {    f[FNR]   = ">"$0; next }
    n==2 { t[ f[FNR] ] = ">"$0; next }

    { print $1 in t ? t[$1] substr($0,length($1)+1) : $0 }
' fromlist tolist fastafile
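  • store the "from" strings from the first file, each prefixed with >
  • build a from→to map using the corresponding lines of the second file
  • split fasta lines on |
  • if the first field is in the map, print its replacement followed by the rest of the line; otherwise print the line as-is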
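The command above writes the converted records of one file to standard output. To get one _NewName.fasta file per input, as in the question, it could be wrapped in a loop along these lines. This is only a sketch: the function name convert_all is hypothetical, both lists are assumed to be non-empty, and the print expression has been parenthesized for portability across awk implementations:

```shell
# convert_all FROMLIST TOLIST FILE...: hypothetical wrapper around the awk script
convert_all() {
    fromlist=$1 tolist=$2
    shift 2
    for fasta in "$@"; do
        awk -F'|' '
            FNR==1 { n++ }                           # count input files
            n==1 { f[FNR] = ">" $0; next }           # 1st file: "from" names
            n==2 { t[ f[FNR] ] = ">" $0; next }      # 2nd file: from -> to map
            { print (($1 in t) ? t[$1] substr($0, length($1)+1) : $0) }
        ' "$fromlist" "$tolist" "$fasta" > "${fasta%.fasta}_NewName.fasta"
    done
}
```

For example: convert_all fromlist tolist f1.fasta f2.fasta. Unlike the single-script approach, this rereads both lists for every fasta file, which should not matter for short lists.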
potong On

This might work for you (GNU sed):

sed -i.bak -E '1{x;s/.*/paste fileOld fileNew/e;x};G
               s/^>([^|]+)([^\n]+).*\n\1\t([^\n]+)/>\3\2/;P;d' file ...

This solution stores the pasted old and new lists in the hold space, appends that copy to each line, and uses pattern matching to replace the old name with the new one.

The files are edited in place by the -i option, and a backup of each file is made with the extension .bak.