Edit line names with a new name containing an incremented value


This seems like a simple task, but getting it to work is proving more difficult than I expected.

I have a FASTA file containing several million lines of text (but only a few hundred individual sequence entries), and the sequence names are long. I want to replace everything after the > that starts each header with Contig $n, where $n is an integer that starts at 1 and is incremented for each replacement.

An example of the input:

>NODE:345643RD:Cov_456:GC47:34thgd
ATGTCGATGCGT
>NODE...
ATGCGCTTACAC

I then want the output to look like this:

>Contig 1
ATGTCGATGCGT
>Contig 2
ATGCGCTTACAC

So maybe a Perl script? I know some basics, but I'd like to read in a file and then write a new file with the changes, and I'm unsure of the best way to do this. I've seen some Perl one-liner examples, but none did what I wanted.

$n = 1

if { 

    s/>.*/(Contig)++$n/e

    ++$n
}

There are 5 answers

0
shivams On

Try something like this:

#!/usr/bin/perl -w

use strict;

open (my $fh, '<', 'example.txt') or die "Cannot open example.txt: $!";
open (my $fh1, '>', 'example2.txt') or die "Cannot open example2.txt: $!";

my $n = 1;

# For each line of the input file
while(<$fh>) {

    # Try to update the name, if successful, increment $n
    if ($_ =~ s/^>.*/>Contig $n/) { $n++; }

    print $fh1 $_;
}
0
josifoski On

I'm no awk expert (far from it), but I solved this out of curiosity, and because sed has no variables, which limits what it can do here.

One possible gawk solution could be

awk -v n=1 '/^>/{print ">Contig " n; n++; next}1' <file
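
Since the question asks for a new file rather than output on the terminal, here is a minimal sketch of the same awk idea wrapped in a shell function that writes to a second file. The function name renumber_fasta and its two arguments are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper (not part of the original answer):
# $1 = input FASTA file, $2 = output FASTA file.
renumber_fasta() {
    # Header lines (starting with ">") are replaced by ">Contig N";
    # every other line is printed unchanged.
    awk '/^>/ { print ">Contig " (++n); next } { print }' "$1" > "$2"
}
```

It would be called as renumber_fasta input.fasta renamed.fasta (both file names are placeholders), leaving the original file untouched.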
2
stevieb On
perl -i -pe 's/>.*/">Contig " . ++$c/e;' file.txt

Output:

>Contig 1
ATGTCGATGCGT
>Contig 2
ATGCGCTTACAC
0
Ed Morton On
$ awk '/^>/{$0=">Contig "++n} 1' file
>Contig 1
ATGTCGATGCGT
>Contig 2
ATGCGCTTACAC
0
mob On

When you use the /e modifier, Perl expects the replacement side of the substitution to be a valid Perl expression, and (Contig)++$n is not one. Try something like

s/>.*/">Contig " . ++$n/e