sort fasta by sequence size

1.8k views Asked by At

I currently want to sort a hudge fasta file (+10**8 lines and sequences) by sequence size. fasta is a clear defined format in biology use to store sequence (genetic or proteic):

>id1

sequence 1 # could be on several line

>id2

sequence 2

...

I have run a tools that give me in tsv format:

the Identifiant, the length, and the position in bytes of the identifiant.

for now what I am doing is to sort this file by the length column then I parse this file and use seek to retrieve the corresponding sequence then append it to a new file.

# this fonction will get the sequence using seek
def get_seq(file, bites):  

    with open(file) as f_:
        f_.seek(bites, 0) # go to the line of interest
        line = f_.readline().strip() # this line is the begin of the 
                                     #sequence
        to_return = "" # init the string which will contains the sequence

        while not line.startswith('>') or not line:  # while we do not 
                                                     # encounter another identifiant
        to_return += line
        line = f_.readline().strip()

    return to_return
# simply append to a file the id and the sequence
def write_seq(out_file, id_, sequence):

    with open(out_file, 'a') as out_file:
        out_file.write('>{}\n{}\n'.format(id_.strip(), sequence))

# main loop will parse the index file and call the function defined below
with open(args.fai) as ref:

    indice = 0

    for line in ref:

        spt = line.split()
        id_ = spt[0]
        seq = get_seq(args.i, int(spt[2]))
        write_seq(out_file=args.out, id_=id_, sequence=seq)

my problems is the following is really slow does it is normal (it takes several days)? Do I have another way to do it? I am a not a pure informaticien so I may miss some point but I was believing to index files and use seek was the fatest way to achive this am I wrong?

4

There are 4 answers

6
heathobrien On BEST ANSWER

Seems like opening two files for each sequence is probably contibuting to a lot to the run time. You could pass file handles to your get/write functions rather than file names, but I would suggest using an established fasta parser/indexer like biopython or samtools. Here's an (untested) solution with samtools:

subprocess.call(["samtools", "faidx", args.i])
with open(args.fai) as ref:

    for line in ref:

        spt = line.split()
        id_ = spt[0]
        subprocess.call(["samtools", "faidx", args.i, id_, ">>", args.out], shell=True)
6
pjanek On

What about bash and some basic unix commands (csplit is the clue)? I wrote this simple script, but you can customize/improve it. It's not highly optimized and doesn't use index file, but nevertheless may run faster.

csplit -z -f tmp_fasta_file_ $1 '/>/' '{*}'

for file in tmp_fasta_file_*
do
  TMP_FASTA_WC=$(wc -l < $file | tr -d ' ')
  FASTA_WC+=$(echo "$file $TMP_FASTA_WC\n")
done

for filename in $(echo -e $FASTA_WC | sort -k2 -r -n | awk -F" " '{print $1}')
do
  cat "$filename" >> $2
done

rm tmp_fasta_file*

First positional argument is a filepath to your fasta file, second one is a filepath for output, i.e. ./script.sh input.fasta output.fasta

1
bli On

Using a modified version of fastq-sort (currently available at https://github.com/blaiseli/fastq-tools), we can convert the file to fastq format using bioawk, sort with the -L option I added, and convert back to fasta:

cat test.fasta \
    | tee >(wc -l > nb_lines_fasta.txt) \
    | bioawk -c fastx '{l = length($seq); printf "@"$name"\n"$seq"\n+\n%.*s\n", l, "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII"}' \
    | tee >(wc -l > nb_lines_fastq.txt) \
    | fastq-sort -L \
    | tee >(wc -l > nb_lines_fastq_sorted.txt) \
    | bioawk -c fastx '{print ">"$name"\n"$seq}' \
    | tee >(wc -l > nb_lines_fasta_sorted.txt) \
    > test_sorted.fasta

The fasta -> fastq conversion step is quite ugly. We need to generate dummy fastq qualities with the same length as the sequence. I found no better way to do it with (bio)awk than this hack based on the "dynamic width" thing mentioned at the end of https://www.gnu.org/software/gawk/manual/html_node/Format-Modifiers.html#Format-Modifiers.

The IIIII... string should be longer than the longest of the input sequences, otherwise, invalid fastq will be obtained, and when converting back to fasta, bioawk seems to silently skip such invalid reads.

In the above example, I added steps to count the lines. If the line numbers are not coherent, it may be because the IIIII... string was too short.

The resulting fasta file will have the shorter sequences first. To get the longest sequences at the top of the file, add the -r option to fastq-sort.

Note that fastq-sort writes intermediate files in /tmp. If for some reason it is interrupted before erasing them, you may want to clean your /tmp manually and not wait for the next reboot.

Edit

I actually found a better way to generate dummy qualities of the same length as the sequence: simply using the sequence itself:

cat test.fasta \
    | bioawk -c fastx '{print "@"$name"\n"$seq"\n+\n"$seq}' \
    | fastq-sort -L \
    | bioawk -c fastx '{print ">"$name"\n"$seq}' \
    > test_sorted.fasta

This solution is cleaner (and slightly faster), but I keep my original version above because the "dynamic width" feature of printf and the usage of tee to check intermediate data length may be interesting to know about.

2
elcortegano On

You can also do it very conveniently with awk, check the code below:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  input.fasta  |\
awk -F '\t' '{printf("%d\t%s\n",length($2),$0);}' |\
sort -k1,1n | cut -f 2- | tr "\t" "\n"

This and other methods have been posted in Biostars (e.g. using BBMap's sortbyname.sh script), and I strongly recommend this community for questions such like this one.