Faster way than agrep? Quickly find every single-character mismatch


I am looking for the fastest way to find every single-character mismatch between every word in a large file. If I have this:

AAAA
AAAB
AABA
BBBB
CCCC

I would like to get something like this:

AAAA - AAAB AABA
AAAB - AAAA
AABA - AAAA
BBBB
CCCC

Currently I am using agrep, but my file is millions of lines long and it is very slow. Each word is on its own line and they are all the same number of characters. I expect there is something elegant I have not been able to find. Thank you.

Edit: The words are made up of just 5 characters (A, T, C, G, or N) and are just under 100 characters long. The whole thing should fit in memory (<5 GB). There is one word per line and I want to compare it to every other word.

Edit 2: The example was not correct. It is fixed now.


There are 2 answers

ysth (BEST ANSWER, 2 votes)

If you are looking for words that have only a one-character difference, there are a couple of tricks you can use. First, to compare two words and count the number of characters that differ, you can use this:

( $word1 ^ $word2 ) =~ tr/\0//c

This does a stringwise exclusive or on the two words; wherever the characters are the same, a "\0" will result; where they are not the same, a non-"\0" will result. tr, in its complement counting mode, counts the differences.
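
For example (a tiny demonstration of my own, not part of the answer):

my ($word1, $word2) = ('AAAA', 'AABA');
my $diff = ( $word1 ^ $word2 ) =~ tr/\0//c;
print "$diff\n";    # prints 1: the words differ at exactly one position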

Second, note that for a single-character difference either the first half or the last half of the word must match exactly (the one changed character can only fall in one half). So partition the words into a hash by their first and last halves, reducing the number of other words you need to check a given word against.

This approach should take only two or three times the memory of all the strings (plus a little overhead); it could be reduced to one to two times that by pushing \$word instead, then using $$_ in the grep and sort map $$_, @match in the output, at the cost of some speed.

If the words are all the same length, the top level of the hash can be removed and two different hashes used for the words' beginnings and endings.

use strict;
use warnings;
use autodie;
my %strings;

my $filename = shift or die "no filename provided\n";
open my $fh, '<', $filename;

# First pass: bucket every word by its first half and by its last half,
# with a 'b'/'e' prefix and the word length so different lengths don't collide.
while (my $word = readline $fh) {
    chomp $word;
    push @{ $strings{ 'b' . length $word }{ substr($word, 0, length($word)/2) } }, $word;
    push @{ $strings{ 'e' . length $word }{ substr($word, length($word)/2) } }, $word;
}

# Second pass: only words in a word's two buckets can be one substitution away;
# the xor/tr test keeps those that differ at exactly one position.
seek $fh, 0, 0;
while (my $word = readline $fh) {
    chomp $word;
    my @match = grep 1 == ($word ^ $_) =~ tr/\0//c,
        @{ $strings{ 'b' . length $word }{ substr($word, 0, length($word)/2) } },
        @{ $strings{ 'e' . length $word }{ substr($word, length($word)/2) } };
    if (@match) {
        print "$word - " . join( ' ', sort @match ) . "\n";
    }
    else {
        print "$word\n";
    }
}

Note that this only looks for substitutions, not insertions, deletions, or transpositions.
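
As a rough, untested sketch of the same-length simplification mentioned above (the word list, $len, and the %begin/%end names are mine, not part of the answer):

use strict;
use warnings;

my @words = qw(AAAA AAAB AABA BBBB CCCC);   # stand-in data; every word has the same length
my $len   = length $words[0];

my (%begin, %end);
for my $word (@words) {
    push @{ $begin{ substr($word, 0, $len/2) } }, $word;   # bucket by first half
    push @{ $end{   substr($word, $len/2)    } }, $word;   # bucket by last half
}

for my $word (@words) {
    # A one-substitution neighbour shares either its first half or its last half with $word.
    my @match = grep { 1 == ($word ^ $_) =~ tr/\0//c }
        @{ $begin{ substr($word, 0, $len/2) } },
        @{ $end{   substr($word, $len/2)    } };
    print @match ? "$word - " . join(' ', sort @match) . "\n" : "$word\n";
}

On the sample data this prints the same pairs as the expected output in the question.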

Miller (3 votes)

It requires a large memory footprint, but the following can accomplish your task in two passes. The idea is to blank out each character position in turn, giving every word a set of wildcard keys; two words that differ at exactly one position share one of those keys:

#!/usr/bin/env perl

use strict;
use warnings;

use Fcntl qw(:seek);

my $fh = \*DATA;

my $startpos = tell $fh;

my %group;

# First pass: blank out each position of each word in turn and group the word
# (by reference) under every resulting wildcard key.
while (<$fh>) {
    chomp;

    my $word = $_;

    for my $i ( 0 .. length($word) - 1 ) {
        substr my $star = $word, $i, 1, "\0";   # copy of $word with position $i blanked
        push @{ $group{$star} }, \$word;
    }
}

seek $fh, $startpos, SEEK_SET;

# Second pass: a word's one-substitution neighbours are exactly the other words
# found under any of its wildcard keys.
while (<$fh>) {
    chomp;

    my %uniq;

    my $word = $_;

    for my $i ( 0 .. length($word) - 1 ) {
        substr my $star = $word, $i, 1, "\0";
        $uniq{$_}++ for map $$_, @{ $group{$star} };
    }

    delete $uniq{$word};    # the word always matches itself; drop it from its own results

    print "$word - ", join(' ', sort keys %uniq), "\n";
}

__END__
AAAA
AAAB
AABA
BBBB
CCCC

Outputs:

AAAA - AAAB AABA
AAAB - AAAA
AABA - AAAA
BBBB - 
CCCC -
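
A small illustration of the wildcard keys (my own example, not part of the answer; the '*' is only for display, the code above uses a "\0" byte):

use strict;
use warnings;

my $word = 'AAAB';
for my $i ( 0 .. length($word) - 1 ) {
    substr my $star = $word, $i, 1, "\0";   # copy of $word with position $i blanked
    ( my $shown = $star ) =~ tr/\0/*/;
    print "$shown\n";                       # *AAB, A*AB, AA*B, AAA*
}

'AAAA' produces the key AAA* as well, which is why it ends up grouped with 'AAAB'.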