I am looking for the fastest way to find every single character mis-match between every word in a large file. If I have this:
AAAA
AAAB
AABA
BBBB
CCCC
I would like to get something like this:
AAAA - AAAB AABA
AAAB - AAAA
AABA - AAAA
BBBB
CCCC
Currently I am using agrep but as my file is millions of lines long and it is very slow. Each word is on its own line and they are all the same number of characters. I expect there is something elegant I have not been able to find. thank you
Edit: The words are made up of just 5 characters, A T C G or N and they are just under 100 characters long. The whole thing should fit in memory (<5GB). There is one word per line and I want to compare it to every other word.
Edit2: The example was not correct It is fixed now.
If you are looking for words that have only a one-character difference, there are a couple of tricks you can use. First, to compare two words and count the number of characters different, you use this:
This does a stringwise exclusive or on the two words; wherever the characters are the same, a "\0" will result; where they are not the same, a non-"\0" will result. tr, in its complement counting mode, counts the differences.
Second, noting that either the first half or the last half of the word must match exactly, partition the words into a hash by their first and last halves, reducing the number of other words you need to check a given word against.
This approach should only two or three times the memory of all the strings (plus a little overhead); it could be reduced to one to two times the memory by pushing
\$word
and using$$_
in the grep and sort map $$_, @match in the output at the cost of some speed.If the words are all the same length, the top level of the hash can be removed and two different hashes used for the words' beginnings and endings.
Note that this only looks for substitutions, not insertions, deletions, or transpositions.