In Perl, I would like to sort a collection of different length strings in a way that automatically lumps together similar strings.
Intuitively, I imagine I need some distance measure for each pair and then a clustering routine that groups by the distance.
My number of strings is always small and they are short, see an example below.
Is there a simple approach that will do what I need in
sort_magic_here
?
#!/usr/bin/perl
use strict;
my @list =
("JK_HJ_Lancaster", "SY4_TS_HJ_1000ng",
"NB_E_200cc_caHJ_Rep1", "HB_E_100cc_caHJ_Rep1",
"HB_E_200cc_caHJ_Rep1", "Normal_Lancaster",
"NB15_OP_HJ_1000ng","Zoey_HJ_Slough",
"NB_E_100cc_caHJ_Rep1","Normal_Slough",
"JK_caHJ_Slough","Zoey_HJ_Lancaster");
print "# Straight sort\n";
foreach my $elem (sort @list) {
print "$elem\n";
}
print "# Sort grouped by string distance\n";
foreach my $elem (sort { sort_magic_here() } @list) {
print "$elem\n";
}
Custom sorts take two inputs, perform a 'comparison' and respond with -1, 0 or 1 depending on whether they are before, after or equal.
Sorting is designed for making a positional order, not really for 'grouping stuff that's vaguely similar'.
You do have the
Text::Levenshtein
module which quickly lets you compute that - but you have to do something altogether more complicated because you'd need to compare each word to each other word before being able to decide ordering. But frankly, you will have the same problem with any 'similar words' sort of comparison.In this, you're starting to look at graph theory and grouping based on that. It's quite a complicated problem though - it's far from as trivial as 'just' sorting.
I'd be looking at something like:
Which gives the result:
Which at least approximately groups like you request. You can probably get this more efficient, because you don't need to compute all the distances which'll reduce the algorithm complexity. But you also will get different groups based on proximity and start point.