perl custom sort by string similarity clustering


In Perl, I would like to sort a collection of strings of different lengths in a way that automatically lumps similar strings together.

Intuitively, I imagine I need some distance measure for each pair and then a clustering routine that groups by the distance.

The number of strings is always small and they are short; see the example below.

Is there a simple approach that will do what I need in sort_magic_here?

#!/usr/bin/perl
use strict;

my @list =
  ("JK_HJ_Lancaster", "SY4_TS_HJ_1000ng",
   "NB_E_200cc_caHJ_Rep1", "HB_E_100cc_caHJ_Rep1",
   "HB_E_200cc_caHJ_Rep1", "Normal_Lancaster",
   "NB15_OP_HJ_1000ng","Zoey_HJ_Slough",
   "NB_E_100cc_caHJ_Rep1","Normal_Slough",
   "JK_caHJ_Slough","Zoey_HJ_Lancaster");

print "# Straight sort\n";
foreach my $elem (sort @list) {
  print "$elem\n";
}

print "# Sort grouped by string distance\n";
foreach my $elem (sort { sort_magic_here() }  @list) {
  print "$elem\n";
}

There is 1 answer

Answered by Sobrique (accepted):

Custom sorts take two inputs, perform a comparison, and return -1, 0 or 1 depending on whether the first argument sorts before, the same as, or after the second.
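
For instance, an ordinary comparator that orders by string length looks like this (just to illustrate the two-input, -1/0/1 contract; it does nothing about similarity):

my @by_length = sort { length($a) <=> length($b) } @list;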

Sorting is designed for producing a positional order, not really for 'grouping stuff that's vaguely similar'.

The Text::Levenshtein module lets you compute that distance quickly, but you have to do something altogether more complicated: you'd need to compare each word to every other word before you could decide on an ordering. Frankly, you will have the same problem with any 'similar words' sort of comparison.
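
For example, a single pairwise distance on two of the sample strings is just:

use Text::Levenshtein qw( distance );

my $d = distance( "Normal_Slough", "Normal_Lancaster" );
print "$d\n";    # smaller numbers mean more similar strings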

At that point you're starting to look at graph theory and grouping based on it. That's quite a complicated problem, though; it's far from as trivial as 'just' sorting.

I'd be looking at something like:

#!/usr/bin/perl
use strict;
use warnings;

use Text::Levenshtein qw ( distance );
use Data::Dumper;

my @list = (
    "JK_HJ_Lancaster",      "SY4_TS_HJ_1000ng",
    "NB_E_200cc_caHJ_Rep1", "HB_E_100cc_caHJ_Rep1",
    "HB_E_200cc_caHJ_Rep1", "Normal_Lancaster",
    "NB15_OP_HJ_1000ng",    "Zoey_HJ_Slough",
    "NB_E_100cc_caHJ_Rep1", "Normal_Slough",
    "JK_caHJ_Slough",       "Zoey_HJ_Lancaster"
);

# Build a full pairwise distance matrix, keyed by string.
my %distances;

foreach my $elem (@list) {
    foreach my $compare (@list) {
        next if $elem eq $compare;
        my $distance = distance( $elem, $compare );
        $distances{$elem}{$compare} = $distance;
    }
}

print Dumper \%distances;

# Greedy walk: start from the alphabetically first string, then always
# hop to the nearest string that hasn't been printed yet.
my %seen;
my ($cursor) = sort @list;

while ($cursor) {
    print "$cursor\n";
    $seen{$cursor}++;

    # All other strings, closest to $cursor first.
    my @near_words_in_order =
        sort { $distances{$cursor}{$a} <=> $distances{$cursor}{$b} }
        keys %{ $distances{$cursor} };

    last unless @near_words_in_order;

    # Skip anything already printed; $cursor becomes 0 (false) when exhausted.
    while ( $seen{$cursor} ) {
        $cursor = shift(@near_words_in_order) // 0;
    }
}

Which gives the result:

HB_E_100cc_caHJ_Rep1
HB_E_200cc_caHJ_Rep1
NB_E_200cc_caHJ_Rep1
NB_E_100cc_caHJ_Rep1
NB15_OP_HJ_1000ng
SY4_TS_HJ_1000ng
Zoey_HJ_Slough
JK_caHJ_Slough
Normal_Slough
Normal_Lancaster
JK_HJ_Lancaster
Zoey_HJ_Lancaster

Which at least approximately groups the strings as you request. You can probably make this more efficient, because you don't need to compute all the distances, which will reduce the algorithmic complexity. But you will also get different groupings depending on the proximity measure and the starting point.
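
One easy saving, as a sketch: the Levenshtein distance is symmetric, so you can fill the same %distances structure while computing each pair only once (assuming the @list and %distances declared above), which roughly halves the work:

for my $i ( 0 .. $#list ) {
    for my $j ( $i + 1 .. $#list ) {
        my $d = distance( $list[$i], $list[$j] );
        $distances{ $list[$i] }{ $list[$j] } = $d;
        $distances{ $list[$j] }{ $list[$i] } = $d;    # mirror the symmetric pair
    }
}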