What robust algorithm implementation can I use to perform phrase similarity with two inputs?


This is the problem:

  • I have two columns in my metadata database: "field name" and "field description"
  • I need to check if the "field description" is actually a description and not some sort of transformation of the "field name"
  • [Edit] I need to avoid preprocessing the text to remove separators, as I would have to consider a long list of cases (e.g. _-;$%/^| etc.)

Examples:

row  field_name       field_description
1    my_first_field   my first field
2    my_second_field  my------second------field
3    my_third_field   this is a description about the field, the description can contain the name of the field itself

In rows 1 and 2 the description is just a variant of the field name (thus wrong), while the description in row 3 is correct.

I have tried some implementations based on Levenshtein distance, difflib, cosine similarity, and spaCy, but none of them was robust on my examples (giving a similarity score of only around 50% for the 1st example).

Some of the implementations I tried to use:

[Edit]

I have just tried the implementation of HuggingFace semantic-textual-similarity with nice results.

field_name         field_description                      Score
my_field_name      my_field_name                          1.0000
second_field_name  second field name                      0.8483
third_field_name   third-field-name                       0.8717
fourth_field_name  this is a correct description field    0.4591
fifth_field_name   fifth_-------field_//////////////name  0.8454

There is 1 answer

Erwan On

For your examples, the Levenshtein edit distance would work very well. It can also be 'customized', or you could use some preprocessing depending on your data.
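As a rough illustration of what a "customized" edit distance could look like, here is a minimal sketch in plain Python. It is only one possible reading of the idea, not a reference implementation: the cost scheme (separators are free to insert, delete, or swap for each other), the case-insensitive comparison, and the helper names `is_separator`, `weighted_levenshtein`, and `similarity` are all assumptions made for this example. Treating every non-alphanumeric character as a separator avoids maintaining an explicit list of separator characters.

```python
def is_separator(ch: str) -> bool:
    # Assumption: anything that is not a letter or digit counts as a separator.
    return not ch.isalnum()


def weighted_levenshtein(a: str, b: str) -> int:
    """Edit distance where separator characters cost nothing.

    Inserting or deleting a separator is free, and so is replacing one
    separator with another. All other edits cost 1, as in the standard
    Levenshtein distance. The comparison is case-insensitive (an assumption).
    """
    a, b = a.lower(), b.lower()

    # First DP row: cost of building each prefix of b from the empty string.
    prev = [0] * (len(b) + 1)
    for j in range(1, len(b) + 1):
        prev[j] = prev[j - 1] + (0 if is_separator(b[j - 1]) else 1)

    for i in range(1, len(a) + 1):
        del_cost = 0 if is_separator(a[i - 1]) else 1
        curr = [prev[0] + del_cost]
        for j in range(1, len(b) + 1):
            ins_cost = 0 if is_separator(b[j - 1]) else 1
            same = a[i - 1] == b[j - 1]
            both_sep = is_separator(a[i - 1]) and is_separator(b[j - 1])
            sub_cost = 0 if (same or both_sep) else 1
            curr.append(min(prev[j] + del_cost,        # delete from a
                            curr[j - 1] + ins_cost,    # insert from b
                            prev[j - 1] + sub_cost))   # substitute / match
        prev = curr
    return prev[-1]


def similarity(name: str, description: str) -> float:
    """Score in [0, 1]; 1.0 means identical once separators are ignored."""
    longest = max(sum(c.isalnum() for c in name),
                  sum(c.isalnum() for c in description))
    if longest == 0:
        return 1.0
    return 1.0 - weighted_levenshtein(name, description) / longest


if __name__ == "__main__":
    rows = [
        ("my_first_field", "my first field"),
        ("my_second_field", "my------second------field"),
        ("my_third_field", "this is a description about the field"),
    ]
    for name, description in rows:
        print(f"{name!r} vs {description!r}: {similarity(name, description):.4f}")
```

With this cost scheme the first two example rows score as identical to their field names, while a real description scores low simply because it contains many more letters than the name. Any cut-off used to flag a description as a "transformation of the name" would still need to be tuned on real data.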

But your text description of the problem makes me think that the real problem is likely much more complex, and maybe not even easy to define formally. It looks like you actually need a more semantic method, and this would probably require training a model with annotated data.