What robust algorithm implementation can I use to perform phrase similarity with two inputs?

Question

What robust algorithm implementation can I use to perform phrase similarity with two inputs?

76 views Asked by emichester At 08 November 2022 at 14:30

This is the problem:

I have two columns in my matadata database "field name" and "field description"
I need to check if the "field description" is actually a description and not some sort of transformation of the "field name"
[Edit] I need to avoid preprocessing the text to remove separators, as I would have to consider a long list of cases (e.g. _-;$%/^| etc.)

Examples:

row	field_name	field_description
1	my_first_field	my first field
2	my_second_field	my------second------field
3	my_third_field	this is a description about the field, the descriprion can contain the name of the field itself

Where the examples 1^st and 2^nd are similars (thus wrong) and the 3^rd is correct.

I have tried some implementations based on Leveinshtein Distance, difflib, Cosine Similarity and an implementation called spaCy but none of them was robust with my examples (throwing only around 50% of similarity rate with the 1^st example).

Some of the implementations I tried to use:

[Edit]

I have just tried the implementation of HuggingFace semantic-textual-similarity with nice results.

field_name	field_description	Score
my_field_name	my_field_name	1.0000
second_field_name	second field name	0.8483
third_field_name	third-field-name	0.8717
fourth_field_name	this is a correct description field	0.4591
fifth_field_name	fifth_-------field_//////////////name	0.8454

Original Q&A

There are 1 answers

**Erwan** · Answer 1 · 2022-11-08T15:46:18+00:00

For your examples, the Levenshtein edit distance would work very well. It can also be 'customized', or you could use some preprocessing depending on your data.

But your text description of the problem makes me think that the real problem is likely much more complex, and maybe not even easy to define formally. It looks like you actually need a more semantic method, and this would probably require training a model with annotated data.

TechQA.

What robust algorithm implementation can I use to perform phrase similarity with two inputs?

There are 1 answers

Related Questions in NLP

Related Questions in STRING-COMPARISON

Related Questions in SIMILARITY

Related Questions in SENTENCE-SIMILARITY

Related Questions in NAME-MATCHING

Popular Questions

Popular Tags

Trending Questions