This is the problem:
- I have two columns in my matadata database "field name" and "field description"
- I need to check if the "field description" is actually a description and not some sort of transformation of the "field name"
- [Edit] I need to avoid preprocessing the text to remove separators, as I would have to consider a long list of cases (e.g. _-;$%/^| etc.)
Examples:
row | field_name | field_description |
---|---|---|
1 | my_first_field | my first field |
2 | my_second_field | my------second------field |
3 | my_third_field | this is a description about the field, the descriprion can contain the name of the field itself |
Where the examples 1st and 2nd are similars (thus wrong) and the 3rd is correct.
I have tried some implementations based on Leveinshtein Distance, difflib, Cosine Similarity and an implementation called spaCy but none of them was robust with my examples (throwing only around 50% of similarity rate with the 1st example).
Some of the implementations I tried to use:
- https://towardsdatascience.com/surprisingly-effective-way-to-name-matching-in-python-1a67328e670e
- https://spacy.io/usage/linguistic-features#vectors-similarity
- https://docs.python.org/3/library/difflib.html
- is there a way to check similarity between two full sentences in python?
[Edit]
I have just tried the implementation of HuggingFace semantic-textual-similarity with nice results.
field_name | field_description | Score |
---|---|---|
my_field_name | my_field_name | 1.0000 |
second_field_name | second field name | 0.8483 |
third_field_name | third-field-name | 0.8717 |
fourth_field_name | this is a correct description field | 0.4591 |
fifth_field_name | fifth_-------field_//////////////name | 0.8454 |
For your examples, the Levenshtein edit distance would work very well. It can also be 'customized', or you could use some preprocessing depending on your data.
But your text description of the problem makes me think that the real problem is likely much more complex, and maybe not even easy to define formally. It looks like you actually need a more semantic method, and this would probably require training a model with annotated data.