Deciding whether two texts or sentences are equivalent in content


A classic way to measure text similarity as a distance is Word Mover's Distance (WMD); see for example https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html. That tutorial uses a word2vec model loaded from GoogleNews-vectors-negative300.bin and these sentences:

D1 = "Obama speaks to the media in Illinois"
D2 = "The president greets the press in Chicago"
D3 = "Oranges are my favorite fruit"

The computed WMD values are distance(D1, D2) = 3.3741 and distance(D1, D3) = 4.3802, so D1 and D2 are more similar to each other than D1 and D3 are.

What threshold on the WMD value should be used to decide that two sentences actually contain almost the same information? Perhaps for D1 and D2 the value 3.3741 is already too large, and in reality these sentences are different? Such decisions need to be made, for example, when there is a question, a sample correct answer, and a student's answer.

Addition after the answer by gojomo: let's postpone the identification and automatic understanding of logic for later. Consider the case where each of two sentences enumerates objects, or the properties and actions of a single object, in affirmative form (no negation), and we need to evaluate how similar the content of the two sentences is.
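For reference, here is a minimal sketch of how distances like these can be computed with gensim. The helper names (`preprocess`, `load_model`, `wmd`) and the tiny `STOP_WORDS` set are assumptions made for illustration; the tutorial itself uses NLTK's English stop-word list. The exact numbers depend on preprocessing, the gensim version, and whether the vectors are normalized, so this is not guaranteed to reproduce 3.3741 and 4.3802.

```python
import re

# Assumption: a small illustrative stop-word set, not the NLTK list used in the tutorial.
STOP_WORDS = {"the", "to", "in", "are", "my", "a", "an", "of", "is"}


def preprocess(sentence):
    """Lowercase, split into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def load_model(path="GoogleNews-vectors-negative300.bin"):
    """Load pretrained word2vec vectors; needs gensim and the model file on disk."""
    # Imported lazily so the helpers above work without gensim installed.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True)


def wmd(model, text_a, text_b):
    """Word Mover's Distance between two raw sentences.

    `model` is any object exposing gensim's `wmdistance(tokens_a, tokens_b)`.
    """
    return model.wmdistance(preprocess(text_a), preprocess(text_b))


# Usage (requires the model file and gensim's optimal-transport dependency):
#   model = load_model()
#   print(wmd(model, "Obama speaks to the media in Illinois",
#                    "The president greets the press in Chicago"))
```

The value returned by `wmd` is only meaningful relative to other values produced by the same model and preprocessing.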


There is 1 answer

gojomo

I don't believe there's any absolute threshold that could be used as you wish.

The "Word Mover's Distance" can offer some impressive results in finding highly-similar texts, especially in relative comparison to other candidate texts.
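One way to use such a distance relatively, instead of against a fixed cutoff, is to rank several candidate texts by their distance to a reference text. The sketch below is only an illustration: `rank_by_distance` and `jaccard_distance` are hypothetical helpers, and the Jaccard word-set distance is a toy stand-in so the example runs without a word2vec model. In practice you would pass a function that calls something like gensim's `wmdistance`.

```python
def rank_by_distance(reference, candidates, distance):
    """Return (distance, candidate) pairs sorted from most to least similar."""
    scored = [(distance(reference, c), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0])


def jaccard_distance(text_a, text_b):
    """Toy stand-in distance: 1 minus the word-set overlap ratio. This is not WMD."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


reference = "the president greets the press"
candidates = [
    "oranges are my favorite fruit",
    "the president meets the press",
]
ranked = rank_by_distance(reference, candidates, jaccard_distance)
for dist, text in ranked:
    print(f"{dist:.2f}  {text}")
```

The ordering is the useful output here; the absolute numbers carry little meaning on their own.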

However, its magnitude may be affected by the lengths of the texts, and it has no understanding of grammar or rigorous semantics. Thus subtle negations or contrasts, or statements that would be nonsense to a native speaker, won't register as very "different" from other statements.

For example, the two phrases "Many historians agree Obama is absolutely positively the best President of the 21st century", and "Many historians agree Obama is absolutely positively not the best President of the 21st century", will be noted as incredibly similar by most measures based on word-statistics, such as Word Mover's Distance. Yet, the insertion of one word means they convey somewhat opposite ideas.
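The effect is easy to see even with a much simpler word-statistics measure. The sketch below uses bag-of-words cosine similarity as a stand-in (it is not Word Mover's Distance, and `bow_cosine` is a hypothetical helper), because it needs no pretrained vectors. The two phrases score about 0.97 out of a maximum of 1.0, even though their meanings are opposed.

```python
import math
import re
from collections import Counter


def bow_cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(re.findall(r"[a-z0-9]+", text_a.lower()))
    counts_b = Counter(re.findall(r"[a-z0-9]+", text_b.lower()))
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


s1 = ("Many historians agree Obama is absolutely positively "
      "the best President of the 21st century")
s2 = ("Many historians agree Obama is absolutely positively "
      "not the best President of the 21st century")

print(round(bow_cosine(s1, s2), 3))  # 0.97
```

A single extra token barely moves the score, which is why a numeric cutoff alone cannot separate agreement from contradiction.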