How to detect if two sentences are similar, not in meaning, but in syllables/words?


Here are some examples of the types of sentences that need to be considered "similar":

there was a most extraordinary noise going on shrinking rapidly she soon made out
there was a most extraordinary noise going on shrinking rapid
that will be a very little alice knew it was just possible it had
thou wilt be very little alice i knew it was possible to add
however at last it sat down and looked very anxiously into her face and
however that lives in sadtown and look very anxiously into him facing it
she went in search of her or of anything to say she simply bowed
she went in the search of her own or of anything to say
and she squeezed herself up on tiptoe and peeped over the wig he did
and she squeezed herself up on the tiptoe and peeped over her wig he did
she had not noticed before and behind it was very glad to find that
she had not noticed before and behind it it was very glad to find that
as soon as the soldiers had to fall a long hookah and taking not
soon as the soldiers have to fall along huka and taking knots

And here are some examples of more difficult edge cases that I would like to be able to catch, but that are not as necessary:

so she tucked it under her arm with its head it would not join
she tucked it under her arm with its head
let me see four times five is twelve and four times five is twelve 
let me see  times  is  and  times  is
let me see four times seven is oh dear run home this moment and 
times  is o dear run home this moment and
in a minute or two she walked sadly down the middle being held up 
and then well see you sidely down the middle in health often

Sentences that are somewhat different and have no such similarities need to be marked as dissimilar. If there is an algorithm that exists that outputs a "score" versus just a boolean similar or not, I could determine what threshold would be necessary through my own testing.

The top sentence in each example is randomly generated; the bottom sentence is the output of a speech-to-text neural network, from an audio file of someone reading out the top line. If there is some syllabic comparison method that would be much more accurate given that I have the initial source text as well as the audio, I could also employ that instead of this word comparison technique.

My current method involves indexing each word, once forwards, and once reverse, and then checking how many words line up. If at least 10 words match in either indexing order, I count the sentences as similar. However, all of the presented examples are cases where this strategy does not work.
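
One plausible reading of this positional check can be sketched as follows. The function names, the whitespace tokenization, and the exact alignment rule are assumptions made for illustration, not the original implementation; only the threshold of 10 comes from the description above.

```python
def positional_matches(words1, words2):
    """Count positions at which both word lists hold the same word."""
    return sum(1 for w1, w2 in zip(words1, words2) if w1 == w2)


def is_similar_positional(sentence1, sentence2, min_matches=10):
    """Compare words index by index, once from the start and once from the end."""
    words1, words2 = sentence1.split(), sentence2.split()
    forward = positional_matches(words1, words2)
    backward = positional_matches(words1[::-1], words2[::-1])
    return max(forward, backward) >= min_matches


# A single inserted word ("the") shifts every later position, so the
# forward count stops at the insertion and the backward count stops at
# the next difference ("her" vs "the").
source = "and she squeezed herself up on tiptoe and peeped over the wig he did"
heard = "and she squeezed herself up on the tiptoe and peeped over her wig he did"
print(is_similar_positional(source, heard))  # False
```

Under this reading, one insertion or deletion near the middle of the sentence is enough to keep both counts below the threshold, which matches the failures listed above.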


There is 1 answer

Ravindu On

One way to approach this (although it might not be the best way) is to first vectorize the two sentences, i.e. count how often each word or short word sequence (n-gram) occurs, which gives you a numeric vector for each sentence. Then compare those two vectors for similarity. The result is a score rather than a boolean, so you can pick a threshold through your own testing.

Code-wise, you can do the following in Python.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def sentence_similarity(sentence1, sentence2):
    # Tokenize sentences into word n-grams (single words up to 3-word sequences)
    ngram_range = (1, 3)  # range can be adjusted
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    vectors = vectorizer.fit_transform([sentence1, sentence2])

    # Check for similarity using cosine similarity, i.e. the dot product of
    # the two count vectors divided by the product of their lengths
    similarity_matrix = cosine_similarity(vectors)
    similarity_score = similarity_matrix[0, 1]

    # With count vectors the score lies between 0 (no shared n-grams) and 1
    return similarity_score

Please note that you need to have scikit-learn installed in order to perform the above imports. You can do that by executing the following command in a command prompt or terminal.

   pip install scikit-learn
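
For intuition about what the n-gram counting and cosine similarity compute, the same idea can be sketched with only the standard library. This is an illustrative sketch, not a drop-in replacement: the names `ngram_counts` and `cosine_score` are made up here, and it splits on whitespace, so it keeps one-letter words such as "a" and "i", which `CountVectorizer` drops with its default token pattern. Scores will therefore differ slightly from the scikit-learn version.

```python
from collections import Counter
from math import sqrt


def ngram_counts(sentence, max_n=3):
    """Count every word n-gram of length 1..max_n in the sentence."""
    words = sentence.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def cosine_score(sentence1, sentence2, max_n=3):
    """Cosine similarity between the n-gram count vectors of two sentences."""
    a = ngram_counts(sentence1, max_n)
    b = ngram_counts(sentence2, max_n)
    # Dot product over the n-grams of the first sentence; a Counter
    # returns 0 for n-grams the second sentence does not contain.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence shares nothing with anything
    return dot / (norm_a * norm_b)


source = "there was a most extraordinary noise going on shrinking rapidly she soon made out"
heard = "there was a most extraordinary noise going on shrinking rapid"
unrelated = "let me see four times five is twelve"

print(cosine_score(source, heard))      # high: the first nine words are shared
print(cosine_score(source, unrelated))  # 0.0: no words in common
```

Because the score is built from shared n-grams rather than word positions, an inserted or dropped word only removes the few n-grams that span it instead of misaligning the rest of the sentence.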