Highlight similar sentences in two documents instead of just displaying a similarity score


I am working on a problem where I need to find exact or similar sentences in two or more documents. I read a lot about cosine similarity and how it can be used to detect similar text.

Here is the code that I tried:

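# read the test document and split it into rough sentences on "."
with open("test.txt", "r") as my_file:
    content = my_file.read()

content_list = content.split(".")
print(f"test:{content_list}")

# read the original document as a single string
with open("original.txt", "r") as my_file:
    og = my_file.read()
print(f"og:{og}")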

Output

test:['As machines become increasingly capable', ' tasks considered to require "intelligence" are often removed from the definition of AI,']

og:AI applications include advanced web search engines (e.g., Google), recommendation systems (used by YouTube, Amazon and Netflix), understanding human speech (such as Siri and Alexa), self-driving cars (e.g., Tesla), automated decision-making and competing at the highest level in strategic game systems (such as chess and Go).[2][citation needed] As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect.[3] For instance, optical character recognition is frequently excluded from things considered to be AI,[4] having become a routine technology.

But when I compute the cosine similarity with the following code:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity



def compute_cosine_similarity(text1, text2):

    # stores both texts in a list
    list_text = [text1, text2]

    # fits TF-IDF on both texts and converts them into vectors
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf = vectorizer.fit_transform(list_text)
    tfidf_text1, tfidf_text2 = tfidf[0], tfidf[1]

    # computes the cosine similarity between the two TF-IDF vectors
    cs_score = cosine_similarity(tfidf_text1, tfidf_text2)

    return np.round(cs_score[0][0], 2)



# compares each test sentence with the whole original document
for i in content_list:
    cosine_similarity12 = compute_cosine_similarity(i, og)
    print('The cosine similarity of sentence 1 and 2 is {}.'.format(cosine_similarity12))

The output I am getting is:

The cosine similarity of sentence and og is 0.14.
The cosine similarity of sentence and og is 0.4.

I split the test document on '.' and compared each resulting sentence with the entire original document, but the cosine similarity scores are not what I expected. What am I doing wrong, and how can I retrieve the matching sentences from the original document for plagiarism checking? The requirement is to point out the similar (or exact) sentences in the original document. I also thought of comparing each sentence of the two documents (test, og) against each other, but that would greatly increase the complexity. I am worried because cosine similarity does not give a high score even when the test text consists of exact sentences copied from a larger paragraph. I would really appreciate help understanding what I am doing wrong.
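
For reference, here is a minimal sketch of the sentence-by-sentence comparison I have in mind. It is not a tested solution. To keep it dependency-free it swaps TF-IDF for plain term counts, and the names `split_sentences`, `term_counts`, `cosine`, `find_similar_sentences`, and the `threshold` value of 0.8 are all my own assumptions. It also assumes that bracketed markers such as [2] or [citation needed] can simply be dropped before splitting.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Split text into sentences (hypothetical helper, rough heuristic)."""
    # Assumption: bracketed markers like [2] carry no content and can be removed.
    text = re.sub(r"\[[^\]]*\]", "", text)
    # Split after ., ! or ? when followed by whitespace; drop empty pieces.
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def term_counts(sentence):
    """Lowercase the sentence and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_similar_sentences(test_text, original_text, threshold=0.8):
    """Return (test sentence, best original sentence, score) for close matches."""
    originals = [(s, term_counts(s)) for s in split_sentences(original_text)]
    matches = []
    for sent in split_sentences(test_text):
        vec = term_counts(sent)
        best_sentence, best_score = None, 0.0
        # Compare against every original sentence and keep the best one.
        for orig_sentence, orig_vec in originals:
            score = cosine(vec, orig_vec)
            if score > best_score:
                best_sentence, best_score = orig_sentence, score
        if best_sentence is not None and best_score >= threshold:
            matches.append((sent, best_sentence, round(best_score, 2)))
    return matches
```

Calling `find_similar_sentences(content, og)` would return a list of tuples, each holding a test sentence, the closest sentence from the original document, and their score, so the matching sentences themselves can be highlighted. Every test sentence is compared with every original sentence, which is the extra cost I mentioned above.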
