I am working on a problem where I need to find exact or similar sentences in two or more documents. I read a lot about cosine similarity and how it can be used to detect similar text.
Here is the code that I tried:
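# read the test document and split it into sentences on "."
with open("test.txt", "r") as my_file:
    content = my_file.read()
content_list = content.split(".")
print("test:" + str(content_list))
# read the original document as one string
with open("original.txt", "r") as my_file:
    og = my_file.read()
print("og:" + og)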
Output:
test:['As machines become increasingly capable', ' tasks considered to require "intelligence" are often removed from the definition of AI,']
og:AI applications include advanced web search engines (e.g., Google), recommendation systems (used by YouTube, Amazon and Netflix), understanding human speech (such as Siri and Alexa), self-driving cars (e.g., Tesla), automated decision-making and competing at the highest level in strategic game systems (such as chess and Go).[2][citation needed] As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect.[3] For instance, optical character recognition is frequently excluded from things considered to be AI,[4] having become a routine technology.
But when I compute the cosine similarity with this code:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
def compute_cosine_similarity(text1, text2):
    # stores text in a list
    list_text = [text1, text2]
    # converts text into vectors with the TF-IDF
    vectorizer = TfidfVectorizer(stop_words='english')
    vectorizer.fit_transform(list_text)
    tfidf_text1, tfidf_text2 = (
        vectorizer.transform([list_text[0]]),
        vectorizer.transform([list_text[1]]),
    )
    # computes the cosine similarity
    cs_score = cosine_similarity(tfidf_text1, tfidf_text2)
    return np.round(cs_score[0][0], 2)
# compare each test sentence with the whole original document
for i in content_list:
    cosine_similarity12 = compute_cosine_similarity(i, og)
    print('The cosine similarity of sentence 1 and 2 is {}.'.format(cosine_similarity12))
The output I am getting is:
The cosine similarity of sentence and og is 0.14.
The cosine similarity of sentence and og is 0.4.
I split the test text on '.' and compared each resulting sentence with the whole original document, but the cosine similarity scores are not what I expected. The score is low even when the test sentences are copied verbatim from a larger paragraph, which worries me. I need to know what I am doing wrong and how I can find similar (or exact) sentences in the original document for plagiarism checking. The requirement is that I can point out which sentences in the original document match. I also thought of comparing every sentence of the two documents (test, og) against each other, but that would greatly increase the complexity.
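For reference, here is a rough sketch of the sentence-by-sentence comparison I have in mind, in case it helps show where I am going wrong. It is not tested on my real files. The helper names `split_sentences` and `best_matches`, the regex-based splitter, and the `threshold=0.8` cutoff are all my own assumptions for illustration, not tuned values.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def split_sentences(text):
    # Naive splitter (assumption): break after . ! or ?, skipping any
    # bracketed citation markers such as [2] that directly follow.
    parts = re.split(r"(?<=[.!?])(?:\[[^\]]*\])*\s+", text)
    return [p.strip() for p in parts if p.strip()]


def best_matches(test_text, original_text, threshold=0.8):
    # threshold is a hypothetical cutoff for calling two sentences "similar"
    test_sentences = split_sentences(test_text)
    original_sentences = split_sentences(original_text)

    # Fit TF-IDF once on all sentences so both sides share one vocabulary
    vectorizer = TfidfVectorizer(stop_words="english")
    vectorizer.fit(test_sentences + original_sentences)
    test_vecs = vectorizer.transform(test_sentences)
    original_vecs = vectorizer.transform(original_sentences)

    # scores[i][j] = similarity of test sentence i and original sentence j
    scores = cosine_similarity(test_vecs, original_vecs)

    matches = []
    for i, row in enumerate(scores):
        j = row.argmax()  # closest original sentence for test sentence i
        if row[j] >= threshold:
            matches.append(
                (test_sentences[i], original_sentences[j], round(float(row[j]), 2))
            )
    return matches
```

Calling `best_matches(content, og)` would return a list of `(test sentence, original sentence, score)` tuples. All pairwise scores come from a single `cosine_similarity` call on the two sentence matrices, so the cost grows with the number of test sentences times the number of original sentences.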