Calculate cosine similarity of all possible text pairs retrieved from 4 mysql tables

Asked by eoe on 06 January 2017 at 11:12

I have 4 tables with the schema (app, text_id, title, text). I'd like to compute the cosine similarity between all possible text pairs (title and text concatenated) and eventually store the results in a CSV file with the fields (app1, app2, text_id1, text1, text_id2, text2, cosine_similarity).

Since there are a lot of possible combinations, it should run quite efficiently. What is the most common approach here? I'd appreciate any pointers.

Edit: Although the provided reference might touch on my problem, I still can't figure out how to approach this. Could someone provide more details on a strategy to accomplish this task? In addition to the calculated cosine similarity, I also need the corresponding text pairs in the output.

**tttthomasssss** · Accepted Answer · 2017-01-07T23:45:26+00:00

The following is a minimal example to calculate the pairwise cosine similarities between a set of documents (assuming you have successfully retrieved the title and text from your database).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assume that's the data we have (4 short documents)
data = [
    'I like beer and pizza',
    'I love pizza and pasta',
    'I prefer wine over beer',
    'Thou shalt not pass'
]

# Vectorise the data.
# `X` will be a TF-IDF representation of the data; the first row of `X`
# corresponds to the first sentence in `data`.
vec = TfidfVectorizer()
X = vec.fit_transform(data)

# Calculate the pairwise cosine similarities
# (depending on the amount of data you have, this could take a while)
S = cosine_similarity(X)

'''
S looks as follows:
array([[ 1.        ,  0.4078538 ,  0.19297924,  0.        ],
       [ 0.4078538 ,  1.        ,  0.        ,  0.        ],
       [ 0.19297924,  0.        ,  1.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  1.        ]])

The first row of `S` contains the cosine similarities of the first element in `X` to every element in `X` (including itself).
For example, the cosine similarity of the first sentence to the third sentence is ~0.193.
The similarity of every sentence/document to itself is 1 (hence the diagonal of the similarity matrix is all ones).
Given that all indices are consistent, it is straightforward to map each similarity back to its pair of sentences.
'''
```
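
To get from the similarity matrix to the CSV layout asked for in the question, one option is to keep a list of metadata in the same order as the documents and walk over the upper triangle of `S`. The following is only a sketch under some assumptions: the `records` list, the app names, the text ids, and the helper names `iter_pairs` and `write_pairs` are made up for illustration, and `S` is hard-coded from the output shown above so the snippet runs without scikit-learn. With a real NumPy array returned by `cosine_similarity`, the `S[i][j]` indexing works the same way.

```python
import csv
import io
from itertools import combinations

# Hypothetical metadata, one entry per row of the similarity matrix, in the
# same order as the documents passed to the vectoriser: (app, text_id, text).
records = [
    ('app_a', 1, 'I like beer and pizza'),
    ('app_a', 2, 'I love pizza and pasta'),
    ('app_b', 1, 'I prefer wine over beer'),
    ('app_b', 2, 'Thou shalt not pass'),
]

# Similarity matrix copied from the example output above.
S = [
    [1.0, 0.4078538, 0.19297924, 0.0],
    [0.4078538, 1.0, 0.0, 0.0],
    [0.19297924, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]


def iter_pairs(records, S):
    """Yield one output row per unordered pair (i < j), skipping self-pairs."""
    for i, j in combinations(range(len(records)), 2):
        app1, id1, text1 = records[i]
        app2, id2, text2 = records[j]
        yield (app1, app2, id1, text1, id2, text2, S[i][j])


def write_pairs(fh, records, S):
    """Write the header and all pair rows to an open text file handle."""
    writer = csv.writer(fh)
    writer.writerow(['app1', 'app2', 'text_id1', 'text1',
                     'text_id2', 'text2', 'cosine_similarity'])
    writer.writerows(iter_pairs(records, S))


# Write to an in-memory buffer here; for a real file, something like
# open('pairs.csv', 'w', newline='') could be used instead.
buf = io.StringIO()
write_pairs(buf, records, S)
print(buf.getvalue())
```

Because the matrix is symmetric, visiting only the pairs with `i < j` halves the number of rows and leaves out the trivial self-similarities on the diagonal. For n documents this still produces n(n-1)/2 rows, so for large tables it may be worth writing only pairs above some similarity threshold.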
