Implementation of TextRank algorithm using Spark (calculating cosine similarity matrix using Spark)

Question

Asked by Jayesh Dubey on 20 July 2020 at 04:39 · 587 views

I am trying to implement the TextRank algorithm, which involves calculating a cosine similarity matrix over all the sentences. I want to parallelize the creation of the similarity matrix using Spark, but I don't know how to implement it. Here is the code:

    import numpy as np
    import networkx as nx
    from sklearn.metrics.pairwise import cosine_similarity
    from tqdm import tqdm

    cluster_summary_dict = {}
    for cluster, sentences in tqdm(cluster_wise_sen.items()):
        vectors = cluster_dict[cluster]  # 100-d sentence vectors for this cluster
        n = len(sentences)
        sen_sim_matrix = np.zeros([n, n])
        # Pairwise cosine similarity; the diagonal stays 0 (no self-loops)
        for row in range(n):
            for col in range(n):
                if row != col:
                    sen_sim_matrix[row][col] = cosine_similarity(
                        vectors[row].reshape(1, 100), vectors[col].reshape(1, 100)
                    )[0, 0]
        # Rank the sentences with PageRank over the similarity graph
        sentence_graph = nx.from_numpy_array(sen_sim_matrix)
        scores = nx.pagerank(sentence_graph)
        pagerank_sentences = sorted(
            ((scores[k], sent) for k, sent in enumerate(sentences)), reverse=True
        )
        cluster_summary_dict[cluster] = pagerank_sentences
Here, cluster_wise_sen is a dictionary that contains the list of sentences for each cluster ({'cluster 1': [list of sentences], ..., 'cluster n': [list of sentences]}). cluster_dict contains the 100-dimensional vector representations of the sentences. I have to compute the sentence similarity matrix for each cluster. Since this is time-consuming, I am looking to parallelize it using Spark.

Original Q&A

There is 1 answer

**Abhinav** · Accepted Answer · 2020-07-20T16:24:48+00:00

Experiments with large-scale matrix calculation for cosine similarity are well written up here!
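Before moving to a cluster, it may be worth checking how far a vectorized version gets you: the double loop in the question can be replaced by one matrix product per cluster. The following is a minimal sketch, assuming NumPy only; the function name `cosine_similarity_matrix` and the random stand-in for `cluster_dict` are hypothetical and not from the question.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarity matrix with a zeroed diagonal."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid dividing all-zero vectors by zero
    unit = X / norms                 # each row now has unit length
    sim = unit @ unit.T              # dot products of unit vectors = cosines
    np.fill_diagonal(sim, 0.0)       # the question's loop skips row == col
    return sim

# Hypothetical stand-in for cluster_dict: two clusters of random 100-d vectors
rng = np.random.default_rng(0)
cluster_dict = {
    "cluster 1": rng.normal(size=(5, 100)),
    "cluster 2": rng.normal(size=(3, 100)),
}

# One similarity matrix per cluster, each computed independently
sim_matrices = {c: cosine_similarity_matrix(v) for c, v in cluster_dict.items()}
```

Because each cluster is processed independently, the same function could in principle be mapped over clusters in Spark, for example with `sc.parallelize(list(cluster_dict.items())).mapValues(cosine_similarity_matrix).collectAsMap()`, assuming an existing SparkContext named `sc` and NumPy installed on the workers. That variant is untested here.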

To gain speed without compromising much on accuracy, you can also try hashing methods such as MinHash and evaluate Jaccard distance. Spark MLlib comes with a nice implementation, and the documentation has very detailed examples for reference: http://spark.apache.org/docs/latest/ml-features.html#minhash-for-jaccard-distance
