Reciprocal rank fusion in PySpark


I'm dealing with a large-scale (10M+) retrieval problem with q queries and D documents. I've computed the top k nearest documents for each query using three embedding models, and now I want to rerank these three result sets using reciprocal rank fusion (RRF). All the implementations I could find use a for loop over queries, which doesn't seem feasible: iterating sequentially over that many queries would take far too long.
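For reference, the standard RRF formula (the textbook definition, not anything specific to my setup) scores each document d as

$$\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k_0 + \mathrm{rank}_r(d)}$$

where R is the set of rankers (the three embedding models here), rank_r(d) is the 1-based rank of d in ranker r's list, and k_0 is a smoothing constant conventionally set to 60 (I write it k_0 to avoid clashing with the top-k cutoff above).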

For clarity, my similarity matrices look like this:

Embed_1: "query_1": {"doc_10": 0.3, "doc_11": 0.37, "doc_94": 0.38, "doc_1": 0.5, ...}
Embed_2: "query_1": {"doc_5": 0.06, "doc_96": 0.09, "doc_10": 0.12, "doc_8": 0.3, ...}
Embed_3: "query_1": {"doc_11": 0.49, "doc_2": 0.82, "doc_37": 0.97, "doc_4": 1.0, ...}
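
Flattening these nested dicts into long-format rows (one row per embedding/query/document triple) is straightforward; the variable and column names below are just for illustration:

```python
# sim_matrices: embedding name -> {query_id: {doc_id: similarity score}}
sim_matrices = {
    "embed_1": {"query_1": {"doc_10": 0.3, "doc_11": 0.37, "doc_94": 0.38, "doc_1": 0.5}},
    "embed_2": {"query_1": {"doc_5": 0.06, "doc_96": 0.09, "doc_10": 0.12, "doc_8": 0.3}},
    "embed_3": {"query_1": {"doc_11": 0.49, "doc_2": 0.82, "doc_37": 0.97, "doc_4": 1.0}},
}

# One (embed, query, doc, score) tuple per entry, ready for spark.createDataFrame.
rows = [
    (embed, query, doc, score)
    for embed, queries in sim_matrices.items()
    for query, docs in queries.items()
    for doc, score in docs.items()
]
```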

I want the top k document IDs for query_1, reranked with RRF. I tried multiprocessing, but the CPU of a single machine is the bottleneck; PySpark can scale out to multiple nodes and finish this much sooner.
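
To make the question concrete, here is a minimal sketch of the loop-free DataFrame approach I have in mind (the column names, the K0 = 60 constant, and ranking by descending score are my own assumptions; I haven't verified this at the 10M+ scale):

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("rrf").getOrCreate()

# Toy long-format input: one row per (embed, query, doc, score).
rows = [
    ("embed_1", "query_1", "doc_10", 0.3),
    ("embed_1", "query_1", "doc_11", 0.37),
    ("embed_2", "query_1", "doc_5", 0.06),
    ("embed_2", "query_1", "doc_10", 0.12),
    ("embed_3", "query_1", "doc_11", 0.49),
    ("embed_3", "query_1", "doc_2", 0.82),
]
df = spark.createDataFrame(rows, ["embed", "query", "doc", "score"])

K0 = 60  # conventional RRF smoothing constant

# Rank documents within each (embed, query) pair; descending order
# assumes a higher score means a closer document.
per_model = Window.partitionBy("embed", "query").orderBy(F.desc("score"))
ranked = df.withColumn("rank", F.row_number().over(per_model))

# RRF: sum 1/(K0 + rank) across embedding models for each (query, doc).
fused = (
    ranked
    .withColumn("rrf", F.lit(1.0) / (F.lit(K0) + F.col("rank")))
    .groupBy("query", "doc")
    .agg(F.sum("rrf").alias("rrf_score"))
)

# Keep the top k fused documents per query.
top_k = 10
per_query = Window.partitionBy("query").orderBy(F.desc("rrf_score"))
result = (
    fused
    .withColumn("final_rank", F.row_number().over(per_query))
    .filter(F.col("final_rank") <= top_k)
    .orderBy("query", "final_rank")
)
result.show()
```

Everything here is window functions plus one aggregation, so nothing loops over queries in driver-side Python, which is what I'm hoping makes it scale.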

Let me know if this question needs more clarity.
