Reciprocal rank fusion in PySpark


I'm dealing with a large-scale (10M+) retrieval problem with q queries and D documents. I've computed the top k nearest documents for each query using three embedding models, and now I want to rerank these three result sets using reciprocal rank fusion (RRF). All the implementations I could find use a for loop over queries, which doesn't seem feasible: iterating sequentially over that many queries would take far too long.
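For reference, the standard RRF formula (the textbook definition, not anything specific to my setup) scores each document d as

$$\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k_0 + \mathrm{rank}_r(d)}$$

where R is the set of rankers (the three embedding models here), rank_r(d) is the 1-based rank of d in ranker r's list, and k_0 is a smoothing constant conventionally set to 60 (I write it k_0 to avoid clashing with the top-k cutoff above).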

For clarity, my similarity matrices look like this:

Embed_1: "query_1": {"doc_10": 0.3, "doc_11": 0.37, "doc_94": 0.38, "doc_1": 0.5, ...}
Embed_2: "query_1": {"doc_5": 0.06, "doc_96": 0.09, "doc_10": 0.12, "doc_8": 0.3, ...}
Embed_3: "query_1": {"doc_11": 0.49, "doc_2": 0.82, "doc_37": 0.97, "doc_4": 1.0, ...}
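
Flattening these nested dicts into long-format rows (one row per embedding/query/document triple) is straightforward; the variable and column names below are just for illustration:

```python
# sim_matrices: embedding name -> {query_id: {doc_id: similarity score}}
sim_matrices = {
    "embed_1": {"query_1": {"doc_10": 0.3, "doc_11": 0.37, "doc_94": 0.38, "doc_1": 0.5}},
    "embed_2": {"query_1": {"doc_5": 0.06, "doc_96": 0.09, "doc_10": 0.12, "doc_8": 0.3}},
    "embed_3": {"query_1": {"doc_11": 0.49, "doc_2": 0.82, "doc_37": 0.97, "doc_4": 1.0}},
}

# One (embed, query, doc, score) tuple per entry, ready for spark.createDataFrame.
rows = [
    (embed, query, doc, score)
    for embed, queries in sim_matrices.items()
    for query, docs in queries.items()
    for doc, score in docs.items()
]
```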

I want the top k document IDs for query_1, reranked with RRF. I tried multiprocessing, but the CPU of a single machine is the bottleneck; PySpark can scale out to multiple nodes and finish this much sooner.
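
To make the question concrete, here is a minimal sketch of the loop-free DataFrame approach I have in mind (the column names, the K0 = 60 constant, and ranking by descending score are my own assumptions; I haven't verified this at the 10M+ scale):

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("rrf").getOrCreate()

# Toy long-format input: one row per (embed, query, doc, score).
rows = [
    ("embed_1", "query_1", "doc_10", 0.3),
    ("embed_1", "query_1", "doc_11", 0.37),
    ("embed_2", "query_1", "doc_5", 0.06),
    ("embed_2", "query_1", "doc_10", 0.12),
    ("embed_3", "query_1", "doc_11", 0.49),
    ("embed_3", "query_1", "doc_2", 0.82),
]
df = spark.createDataFrame(rows, ["embed", "query", "doc", "score"])

K0 = 60  # conventional RRF smoothing constant

# Rank documents within each (embed, query) pair; descending order
# assumes a higher score means a closer document.
per_model = Window.partitionBy("embed", "query").orderBy(F.desc("score"))
ranked = df.withColumn("rank", F.row_number().over(per_model))

# RRF: sum 1/(K0 + rank) across embedding models for each (query, doc).
fused = (
    ranked
    .withColumn("rrf", F.lit(1.0) / (F.lit(K0) + F.col("rank")))
    .groupBy("query", "doc")
    .agg(F.sum("rrf").alias("rrf_score"))
)

# Keep the top k fused documents per query.
top_k = 10
per_query = Window.partitionBy("query").orderBy(F.desc("rrf_score"))
result = (
    fused
    .withColumn("final_rank", F.row_number().over(per_query))
    .filter(F.col("final_rank") <= top_k)
    .orderBy("query", "final_rank")
)
result.show()
```

Everything here is window functions plus one aggregation, so nothing loops over queries in driver-side Python, which is what I'm hoping makes it scale.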

Let me know if this question needs more clarity.
