Fine-tuning a Sentence Transformer Embedding Model on the NFCorpus Dataset - Does this pre-processing make sense?


I am trying to fine-tune a Sentence Transformer embedding model on NFCorpus. The dataset ships with a qrels folder containing train.tsv, test.tsv, and dev.tsv files; each file maps a query ID to multiple passage IDs.
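For reference, a minimal sketch of how a qrels file can be read, assuming the BEIR-style column layout (query-id, corpus-id, score); adjust the column names if your copy of the dataset differs:

import csv
from collections import defaultdict

def load_qrels(path):
    # map each query ID to {passage ID: relevance score}
    qrels = defaultdict(dict)
    with open(path, newline='') as f:
        for row in csv.DictReader(f, delimiter='\t'):
            qrels[row['query-id']][row['corpus-id']] = int(row['score'])
    return qrels

train_qrels = load_qrels('qrels/train.tsv')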

I was following the method shown on the Sentence Transformers website (GitHub link). In that example, they use a Cross-Encoder to score each (query ID, passage ID) pair. I tried to replicate this with a one-to-one query-to-sentence mapping.

My plan is to use this augmented dataset for training, with EmbeddingSimilarityEvaluator for evaluation and CosineSimilarityLoss during training. I understand that this means the training labels themselves come from existing Sentence Transformers models, i.e. the Cross-Encoder's scores are treated as the ground truth.
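For concreteness, this is roughly how I intend to wire that up. Here train_pairs and dev_pairs are hypothetical lists of (query, sentence, score) triples like the ones the code further down produces:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer('all-MiniLM-L6-v2')

# train_pairs / dev_pairs: hypothetical lists of (query, sentence, score)
# triples, with the score already squashed into [0, 1]
train_examples = [InputExample(texts=[q, s], label=float(score))
                  for q, s, score in train_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

dev_q, dev_s, dev_scores = zip(*dev_pairs)
evaluator = EmbeddingSimilarityEvaluator(list(dev_q), list(dev_s), list(dev_scores))

model.fit(train_objectives=[(train_dataloader, train_loss)],
          evaluator=evaluator,
          epochs=1,
          warmup_steps=100)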

I have coded it this way because my assumption is that you cannot really map a query to a whole passage; you need a one-to-one mapping between a query and a sentence in the corpus. At least, that is what I understood from looking at the various loss and evaluator classes that Sentence Transformers provides. (A sketch of the sentence-splitting step is shown below.)
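The one-to-one mapping means the corpus passages have to be split into individual sentences first; a minimal sketch using NLTK (the choice of splitter is just an assumption):

import nltk
nltk.download('punkt', quiet=True)
from nltk.tokenize import sent_tokenize

def passage_to_sentences(passage_text):
    # explode a passage into sentences so the Cross-Encoder can score
    # (query, sentence) pairs one-to-one
    return sent_tokenize(passage_text)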

I guess my question is: does this really make sense? Because I am essentially using a Bi-Encoder and a Cross-Encoder in combination to generate the training data for a Bi-Encoder.

This is the code that I have written.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

class Vectorstore:
    def __init__(self, model, documents):
        # flat L2 index over L2-normalised embeddings; with unit vectors,
        # L2 distance is monotone in cosine similarity
        self.index = faiss.IndexFlatL2(model.get_sentence_embedding_dimension())
        self.embeddings = model.encode(documents)
        faiss.normalize_L2(self.embeddings)
        self.index.add(self.embeddings)
        self.ce = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', device="cuda")

    def sigmoid(self, x):
        # squash raw cross-encoder logits into [0, 1] so they can serve as cosine labels
        return 1 / (1 + np.exp(-x))

    def vectorStore(self, model, documents, query):
        # Stage 1: bi-encoder retrieval of the 10 nearest sentences
        search_vector = model.encode(query)
        _vector = np.array([search_vector])
        faiss.normalize_L2(_vector)
        distances, ann = self.index.search(_vector, k=10)
        candidates = [documents[j] for j in ann[0]]

        # Stage 2: cross-encoder re-ranking, scoring all pairs in one batch
        # (calling predict per pair also works but is much slower)
        scores = self.ce.predict([[query, doc] for doc in candidates])

        # rank by raw score (sigmoid is monotonic, so the order is unchanged);
        # sorting (score, sentence) tuples avoids the float-keyed dict, where
        # two equal scores would silently overwrite each other
        ranked = sorted(zip(scores, candidates), key=lambda t: t[0], reverse=True)

        # keep the two best, returned as {sigmoid(score): sentence}
        return {self.sigmoid(score): doc for score, doc in ranked[:2]}
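
And this is roughly how I drive it to build the training triples; corpus_sentences and train_queries are placeholders standing in for my own data loading:

bi_encoder = SentenceTransformer('all-MiniLM-L6-v2', device="cuda")
store = Vectorstore(bi_encoder, corpus_sentences)

train_pairs = []
for query in train_queries:
    # vectorStore returns {sigmoid(cross-encoder score): sentence}
    for score, sentence in store.vectorStore(bi_encoder, corpus_sentences, query).items():
        train_pairs.append((query, sentence, float(score)))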