How to evaluate a Retriever in Haystack (Python) using the retriever.eval() method


I am trying to understand the Haystack package for building question answering systems. As part of evaluating the QA system, I want to evaluate the retrievers on their own, both sparse and dense. Assume we are working with the SubjQA dataset and have imported the necessary packages. The code is as follows:

from datasets import load_dataset
from haystack import Label,Document,Span,Answer
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import DensePassageRetriever


subjqa  = load_dataset("subjqa",name = 'electronics')
dfs = {split : df_.to_pandas() for split,df_ in subjqa.flatten().items()}

#columns we are focusing....
qa_cols = ['title','question','answers.text','answers.answer_start','context']

#creating doc store by connecting to the Elastic Search document store...
document_store = ElasticsearchDocumentStore(host='localhost',port = '9200',username='elastic',password='',index='document')

#writing files to document store....
for split, df in dfs.items():
    # Exclude duplicate reviews
    docs = [{"content": row["context"],"meta":{"item_id": row["title"], "question_id": row["id"],"split": split}}
    for _,row in df.drop_duplicates(subset="context").iterrows()]
    
    document_store.write_documents(docs, index="document")
    
print(f"Loaded {document_store.get_document_count()} documents")

#creating labels for the dataset...
labels = []
for i, row in dfs["test"].iterrows():
    
    # Metadata used for filtering in the Retriever
    meta = {"item_id": row["title"], "question_id": row["id"]}
    
    # Populate labels for questions with answers
    if len(row["answers.text"]):
        for idx,answer in enumerate(row["answers.text"]):
            
            span_start = row["answers.answer_start"][idx]
            span_end   = span_start + len(answer)
            
            label = Label(query = row["question"],
                          answer= Answer(answer=answer,offsets_in_context =[ Span(span_start,span_end)]),
                          
                          document = Document(id = i,
                                             content_type="text",
                                             content = row['context']),
                          origin="gold-label",
                          meta=meta,
                          is_correct_answer=True,
                          is_correct_document=True,
                          no_answer=False)
                          
    
            labels.append(label)
    # Populate labels for questions without answers
    else:
        
        label = Label(
            query=row["question"],
            answer=Answer(answer=""),
            document=Document(
                id=i,
                content_type="text",
                content=row["context"]
            ),
            origin="gold-label",
            meta=meta,
            is_correct_answer=True,
            is_correct_document=True,
            no_answer=True)
        
        labels.append(label)
        
#writing labels to document store        
document_store.write_labels(labels=labels,index="labels")
print(f"""Loaded {document_store.get_label_count(index="labels")} question-answer pairs""")

#initializing dense passage retriever..
dpr_retriever = DensePassageRetriever(document_store=document_store,
query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",embed_title=False)

#updating docstore embeddings...
document_store.update_embeddings(dpr_retriever)

#evaluating retriever.....
eval_ = dpr_retriever.eval(label_index="labels",doc_index="document",top_k=3)

#print eval_
print(eval_)

The output of the code above is:

{'recall': 0.0, 'map': 0.0, 'mrr': 0.0, 'retrieve_time': 7.709108700022625, 'n_questions': 160, 'top_k': 3}

Recall, MAP, and MRR are all 0. I am confused by this result, because pipeline.eval() in Haystack gives meaningful results on the same data.
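One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: an id-based evaluation can only count a hit when the document id stored on a label also appears among the ids of the indexed documents. In the code above, the labels use the DataFrame index `i` as the `Document` id, while the documents were written without explicit ids. The sketch below is plain Python and does not call Haystack. The helper `id_overlap` and all id values are hypothetical and only illustrate the comparison.

```python
from typing import Iterable, Set


def id_overlap(label_doc_ids: Iterable, store_doc_ids: Iterable) -> Set[str]:
    """Return the label document ids that also exist in the store.

    Ids are compared as strings so that an integer 0 and the string "0" match.
    """
    return {str(i) for i in label_doc_ids} & {str(i) for i in store_doc_ids}


# Hypothetical values for illustration only, not taken from a real run.
label_doc_ids = [0, 1, 2]                       # e.g. DataFrame row indices used as ids
store_doc_ids = ["a3f9c1", "7be210", "c44d0e"]  # e.g. ids assigned when documents were written

shared = id_overlap(label_doc_ids, store_doc_ids)
n_label_ids = len({str(i) for i in label_doc_ids})
print(f"{len(shared)} of {n_label_ids} label document ids are present in the store")
```

If the overlap on the real ids turned out to be empty, no retrieved document could match a label by id, which would be consistent with all metrics being 0.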

How can I evaluate the retriever independently in Haystack without using pipeline.eval()? Is there an error in the way I create the labels manually in the code above? Is there a way to write our own evaluator functions for the retriever and the reader separately?
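To make the last question concrete, here is a minimal sketch of what a hand-written retriever evaluator could look like. It is plain Python and independent of Haystack. It assumes you can collect, for each query, the ranked list of retrieved document ids and the set of gold document ids. The function names and the toy data are hypothetical, and the metric definitions below (in particular the normalization of average precision) are one common choice that may differ from what Haystack computes internally.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant id is in the top k, else 0.0."""
    return 1.0 if any(doc_id in relevant for doc_id in retrieved[:k]) else 0.0


def reciprocal_rank(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first relevant id in the top k, or 0.0 if none is found."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Mean of precision@rank over the relevant ids found in the top k.

    Assumption: normalized by the number of relevant ids actually retrieved.
    """
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def evaluate_retrieval(
    run: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int
) -> Dict[str, float]:
    """Average the per-query metrics over all queries that have gold labels."""
    n = len(gold)
    return {
        "recall": sum(recall_at_k(run.get(q, []), rel, k) for q, rel in gold.items()) / n,
        "mrr": sum(reciprocal_rank(run.get(q, []), rel, k) for q, rel in gold.items()) / n,
        "map": sum(average_precision(run.get(q, []), rel, k) for q, rel in gold.items()) / n,
    }


# Toy data for illustration only: ranked ids per query and gold ids per query.
run = {"q1": ["d3", "d1", "d7"], "q2": ["d9", "d8", "d2"]}
gold = {"q1": {"d1"}, "q2": {"d5"}}
metrics = evaluate_retrieval(run, gold, k=3)
print(metrics)
```

Matching on document ids is only one option. The `doc_id in relevant` test could be swapped for a check that the gold answer string occurs in the retrieved text, which avoids depending on ids at all.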

Package information: haystack==1.21.2, Elasticsearch 7.9
