How does XGBoost/lightGBM evaluate ndcg for ranking tasks?

4.7k views Asked by RDizzl3 At 29 December 2024 at 07:29

I am comparing XGBoost and LightGBM on their ability to rank items, by reproducing the benchmarks presented here: https://github.com/guolinke/boosting_tree_benchmarks.

I have successfully reproduced the benchmark numbers reported there. Now I want to make sure that my own implementation of the NDCG metric is correct and that I understand the ranking problem properly.

My questions are:

1. When evaluating the test set with NDCG, there is a test.group file that says the first X rows are group 0, and so on. To get the recommendations for a group, do I take the predicted values and the known relevance scores and sort that list by descending predicted value within each group?
2. To get the final NDCG score from the lists created above, do I compute NDCG per group and take the mean over all groups? Is this the same methodology that XGBoost/LightGBM use in their evaluation phase?
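The per-group sorting described in the first question can be sketched as follows. This is only an illustration, assuming the group file holds one row count per query, in the same order as the rows of the test set; the helper names `split_by_group` and `rank_labels_by_score` and the toy data are hypothetical.

```python
import numpy as np

def split_by_group(values, group_sizes):
    """Split a flat array into per-query chunks using consecutive group sizes."""
    values = np.asarray(values)
    group_sizes = np.asarray(group_sizes, dtype=int)
    if group_sizes.sum() != len(values):
        raise ValueError("group sizes must sum to the number of rows")
    # Boundaries between queries are the cumulative sizes, excluding the last one.
    boundaries = np.cumsum(group_sizes)[:-1]
    return np.split(values, boundaries)

def rank_labels_by_score(labels, scores):
    """Return the known labels reordered by descending predicted score."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    return np.asarray(labels)[order]

# Hypothetical toy data: two queries with 3 and 2 rows.
labels = [2, 0, 1, 0, 1]
scores = [0.1, 0.9, 0.5, -0.2, 0.3]
group_sizes = [3, 2]

ranked = [
    rank_labels_by_score(l, s)
    for l, s in zip(split_by_group(labels, group_sizes),
                    split_by_group(scores, group_sizes))
]
print([r.tolist() for r in ranked])
```

Each element of `ranked` is the relevance list for one query, ready to be passed to an NDCG function.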

Here is my methodology for evaluating the test set after the model has finished training.

After the final iteration (500), LightGBM reports these values on the validation set:

[500]   valid_0's ndcg@1: 0.513221  valid_0's ndcg@3: 0.499337  valid_0's ndcg@5: 0.505188  valid_0's ndcg@10: 0.523407

My final step is to take the predicted output for the test set and calculate the ndcg values for the predictions.

Here is my Python code for calculating NDCG:

import numpy as np

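def dcg_at_k(r, k):
    # r: relevance labels in ranked order; only the top k positions count.
    # np.asfarray was removed in NumPy 2.0, so use np.asarray with dtype=float.
    r = np.asarray(r, dtype=float)[:k]
    if r.size:
        # Gain 2^rel - 1, discounted by log2(position + 1).
        return np.sum(np.subtract(np.power(2, r), 1) / np.log2(np.arange(2, r.size + 2)))
    return 0.


def ndcg_at_k(r, k):
    # Ideal DCG: the same labels sorted from most to least relevant.
    idcg = dcg_at_k(sorted(r, reverse=True), k)
    if not idcg:
        return 0.
    return dcg_at_k(r, k) / idcg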
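Written out, the two functions above compute the following, where r_1, ..., r_n are the relevance labels of one query in ranked order and r^* is the same list sorted from most to least relevant (the symbols are introduced here only for illustration):

```latex
\mathrm{DCG@}k(r) = \sum_{i=1}^{\min(k,\,n)} \frac{2^{r_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@}k(r) = \frac{\mathrm{DCG@}k(r)}{\mathrm{DCG@}k(r^{*})}
```

The ratio is undefined when the denominator is 0, which is the case the code handles with an explicit return value.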

For one particular group (group 0) of the test set, I have the following rows, where each predict entry is a (known relevance, predicted score) tuple:

query_id    predict
0   0   (2.0, -0.221681199441)
1   0   (1.0, 0.109895548348)
2   0   (1.0, 0.0262799346312)
3   0   (0.0, -0.595343431322)
4   0   (0.0, -0.52689043426)
5   0   (0.0, -0.542221350664)
6   0   (1.0, -0.448015576024)
7   0   (1.0, -0.357090949646)
8   0   (0.0, -0.279677741045)
9   0   (0.0, 0.2182200869)

NOTE: Group 0 actually has about 112 rows.

I then sort each group's list of tuples by predicted score in descending order and keep only the relevance labels, which gives a ranked list of relevance scores per query:

def get_recommendations(x):

    sorted_list = sorted(list(x), key=lambda i: i[1], reverse=True)
    return [k for k, _ in sorted_list]

relavance = evaluation.groupby('query_id').predict.apply(get_recommendations)

query_id
0    [4.0, 2.0, 2.0, 3.0, 2.0, 2.0, 2.0, 2.0, 2.0, ...
1    [4.0, 2.0, 2.0, 2.0, 1.0, 1.0, 3.0, 2.0, 1.0, ...
2    [2.0, 3.0, 2.0, 2.0, 1.0, 0.0, 2.0, 2.0, 1.0, ...
3    [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, ...
4    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...
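As a concrete check of the sorting step, the ten group-0 rows listed earlier can be run through `get_recommendations` on their own. This is only a sketch: group 0 has about 112 rows, so the result will not match the full group-0 list in the output above. The variable name `group0_head` is hypothetical.

```python
def get_recommendations(x):
    # Sort (relevance, predicted score) tuples by predicted score, highest first,
    # then keep only the relevance labels.
    sorted_list = sorted(list(x), key=lambda i: i[1], reverse=True)
    return [k for k, _ in sorted_list]

# The ten (relevance, predicted score) tuples shown for group 0.
group0_head = [
    (2.0, -0.221681199441),
    (1.0, 0.109895548348),
    (1.0, 0.0262799346312),
    (0.0, -0.595343431322),
    (0.0, -0.52689043426),
    (0.0, -0.542221350664),
    (1.0, -0.448015576024),
    (1.0, -0.357090949646),
    (0.0, -0.279677741045),
    (0.0, 0.2182200869),
]

ranked_labels = get_recommendations(group0_head)
print(ranked_labels)
# [0.0, 1.0, 1.0, 2.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```

The row with the highest predicted score (0.218...) has relevance 0, so it lands in first place and lowers the NDCG of this partial list.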

Finally, for each query id I calculate the NDCG score on its relevance list and then take the mean of the scores over all query ids:

relavance.apply(lambda x: ndcg_at_k(x, 10)).mean()

The value I obtain is ~0.497193.

There are 2 answers

**mcskinner** · Answer 1 · 2021-02-11T19:52:36+00:00

Cross-posting my Cross Validated answer to this cross-posted question: https://stats.stackexchange.com/questions/303385/how-does-xgboost-lightgbm-evaluate-ndcg-metric-for-ranking/487487#487487

I happened across this myself, and finally dug into the code to figure it out.

The difference is in how a zero IDCG is handled, which happens when every item in a query has relevance 0. Your code returns 0 for such a query, while LightGBM scores it as 1.

The following code produced matching results for me:

import numpy as np

def dcg_at_k(r, k):
    # np.asfarray was removed in NumPy 2.0, so use np.asarray with dtype=float.
    r = np.asarray(r, dtype=float)[:k]
    if r.size:
        return np.sum(np.subtract(np.power(2, r), 1) / np.log2(np.arange(2, r.size + 2)))
    return 0.


def ndcg_at_k(r, k):
    idcg = dcg_at_k(sorted(r, reverse=True), k)
    if not idcg:
        return 1.  # CHANGE THIS: a query with zero IDCG scores 1 instead of 0
    return dcg_at_k(r, k) / idcg
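To see how much this one line can move the mean, here is a small self-contained sketch that evaluates the same queries under both conventions. The `empty_value` parameter and the toy queries are hypothetical and exist only for this comparison; they are not part of either library's API.

```python
import numpy as np

def dcg_at_k(r, k):
    r = np.asarray(r, dtype=float)[:k]
    if r.size == 0:
        return 0.0
    discounts = np.log2(np.arange(2, r.size + 2))
    return float(np.sum((2.0 ** r - 1.0) / discounts))

def ndcg_at_k(r, k, empty_value=1.0):
    # empty_value is the score given to a query whose ideal DCG is zero,
    # i.e. a query with no relevant rows at all.
    idcg = dcg_at_k(sorted(r, reverse=True), k)
    if idcg == 0.0:
        return empty_value
    return dcg_at_k(r, k) / idcg

# Toy relevance lists, already in ranked order; the second has no relevant rows.
queries = [[1, 0, 2], [0, 0, 0]]

mean_as_one = np.mean([ndcg_at_k(q, 10, empty_value=1.0) for q in queries])
mean_as_zero = np.mean([ndcg_at_k(q, 10, empty_value=0.0) for q in queries])
print(mean_as_one, mean_as_zero)
```

With one of the two queries lacking any relevant row, the two means differ by exactly 0.5. On a real test set the gap equals the fraction of queries whose labels are all zero, which may explain a difference of a few points between a hand-rolled metric and the value printed during training.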

**MaxInsulator** · Answer 2 · 2017-09-26T13:21:58+00:00

I think the problem is caused by queries whose rows all have the same label. In that case, both XGBoost and LightGBM will produce an NDCG of 1 for that query.
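A quick way to check the tied-label case is to compute DCG over every ordering of a small label list. The sketch below uses hypothetical toy labels and a local `dcg` helper. It suggests that when all labels are equal and non-zero, every ordering has the same DCG, so NDCG is 1 by plain arithmetic; the all-zero query is the one where the ideal DCG is 0 and a convention has to decide the score.

```python
import itertools
import numpy as np

def dcg(labels):
    labels = np.asarray(labels, dtype=float)
    discounts = np.log2(np.arange(2, labels.size + 2))
    return float(np.sum((2.0 ** labels - 1.0) / discounts))

tied = [2, 2, 2]    # every row has the same non-zero label
mixed = [2, 1, 0]   # labels differ, so the order matters

# Collect the distinct DCG values over all orderings (rounded to absorb float noise).
tied_dcgs = {round(dcg(p), 12) for p in itertools.permutations(tied)}
mixed_dcgs = {round(dcg(p), 12) for p in itertools.permutations(mixed)}

# With all-zero labels the ideal DCG is 0, so DCG / IDCG is undefined.
all_zero_idcg = dcg([0, 0, 0])

print(len(tied_dcgs), len(mixed_dcgs), all_zero_idcg)
```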
