How to implement learning to rank using lightgbm?


I'm trying to set up learning to rank with lightgbm. I have the following dataset with the users' interactions for each query:

df = pd.DataFrame({'QueryID': [1, 1, 1, 2, 2, 2], 
                   'ItemID': [1, 2, 3, 1, 2, 3], 
                   'Position': [1, 2 , 3, 1, 2, 3], 
                   'Interaction': ['CLICK', 'VIEW', 'BOOK', 'BOOK', 'CLICK', 'VIEW']})

The question is how to properly set up the dataset for training. The docs mention using Dataset.set_group(), but it's not very clear how.

There are 2 answers

charelf On

I originally gave this example as an answer to another question. Even though it does not specifically address the original question, I hope it is still useful!

Here is how I used LightGBM LambdaRank.

First we import some libraries and define our dataset:

import numpy as np
import pandas as pd
import lightgbm

df = pd.DataFrame({
    "query_id":[i for i in range(100) for j in range(10)],
    "var1":np.random.random(size=(1000,)),
    "var2":np.random.random(size=(1000,)),
    "var3":np.random.random(size=(1000,)),
    "relevance":list(np.random.permutation([0,0,0,0,0, 0,0,0,1,1]))*100
})

Here is the dataframe:

     query_id      var1      var2      var3  relevance
0           0  0.624776  0.191463  0.598358          0
1           0  0.258280  0.658307  0.148386          0
2           0  0.893683  0.059482  0.340426          0
3           0  0.879514  0.526022  0.712648          1
4           0  0.188580  0.279471  0.062942          0
..        ...       ...       ...       ...        ...
995        99  0.509672  0.552873  0.166913          0
996        99  0.244307  0.356738  0.925570          0
997        99  0.827925  0.827747  0.695029          1
998        99  0.476761  0.390823  0.670150          0
999        99  0.241392  0.944994  0.671594          0

[1000 rows x 5 columns]

The structure of this dataset is important. In learning to rank tasks, you probably work with a set of queries. Here I define a dataset of 1000 rows, with 100 queries, each of 10 rows. These queries could also be of variable length.

Now for each query, we have some variables and a relevance label. I used the labels 0 and 1 here, so the task is basically this: for each query (a set of 10 rows), I want a model that assigns a higher score to the 2 rows whose relevance is 1.

Anyway, we continue with the setup for LightGBM. I split the dataset into a training set and validation set, but you can do whatever you want. I would recommend using at least 1 validation set during training.

train_df = df[:800]  # first 80%
validation_df = df[800:]  # remaining 20%

qids_train = train_df.groupby("query_id")["query_id"].count().to_numpy()
X_train = train_df.drop(["query_id", "relevance"], axis=1)
y_train = train_df["relevance"]

qids_validation = validation_df.groupby("query_id")["query_id"].count().to_numpy()
X_validation = validation_df.drop(["query_id", "relevance"], axis=1)
y_validation = validation_df["relevance"]

Now this is probably the thing you were stuck on. We create these 3 vectors/matrices for each dataframe. X_train is the collection of your independent variables, i.e. the input data for your model. y_train is your dependent variable, which is what you are trying to predict/rank. Lastly, qids_train holds, despite its name, the number of rows in each query rather than the query ids themselves. It looks like this:

array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
       10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
       10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
       10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
       10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10])

Also this is X_train:

         var1      var2      var3
0    0.624776  0.191463  0.598358
1    0.258280  0.658307  0.148386
2    0.893683  0.059482  0.340426
3    0.879514  0.526022  0.712648
4    0.188580  0.279471  0.062942
..        ...       ...       ...
795  0.014315  0.302233  0.255395
796  0.247962  0.871073  0.838955
797  0.605306  0.396659  0.940086
798  0.904734  0.623580  0.577026
799  0.745451  0.951092  0.861373

[800 rows x 3 columns]

and this is y_train:

0      0
1      0
2      0
3      1
4      0
      ..
795    0
796    0
797    1
798    0
799    0
Name: relevance, Length: 800, dtype: int64

Note that X_train is a pandas DataFrame and y_train is a pandas Series. LightGBM supports both, but numpy arrays would also work.

As you can see, the values in qids_train indicate the length of each query. If your queries were of variable length, the numbers in this array would differ as well. In my example, all queries have the same length.
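To make the variable-length case concrete, here is a minimal sketch on a hypothetical toy frame (the `toy` data and its query ids are made up for illustration). It assumes the rows of each query are stored contiguously, which is what LightGBM expects when it reads the group sizes in order.

```python
import pandas as pd

# Hypothetical toy data: three queries with 2, 3 and 1 rows respectively.
toy = pd.DataFrame({
    "query_id": [7, 7, 3, 3, 3, 5],
    "relevance": [1, 0, 0, 1, 0, 1],
})

# sort=False keeps the groups in order of first appearance, so the sizes
# line up with the row order of the feature matrix.
group_sizes = toy.groupby("query_id", sort=False)["query_id"].count().to_numpy()

print(group_sizes)  # [2 3 1]

# The sizes must add up to the number of rows passed to fit().
assert group_sizes.sum() == len(toy)
```

With the default `sort=True`, the sizes would come back ordered by query id (3, 5, 7) instead, which only matches the rows if the frame is sorted by query id, as it is in the example above.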

We do the exact same thing for the validation set, and then we are ready to start the LightGBM model setup and training. I use the scikit-learn API since I am familiar with that one.

model = lightgbm.LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
)

I only use the bare minimum of parameters here. Feel free to take a look at the LightGBM documentation and use more parameters; it is a very powerful library. To start the training process, we call the fit function on the model. Here we specify that we want NDCG@10, and that we want the function to print the results every 10th iteration.

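model.fit(
    X=X_train,
    y=y_train,
    group=qids_train,
    eval_set=[(X_validation, y_validation)],
    eval_group=[qids_validation],
    eval_at=[10],  # evaluate NDCG at the top 10 positions
    # LightGBM >= 4.0 removed the verbose argument of fit();
    # the log_evaluation callback prints the metric every 10 iterations.
    callbacks=[lightgbm.log_evaluation(period=10)],
)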

which starts the training and prints:

[10]    valid_0's ndcg@10: 0.562929
[20]    valid_0's ndcg@10: 0.55375
[30]    valid_0's ndcg@10: 0.538355
[40]    valid_0's ndcg@10: 0.548532
[50]    valid_0's ndcg@10: 0.549039
[60]    valid_0's ndcg@10: 0.546288
[70]    valid_0's ndcg@10: 0.547836
[80]    valid_0's ndcg@10: 0.552541
[90]    valid_0's ndcg@10: 0.551994
[100]   valid_0's ndcg@10: 0.542401
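To give a feel for what the ndcg@10 number measures, here is a minimal pure-Python sketch of NDCG@k for a single query. It assumes the common exponential gain `2**rel - 1` and a log2 position discount; the helper names `dcg_at_k` and `ndcg_at_k` are made up for this illustration, and LightGBM's own implementation may differ in details such as how it treats queries without any relevant row.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k labels, in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )

def ndcg_at_k(labels, scores, k):
    """NDCG@k for one query: DCG of the predicted order over the ideal DCG."""
    # Order the true labels by descending predicted score.
    ranked = [label for _, label in sorted(zip(scores, labels), key=lambda p: -p[0])]
    ideal_dcg = dcg_at_k(sorted(labels, reverse=True), k)
    # Returning 0.0 for a query without relevant rows is a choice of this sketch.
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0

labels = [0, 1, 0, 1]
perfect = ndcg_at_k(labels, [0.1, 0.9, 0.2, 0.8], k=10)  # relevant rows on top
worst = ndcg_at_k(labels, [0.9, 0.1, 0.8, 0.2], k=10)    # relevant rows at the bottom
print(perfect, worst)
```

A perfect ordering gives 1.0 and worse orderings give smaller values. Since var1 to var3 in the example dataset are random noise, it is not surprising that the validation score in the log above stays roughly flat.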

I hope I could sufficiently illustrate the process with this simple example. Let me know if you have any questions left.
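As a follow-up, the scores a ranker returns are only meaningful relative to other rows of the same query, so they are usually turned into a per-query ranking. The sketch below uses hypothetical scores that stand in for the output of `model.predict(X_validation)`; the `scored` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical scores standing in for model.predict(X_validation);
# they are only comparable within the same query.
scored = pd.DataFrame({
    "query_id": [80, 80, 80, 81, 81, 81],
    "score": [0.2, 1.5, -0.3, 0.9, 0.1, 0.4],
})

# Rank 1 is the row the model considers most relevant within its query.
scored["rank"] = (
    scored.groupby("query_id")["score"]
    .rank(ascending=False, method="first")
    .astype(int)
)

print(scored.sort_values(["query_id", "rank"]))
```

Using `method="first"` breaks ties by row order, so every row in a query gets a distinct rank.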

Rajan Garg On

Before converting this data into groups, you have to create a score variable (i.e. the dependent variable) and then generate train and test files. On top of that, you need to create two group files, one for train and one for test, each recording the number of rows that share the same qid (i.e. QueryID).
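Applied to the dataframe from the question, these steps might look like the sketch below. The mapping from interaction type to a numeric score (VIEW < CLICK < BOOK) is an assumption made for illustration; choose values that reflect how strongly each interaction signals relevance in your domain.

```python
import pandas as pd

df = pd.DataFrame({'QueryID': [1, 1, 1, 2, 2, 2],
                   'ItemID': [1, 2, 3, 1, 2, 3],
                   'Position': [1, 2, 3, 1, 2, 3],
                   'Interaction': ['CLICK', 'VIEW', 'BOOK', 'BOOK', 'CLICK', 'VIEW']})

# Assumed ordering of interaction strength (hypothetical values).
relevance_map = {'VIEW': 0, 'CLICK': 1, 'BOOK': 2}
df['relevance'] = df['Interaction'].map(relevance_map)

# Rows of one query must be contiguous; mergesort is stable, so the
# item order within each query is preserved.
df = df.sort_values('QueryID', kind='mergesort')

# Number of rows per query, in the same order as the rows themselves.
group = df.groupby('QueryID', sort=False)['QueryID'].count().to_numpy()

print(group)                     # [3 3]
print(df['relevance'].tolist())  # [1, 0, 2, 2, 1, 0]
```

The `relevance` column would then serve as the label and `group` as the group sizes, passed either to `Dataset.set_group()` or as the `group` argument of `LGBMRanker.fit()`.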

See this article for more details: https://medium.com/@tacucumides/learning-to-rank-with-lightgbm-code-example-in-python-843bd7b44574