I'm trying to set up learning to rank with LightGBM. I have the following dataset with the users' interactions based on the query:
```python
df = pd.DataFrame({'QueryID': [1, 1, 1, 2, 2, 2],
                   'ItemID': [1, 2, 3, 1, 2, 3],
                   'Position': [1, 2, 3, 1, 2, 3],
                   'Interaction': ['CLICK', 'VIEW', 'BOOK', 'BOOK', 'CLICK', 'VIEW']})
```
The question is: how do I properly set up the dataset for training? The docs mention using `Dataset.set_group()`, but it's not very clear how.
I gave this example as an answer to another question; even though it does not specifically address the original question, I hope it can still be useful!
Here is how I used LightGBM's LambdaRank.
First we import some libraries and define our dataset.
The structure of this dataset is important. In learning to rank tasks, you probably work with a set of queries. Here I define a dataset of 1000 rows, with 100 queries, each of 10 rows. These queries could also be of variable length.
Now for each query, we have some variables and we also get a relevance. I used the numbers 0 and 1 here, so the task is basically: for each query (a set of 10 rows), create a model that assigns higher relevance to the 2 rows that have a 1 for relevance.
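A sketch of such a dataset; the column names (`qid`, `var1`–`var3`, `relevance`) and the random-number setup are my own choices, not from the original post:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: 100 queries of 10 rows each (1000 rows total).
rng = np.random.default_rng(42)
n_queries, rows_per_query = 100, 10
n_rows = n_queries * rows_per_query

df = pd.DataFrame({
    "qid": np.repeat(np.arange(n_queries), rows_per_query),  # query id per row
    "var1": rng.random(n_rows),
    "var2": rng.random(n_rows),
    "var3": rng.random(n_rows),
})

# Within each query, exactly 2 of the 10 rows get relevance 1, the rest 0.
relevance = np.zeros(n_rows, dtype=int)
for q in range(n_queries):
    hits = q * rows_per_query + rng.choice(rows_per_query, size=2, replace=False)
    relevance[hits] = 1
df["relevance"] = relevance
```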
Anyway, we continue with the setup for LightGBM. I split the dataset into a training set and validation set, but you can do whatever you want. I would recommend using at least 1 validation set during training.
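The split could look like this; splitting on the query id keeps all rows of a query on the same side, and the 80/20 boundary (as well as the column names) is my choice, not from the original post:

```python
import numpy as np
import pandas as pd

# Compact stand-in for the dataframe described above (names are assumptions).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "qid": np.repeat(np.arange(100), 10),
    "var1": rng.random(1000),
    "var2": rng.random(1000),
    "relevance": np.tile([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], 100),
})

# Split on the query id so all 10 rows of a query stay together.
train_df = df[df["qid"] < 80]
validation_df = df[df["qid"] >= 80]
```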
Now this is probably the thing you were stuck at. We create three vectors/matrices for each dataframe. `X_train` is the collection of your independent variables, so the input data for your model. `y_train` is your dependent variable, what you are trying to predict/rank. Lastly, `qids_train` holds your query groups: a list with one entry per query, indicating the length (number of rows) of that query. Note that `X_train` and `y_train` can be pandas dataframes, which LightGBM supports, but numpy arrays would also work. If your queries were of variable length, the numbers in the `qids_train` list would also differ; in my example, all queries are the same length.
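A sketch of how these three objects might be built from the training dataframe; the training dataframe is recreated minimally here so the snippet runs on its own, and the column names are assumptions:

```python
import numpy as np
import pandas as pd

# Minimal stand-in for the training dataframe built above
# (column names are assumptions, not from the original post).
rng = np.random.default_rng(0)
train_df = pd.DataFrame({
    "qid": np.repeat(np.arange(80), 10),
    "var1": rng.random(800),
    "var2": rng.random(800),
    "relevance": np.tile([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], 80),
})

X_train = train_df[["var1", "var2"]]   # independent variables (model input)
y_train = train_df["relevance"]        # dependent variable (what we rank by)
# One entry per query, giving the number of rows in that query:
qids_train = train_df.groupby("qid").size().to_numpy()
print(qids_train[:5])  # -> [10 10 10 10 10]
```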
We do the exact same thing for the validation set, and then we are ready to set up and train the LightGBM model. I use the scikit-learn API since I am familiar with that one.
I only use the very minimum amount of parameters here. Feel free to take a look at the LightGBM documentation and use more parameters; it is a very powerful library. To start the training process, we call the `fit` function on the model. Here we specify that we want NDCG@10, and want the function to print the results every 10th iteration.
This starts the training and prints the validation NDCG@10 every 10 iterations.
I hope I could sufficiently illustrate the process with this simple example. Let me know if you have any questions left.