I want to train models on different subsets of data using `mlr3`, and I was wondering if there is a way to do this within a pipeline.
What I want to do is similar to the example from R for Data Science - Chapter 25: Many models. Say we use the same data set, `gapminder`, which contains different variables for countries around the world, such as GDP and life expectancy. If I wanted to train a model of life expectancy for each country, is there an easy way to create such a pipeline using `mlr3`?
Ideally, I want to use `mlr3pipelines` to create a branch in the graph for each subset (e.g. a separate branch for each country) with a model at the end. The final graph would therefore start at a single node and have `n` trained learners at the end nodes, one for each group (i.e. country) in the data set, or a final node that aggregates the results. I would also expect it to work for new data: for example, if we obtain new data for 2020 in the future, I would want it to create predictions for each country using the model trained for that specific country.
All the `mlr3` examples I have found seem to deal with models for the entire data set, or have models trained with all the groups in the training set.
Currently, I am just manually creating a separate task for each group of data, but it would be nice to have the data subsetting step incorporated into the modelling pipeline.
Functions from two packages help here: `dplyr` and `tidyr`. With them you can train multiple models by country by applying a function, `learn`, to each country's data and collecting the results in a data frame.

Note that `learn` is a function that takes a single data frame as its input. To define the `learn` function, I follow the steps provided on the mlr3 website.

I hope this solves your problem.
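A minimal sketch of what such a pipeline might look like, not necessarily the code this answer had in mind. It assumes the `gapminder` package is installed, uses `tidyr::nest()` to get one data frame per country, and fills in `learn` with a body of my own: the choice of features, the `regr.rpart` learner, and the `minsplit = 4` setting (each country has only 12 rows) are all assumptions made for illustration.

```r
library(dplyr)
library(tidyr)
library(mlr3)
library(gapminder)

# Hypothetical definition of `learn`: takes one country's data frame,
# wraps it in a regression task, and returns the trained learner.
learn <- function(df) {
  task <- as_task_regr(as.data.frame(df), target = "lifeExp", id = "life_expectancy")
  # regr.rpart ships with mlr3; minsplit is lowered because each country has few rows.
  learner <- lrn("regr.rpart", minsplit = 4)
  learner$train(task)
  learner
}

models <- gapminder %>%
  # Keep the grouping column, the assumed features, and the target.
  select(country, year, pop, gdpPercap, lifeExp) %>%
  # One row per country, with that country's rows in the list column `data`.
  nest(data = -country) %>%
  # Train one learner per country and store it in the list column `model`.
  mutate(model = lapply(data, learn))
```

The result is a tibble with one row per country and two list columns: `data` holds that country's rows and `model` holds the learner trained on them.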
New
Consider the following steps to train your model and predict the result for each country.
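One way these steps could look, as a sketch under stated assumptions rather than a definitive implementation: the helper `fit_country`, the feature set, and the `regr.rpart` learner are illustrative choices, and the 2007 rows of `gapminder` stand in for "new" data. Each country's model is joined to that country's new rows by `country`, so every prediction comes from the model trained on that country alone.

```r
library(dplyr)
library(tidyr)
library(mlr3)
library(gapminder)

features <- c("year", "pop", "gdpPercap")  # assumed feature set

# Hypothetical helper: train one learner on a single country's data frame.
fit_country <- function(df) {
  task <- as_task_regr(as.data.frame(df), target = "lifeExp", id = "life_expectancy")
  learner <- lrn("regr.rpart", minsplit = 4)
  learner$train(task)
  learner
}

# Step 1: split into training rows and rows treated as new data.
train_rows <- gapminder %>%
  filter(year < 2007) %>%
  select(country, all_of(features), lifeExp)
new_rows <- gapminder %>%
  filter(year == 2007) %>%
  select(country, all_of(features))

# Step 2: train one model per country.
models <- train_rows %>%
  nest(data = -country) %>%
  mutate(model = lapply(data, fit_country)) %>%
  select(country, model)

# Step 3: predict each country's new rows with that country's own model.
predictions <- new_rows %>%
  nest(newdata = -country) %>%
  inner_join(models, by = "country") %>%
  mutate(pred = Map(
    function(m, nd) m$predict_newdata(as.data.frame(nd))$response,
    model, newdata
  )) %>%
  select(country, newdata, pred) %>%
  unnest(c(newdata, pred))
```

Countries that appear in the new data but have no trained model are dropped by `inner_join()`; a `left_join()` would keep them, but the prediction step would then need to handle the missing models.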