I'm having a problem with partikit weighted conditional tree models trained on data with missing values.
I'm manually creating a bagged tree model by giving different integer weights to observations at each cycle.
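For readers unfamiliar with the setup, here is a minimal sketch of how such integer bootstrap weights might be generated in base R. The seed, the number of cycles (n_trees) and the object names are assumptions made for illustration, not taken from the original script.

```r
# Sketch: integer bootstrap weights for manual bagging (base R only).
set.seed(1)      # arbitrary seed, only for reproducibility
n <- 299         # number of rows, as in the question
n_trees <- 5     # hypothetical number of bagging cycles

# One weight vector per cycle: draw n row indices with replacement and
# count how often each row was drawn (0 means the row is out-of-bag).
boot_weights <- lapply(seq_len(n_trees), function(i) {
  tabulate(sample.int(n, size = n, replace = TRUE), nbins = n)
})

# Each vector could then be passed as the weights argument of the tree
# fitting call, e.g. ctree(y ~ ., data = d, weights = boot_weights[[i]]),
# where y and d are placeholders for the real response and data frame.
```

Each weight vector has one entry per row and sums to the number of rows, so every tree sees a resample of the same size as the original data.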
But when I used the bootstrapped models to make predictions, I noticed that some of them were returning fewer values than the input data had rows. Interestingly, for 299 rows of input data, the length of the predictions was either 299 or 289, and 289 is the number of rows left after removing rows with missing predictor values.
Digging into the problem, I found that it arises from the interaction of three components:
- Using weights in the model;
- Having missing data in the predictors;
- Using character variables instead of factors in the input data passed to predict()
If any one of these three conditions is absent, the problem doesn't arise and all trees return 299 values.
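Since removing any one of the three conditions avoids the problem, one possible workaround is to convert character columns to factors before calling predict(). The following is only a sketch in base R: the data frames train and newdata and their columns are made-up stand-ins for the real data, and I have not verified it against every partykit version.

```r
# Toy stand-ins for the real data (hypothetical column names and values).
train <- data.frame(
  group = factor(c("a", "b", "a", "c")),
  x     = c(1.2, NA, 3.4, 0.5)
)
newdata <- data.frame(
  group = c("b", NA, "a"),   # character column with a missing value
  x     = c(2.0, 1.1, NA),
  stringsAsFactors = FALSE
)

# Convert each character column of newdata to a factor, reusing the
# training levels when the same column is a factor in the training data.
for (nm in names(newdata)) {
  if (is.character(newdata[[nm]])) {
    lv <- if (is.factor(train[[nm]])) {
      levels(train[[nm]])
    } else {
      sort(unique(newdata[[nm]]))
    }
    newdata[[nm]] <- factor(newdata[[nm]], levels = lv)
  }
}

str(newdata)  # group is now a factor; the NA entries are kept
```

Reusing the training levels keeps the factor coding of the new data consistent with what the trees were fitted on.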
Here is the data: https://www.dropbox.com/s/98oriv2msce4wu5/anonym_data.rds?dl=0
Here is a script to reproduce the problem: https://www.dropbox.com/s/5y7g2dwt2838pbp/test.R?dl=0
The links no longer work, but I think you meant partykit. Even though ctree models can deal with missing data, there seem to be difficulties with the use of predict.party. The code uses a call to model.frame with na.action left at its default of na.fail. I'm not good enough to say whether that's a bug, but it seems strange to me, and changing it will likely fix the issue you are seeing. You can download the partykit source code and modify this line, adding the option na.action = na.pass.
Although I hope you are not still having this issue 1 year and 5 months later.
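To show why the na.action setting matters here, this small base R sketch compares what model.frame returns under different na.action values. It uses toy data with made-up values and is not the partykit source, just an illustration of the mechanism.

```r
# Toy data: 5 rows, one of which has a missing predictor (hypothetical values).
d <- data.frame(
  y = c(1.0, 2.0, 3.0, 4.0, 5.0),
  x = c(0.1, NA, 0.3, 0.4, 0.5)
)

# na.omit drops the incomplete row, so anything computed from this frame
# has fewer rows than the input.
mf_omit <- model.frame(y ~ x, data = d, na.action = na.omit)

# na.pass keeps every row and leaves the NA in place.
mf_pass <- model.frame(y ~ x, data = d, na.action = na.pass)

nrow(mf_omit)  # 4
nrow(mf_pass)  # 5

# na.fail stops with an error as soon as a missing value is present.
res_fail <- try(model.frame(y ~ x, data = d, na.action = na.fail), silent = TRUE)
inherits(res_fail, "try-error")  # TRUE
```

With na.pass the frame keeps one row per input row, which is what predictions on data with missing values need.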