h2o.ai Platt Scaling calibration


I noticed a relatively recent addition to the h2o.ai suite: the ability to perform supplementary Platt Scaling to improve the calibration of output probabilities (see calibrate_model in the H2O manual). However, little guidance is available in the online help docs. In particular, I wonder, when Platt Scaling is enabled:

  • How does it affect the models' leaderboard? That is, is Platt Scaling applied before or after the ranking metric is computed?
  • How does it affect computational performance?
  • Can the calibration_frame be the same as the validation_frame, or should they be kept separate (from both a computational and a theoretical point of view)?
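
For reference, calibration is enabled roughly like this (a minimal sketch with the H2O Python API; the file name, response column, and split ratio are illustrative):

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
df = h2o.import_file("train.csv")             # hypothetical dataset
df["label"] = df["label"].asfactor()          # Platt Scaling needs a binary target
train, calib = df.split_frame(ratios=[0.9], seed=42)

gbm = H2OGradientBoostingEstimator(
    calibrate_model=True,                     # enable Platt Scaling
    calibration_frame=calib,                  # held-out rows for the calibrator
    seed=42,
)
gbm.train(y="label", training_frame=train)    # x defaults to all other columns
```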

Thanks in advance

1 Answer

Erin LeDell (Best Answer)

Calibration is a post-processing step run after the model finishes training. Therefore it doesn't affect the leaderboard, and it has no effect on the training metrics either. It adds two more columns to the scored frame (with the calibrated predictions).
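
For example (a sketch; `model` is assumed to be a binomial H2O model trained with calibrate_model=True and `test` a compatible H2OFrame; the calibrated columns are named cal_p0/cal_p1 in recent H2O releases):

```python
preds = model.predict(test)   # ordinary scoring call
print(preds.columns)          # e.g. ['predict', 'p0', 'p1', 'cal_p0', 'cal_p1']
```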

This article provides guidance on how to construct a calibration frame:

  1. Split the dataset into train and test sets.
  2. Split the train set into a model-training set and a calibration set (one way to do this is sketched below).
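
For instance (a sketch; `df` is a hypothetical H2OFrame holding the full dataset, and the ratios are illustrative):

```python
train_full, test = df.split_frame(ratios=[0.8], seed=1)      # step 1
train, calib = train_full.split_frame(ratios=[0.9], seed=1)  # step 2: ~10% of train for calibration
```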

It also says: "The most important step is to create a separate dataset to perform calibration with."

I think the calibration frame should be used only for calibration, and hence be distinct from the validation frame. The conservative answer is that they should be separate: when you use a validation frame for early stopping or any internal model tuning (e.g. lambda search in H2O GLM), that validation frame becomes an extension of the "training data", so it's kind of off-limits at that point. However, you could try both versions, directly observe the effect, and then make a decision.
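
If you want to test that empirically, something like the following sketch could work. It assumes `train`, `valid`, `calib`, and `test` frames plus `predictors` and a binary 0/1 `target` are already prepared; the stopping settings and the use of scikit-learn's log_loss for the comparison are illustrative:

```python
# Fit the same GBM twice: once calibrating on the validation frame, once on
# a separate holdout, then compare calibrated probabilities on a test set.
from h2o.estimators import H2OGradientBoostingEstimator
from sklearn.metrics import log_loss

def fit_calibrated(calibration_frame):
    m = H2OGradientBoostingEstimator(
        calibrate_model=True,
        calibration_frame=calibration_frame,
        stopping_rounds=3,                 # early stopping consumes `valid`
        seed=1,
    )
    m.train(x=predictors, y=target,
            training_frame=train, validation_frame=valid)
    return m

for name, frame in [("calibrated on valid", valid), ("separate calib frame", calib)]:
    model = fit_calibrated(frame)
    cal_p1 = model.predict(test)["cal_p1"].as_data_frame().values.ravel()
    y_true = test[target].as_data_frame().values.ravel()
    print(name, log_loss(y_true, cal_p1))
```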

"How much data to use for calibration will depend on the amount of data you have available. The calibration model will generally only be fitting a small number of parameters (so you do not need a huge volume of data). I would aim for around 10% of your training data, but at a minimum of at least 50 examples."