I am trying to fit a gradient boosting machine (GBM) to insurance claims. The observations have unequal exposure so I am trying to use an offset equal to the log of exposures. I tried two different ways:
Put an offset term in the formula. This resulted in
nan
for the train and validation deviance for every iteration.Use the
offset
parameter in thegbm
function. This parameter is listed undergbm.more
. This results in an error message that there is an unused parameter.
I can't share my company's data but I reproduced the problem using the Insurance data table in the MASS package. See the code and output below.
library(MASS)
library(gbm)
data(Insurance)
# Try using offset in the formula.
fm1 = formula(Claims ~ District + Group + Age + offset(log(Holders)))
fitgbm1 = gbm(fm1, distribution = "poisson",
data = Insurance,
n.trees = 10,
shrinkage = 0.1,
verbose = TRUE)
# Try using offset in the gbm statement.
fm2 = formula(Claims ~ District + Group + Age)
offset2 = log(Insurance$Holders)
fitgbm2 = gbm(fm2, distribution = "poisson",
data = Insurance,
n.trees = 10,
shrinkage = 0.1,
offset = offset2,
verbose = TRUE)
This then outputs:
> source('D:/Rprojects/auto_tutorial/rcode/example_gbm.R')
Iter TrainDeviance ValidDeviance StepSize Improve
1 -347.8959 nan 0.1000 0.0904
2 -348.2181 nan 0.1000 0.0814
3 -348.3845 nan 0.1000 0.0616
4 -348.5424 nan 0.1000 0.0333
5 -348.6732 nan 0.1000 0.0850
6 -348.7744 nan 0.1000 0.0610
7 -348.8795 nan 0.1000 0.0633
8 -348.9132 nan 0.1000 -0.0109
9 -348.9200 nan 0.1000 -0.0212
10 -349.0271 nan 0.1000 0.0267
Error in gbm(fm2, distribution = "poisson", data = Insurance, n.trees = 10, :
unused argument (offset = offset2)
My question is what am I doing wrong? Also, is there another way? I noticed a weights parameter in the gbm
function. Should I use that?
Your first suggestion works if you specify a training fraction less than 1. The default is 1, which means there is no validation set.
results in