I am trying to use the MOB procedure from the R package partykit to predict survival probabilities based on a set of covariates X1,...,X25 and a treatment effect W. The linear predictor in each node in MOB only uses W, X1, and X2, but each covariate is used for selection for node splitting. I would like to force the MOB to only split according to parameter instability for the treatment effect W. When doing prediction in the final line of code below, I get the following error:
Error in rval[ix[[i]], ] <- preds[[i]] :
number of items to replace is not a multiple of replacement length
In addition: Warning messages:
1: 'newdata' had 1070 rows but variables found have 1029 rows
2: 'newdata' had 1337 rows but variables found have 1291 rows
3: 'newdata' had 1690 rows but variables found have 1680 rows
4: 'newdata' had 903 rows but variables found have 1000 rows
I believe this error occurs because the number of test observations falling in each terminal node is different than that of the training observations. How can I modify the predict statement to handle this issue and obtain predictions on the test set? I would also like to know if I'm using the parm option correctly in specifying that parameter instability should be assessed according to W.
library("survival")
library("partykit")
n=5000;n.test=5000;p=25;pi=0.5;beta=1
gamma=0.5;rho=2;cen.scale=4;n.mc=10000;
Y.max=2
generate_data <- function(n, p, pi = 0.5, beta = 1, gamma = 1, rho = 2, cen.scale = 4,
Y.max = NULL){
W <- rbinom(n, 1, pi)
X <- matrix(rnorm(n * p), n, p)
numerator <- -log(runif(n))
cox.ft <- (numerator / exp(beta * X[ ,1] + (-0.5 - gamma * X[ ,2]) * W))^2
failure.time <- pmin(cox.ft, Y.max)
numeratorC <- -log(runif(n))
censor.time <- (numeratorC / (cen.scale ^ rho)) ^ (1 / rho)
Y <- pmin(failure.time, censor.time)
D <- as.integer(failure.time <= censor.time)
list(X = X, Y = Y, W = W, D = D)
}
data <- generate_data(n, p=p, pi = pi, beta = beta, gamma = gamma, rho = rho, cen.scale = cen.scale,
Y.max = Y.max)
data.test <- generate_data(n.test, p=p, pi = pi, beta = beta, gamma = gamma, rho = rho, cen.scale = cen.scale,
Y.max = Y.max)
X=data$X
Y=data$Y
W=data$W
D=data$D
var_prog <- c("X1","X2")
colnames(X) <- paste("X", 1:25, sep="")
cov.names <- colnames(X)
wbreg <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {
survreg(y ~ 0 + x, weights = weights, dist = "weibull", ...)
}
dat <- data.frame(Y=Y,D=D,W=W,X)
eqn <- paste0("Surv(Y, D) ~ W + ",paste0(var_prog, collapse = "+")," | ",
paste0(cov.names, collapse = "+"))
glmtr <- partykit::mob(as.formula(eqn), data = dat,
fit = wbreg, control = mob_control(parm=2,minsize = 0.2*nrow(dat),
alpha = 0.10, bonferroni = TRUE))
plot(glmtr)
dat.test <- data.frame(Y=data.test$Y,D=data.test$D, W=data.test$W,data.test$X)
pct <- 1:98/100
quantile_pred <- predict(glmtr, newdata = dat.test, type = "quantile",p=pct)
The problem is that
dat.testonly provides the original variables thatmob()has seen (i.e.,Y,D,W, etc.) whilesurvreg()has seen the processed variablesyandx.The
predict()method formob()trees internally first predicts the node ID (which works smoothly in your example) and then passes on the correct subsets ofnewdatato thepredict()method for the fitted model objects (fromsurvreg()in this case). As the latter does not find the variablesyandxinnewdatait takes them from the learning data. Hence you get the warnings/errors about the mismatching dimensions.So there are two ways to deal with this:
survregoutput believe it was fitted with the formulaSurv(Y, D) ~ W + X1 + X2ornewdatato providex.Strategy 1 is what
lmtree()andglmtree()do internally. You have to be careful, though, that everything still works correctly when changing the supposed formula and terms. Hence, it is easier to apply strategy 2 safely, which is what I would recommend here.Caveat: The
predict()method forsurvreg()objects with a multivariateponly returns a matrix ifnewdatahas more than one row. Ifnewdatahas just a single row it returns a vector. This confuses thepredict()method formob()if it happens in the first node wherepredict()is applied because this determines the dimension of the output object. If it happens in subsequent nodes it is no problem. Also, univariatepis never a problem.Bonus: Yes, you are using
parmas intended. However, note that this only affects the parameter instability tests. Thus, the splitting variables in the tree are selected based on how much theWeffect changes along those variables. But for selecting the split point in the variable the full log-likelihood of the model (including all regressors) is maximized. Thus, the split point may be sensitive to changes in all coefficients, not just the one ofW.