Predicted values for new data using restricted cubic splines

2.7k views Asked by At

I have some data that I'm modeling using restricted cubic splines. I'm using the rcs transformation function in the rms package to generate the transformed variables for a linear model. Here is an example using 5 knots.

library('rms')

my_df <- data.frame(
    y = -4 * -100:100 + -1.5 * (-100:100)**2 + 3 * (-100:100)**3 + rnorm(201, 0, 1e5),
    x = -100:100
)

mod <- lm(y ~ rcs(x, 5), data = my_df)

After I fit the data, I'd like to find the predicted y values for a specific domain of x values. Here is what I'm doing now:

new_data <- data.frame(x = -3:3)

predict(mod, newdata = new_data)

However, this generates a warning message:

Warning message:
In rcspline.eval(x, nk = nknots, inclx = TRUE, pc = pc, fractied = fractied) :
    5 knots requested with 7 unique values of x.  knots set to 5 interior values.

What does this mean, and what is going on? I expected that the knot locations should already be defined in mod, so I don't understand why it seems to be trying to find new knots for the seven x values that I give it. I can avoid the warning message by providing more x values in new_data, and just ignoring the ones I don't need, but I am concerned about what predict is actually doing.

2

There are 2 answers

1
Miff On BEST ANSWER

According to Hadley's comment on this question you shouldn't expect lm to work with rcs. A quick demonstration why there's a problem:

mod <- lm(y ~ rcs(x, 5), data = my_df)

new_data <- data.frame(x = -3:3)
new_data2 <- data.frame(x = -300:300/100)

plot(new_data2$x, predict(mod, newdata = new_data2), type='l')
lines(new_data$x,predict(mod, newdata = new_data), col='red')

Graph produced as code output

The predictions vary depending on the number of x-values, even for the same range, so definitely not a good idea to combine these functions.

0
Crt Ahlin On

I believe the predict function will look in the formula and replace the variables it finds there with the ones in the newdata. The trick is, the rcs function determines where the knots are based on the provided data (distribution of it). So if the data in new_data has a different distribution than data in my_df, the knots are going to be at different locations and it will change the curve. Anyway, fixating the knot locations fixes the problem.

To fixate the knot locations, you cant't use the rcs function, but the rcspline.eval function, that takes knot locations as an argument. You can use the same function to calculate where the knots "should" be. See code below.

Knots <- rcspline.eval(my_df$x, knots.only = TRUE) # returns only locations of knots
# see ??Hmisc::rcspline.eval for details of how it determines knot locations
mod2 <- lm(y ~ rcspline.eval(x, knots = Knots), data = my_df) # fit model
predict(mod2, newdata = new_data) # predict based on mod2 and new data

Since the mod2 formula contains the knot locations, the curve should be the same shape.