Error when replacing new factor levels in test dataset with `NA`

I have split my data set into testing and training data sets. I've tried to fit a regression on the training set, and then use predict on the testing set. When I do this I get an error message that says: "Error in model.frame factor x has New Levels". I know this is because there are levels in my testing data not seen in my training data.

What I want to do is just eliminate or ignore the levels that aren't in both data sets. I've tried to do this, but it isn't setting any levels to NA, and the id object says "integer (empty)":

id <- which(!(test$x %in% levels(train$x)))
train$x[id] <- NA

fit <- lm(y ~ x, data=train)
P <- predict(fit,test)
There is 1 answer

Zheyuan Li (best answer)

You will get a "replacement length differs" error with your code.

id <- which(!(test$x %in% levels(train$x)))

tells you which elements of test$x are not in levels(train$x), so you should use id to index test$x, not train$x, when doing the replacement.

test$x[id] <- NA
test$x <- droplevels(test$x)  ## also don't forget to remove unused factor levels

fit <- lm(y ~ x, data = train)
P <- predict(fit, test)

All data in train will be used to build your linear regression model. Some predictions in P will be NA.
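
As a quick check of the fix above, here is a minimal reproducible sketch. The toy data frames are made up for illustration (they are not the asker's data); the level "d" is assumed to occur only in the test set.

```r
# Hypothetical toy data: level "d" occurs only in the test set.
train <- data.frame(x = factor(c("a", "b", "c", "a", "b", "c")),
                    y = c(1.0, 2.1, 2.9, 1.2, 1.9, 3.1))
test  <- data.frame(x = factor(c("a", "d", "c")))

# Positions in test$x whose value is not a level seen in training.
id <- which(!(test$x %in% levels(train$x)))

# Replace those values in the TEST set, then drop the now-unused level.
test$x[id] <- NA
test$x <- droplevels(test$x)

fit <- lm(y ~ x, data = train)
P <- predict(fit, test)  # the row that held the unseen level is predicted as NA
```

In this sketch id picks out the second test row, and P is NA at that position while the other rows get ordinary fitted values.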


I'm still unable to get the id object to correctly identify which levels are not in both data sets. In the workspace it just shows integer(0).

Then, what is the point of your question??!! All levels in test$x are inside levels(train$x) and there is no new level.
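
One way to see why id can come back as integer(0) is to compare the levels and the observed values directly. The sketch below uses made-up factors (train_x and test_x are hypothetical names, not the asker's objects) and assumes the test factor declares a level that never actually occurs in its data.

```r
# Hypothetical factors: "d" is a declared level of test_x but never occurs.
train_x <- factor(c("a", "b", "c"))
test_x  <- factor(c("a", "b"), levels = c("a", "b", "c", "d"))

# Levels declared on the test factor but unknown to the training factor.
extra_levels <- setdiff(levels(test_x), levels(train_x))

# Values that actually occur in the test data and are unknown to training.
new_values <- setdiff(unique(as.character(test_x)), levels(train_x))

# The asker's check looks at observed values, so it finds nothing here.
id <- which(!(test_x %in% levels(train_x)))
```

Here extra_levels is "d", while new_values and id are both empty: every value observed in the test data is already a training level, so there is nothing to replace with NA.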