I would like to partition panel data and preserve the panel nature of the data:
library(caret)
library(mlbench)
# example panel data where id is the person's identifier across years
data <- read.table("http://people.stern.nyu.edu/wgreene/Econometrics/healthcare.csv",
header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
## Here for instance the dependent variable is working
inTrain <- createDataPartition(y = data$WORKING, p = .75,list = FALSE)
# subset into training
training <- data[ inTrain,]
# subset into testing
testing <- data[-inTrain,]
# Here we see some intersections of identifiers
str(training$id[10:20])
str(testing$id)
However, when partitioning or sampling the data, I would like to avoid splitting the same person (id) across two data sets. Is there a way to randomly sample/partition the data and assign individuals, rather than observations, to the corresponding partitions?
I tried to sample:
mysample <- data[sample(unique(data$id), 1000,replace=FALSE),]
However, that destroys the panel nature of the data...
I think there's a little bug in the sampling approach using sample(): it is using the id variable like a row number. Instead, the function needs to fetch all rows belonging to an ID.
It is also worth checking the class balances, because createDataPartition would keep the class balance for WORKING equal in all sets.
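A minimal sketch of both steps, sampling persons instead of rows and then comparing class balances. It uses a small synthetic panel in place of the healthcare data so that it runs on its own; the data frame panel and the names train_ids and in_train are made up for illustration, and the 75% share simply mirrors the p = .75 from the question.

```r
# Toy panel standing in for the healthcare data (hypothetical values):
# 200 persons, 5 yearly observations each, binary outcome WORKING.
set.seed(42)
n_id <- 200
panel <- data.frame(
  id      = rep(seq_len(n_id), each = 5),
  year    = rep(1:5, times = n_id),
  WORKING = rbinom(n_id * 5, size = 1, prob = 0.6)
)

# Sample persons, not rows: draw 75% of the unique ids for training.
ids       <- unique(panel$id)
train_ids <- sample(ids, size = floor(0.75 * length(ids)), replace = FALSE)

# Fetch every row belonging to a sampled id, so each person's full
# history lands in exactly one partition.
in_train <- panel$id %in% train_ids
training <- panel[in_train, ]
testing  <- panel[!in_train, ]

# No person should appear in both sets, so this is expected to be 0.
length(intersect(training$id, testing$id))

# Compare the share of WORKING in each set. Unlike createDataPartition,
# this split is not stratified, so the shares can differ somewhat.
prop.table(table(training$WORKING))
prop.table(table(testing$WORKING))
```

With the real data, replacing panel with data should work the same way, assuming the identifier column is called id as in the question. If the WORKING shares turn out to differ too much between the sets, one option is to redraw the ids with a different seed.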