I have a data set of subjects, each of which has several rows in my pandas dataframe (each measurement is a row, and a subject may be measured several times). I would like to split my data into training and test sets, but I cannot split randomly because all of a subject's measurements are dependent (the same subject cannot appear in both train and test). How would you resolve this? I have a pandas dataframe and each subject has a different number of measurements.
Edit: My data includes the subject number for each row and I would like to split as close to 0.8/0.2 as possible.
Consider the dataframe df with column user_id to identify users.
You want to identify unique users and randomly select some. Then split your dataframe in order to put all test users in one and train users in the other.
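A minimal sketch of that idea follows. The example data, the value column, and the 20% share of users are assumptions made for illustration; only df and user_id come from the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: 10 users, each with 1 to 5 measurements.
rng = np.random.default_rng(0)
rows_per_user = rng.integers(1, 6, size=10)
df = pd.DataFrame({"user_id": np.repeat(np.arange(10), rows_per_user)})
df["value"] = rng.normal(size=len(df))

# Identify the unique users and randomly select about 20% of them.
users = df["user_id"].unique()
n_test_users = max(1, int(round(0.2 * len(users))))
test_users = rng.choice(users, size=n_test_users, replace=False)

# All rows of a selected user go to the test set; the rest go to train.
is_test = df["user_id"].isin(test_users)
df_test = df[is_test]
df_train = df[~is_test]
```

Because the selection is made on users rather than rows, no user can end up on both sides of the split.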
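Selecting 20% of the users for the test set should roughly split your rows 80/20, though not exactly, because users have different numbers of measurements.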
However, if you want the split to be as close to 80/20 as possible, then you must add users to the test set incrementally, keeping track of how many rows it already holds.
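One way to add users incrementally is sketched below. It is a greedy heuristic and not necessarily the optimal split: users are visited in random order, and a user joins the test set only if that moves the test row count closer to the target. The function name split_by_user and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd

def split_by_user(df, user_col="user_id", test_frac=0.2, seed=0):
    """Greedy user-level split aiming at test_frac of the rows (a sketch)."""
    rng = np.random.default_rng(seed)
    counts = df[user_col].value_counts()           # rows per user
    users = rng.permutation(counts.index.to_numpy())
    target = test_frac * len(df)                   # desired number of test rows

    test_users, n_test = [], 0
    for user in users:
        n_user = int(counts.loc[user])
        # Add the user only if the test size gets closer to the target.
        if abs(n_test + n_user - target) < abs(n_test - target):
            test_users.append(user)
            n_test += n_user

    is_test = df[user_col].isin(test_users)
    return df[~is_test], df[is_test]
```

Calling df_train, df_test = split_by_user(df) keeps each user entirely on one side, and the test share usually lands closer to 20% of the rows than it does when sampling 20% of the users.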