How to split a dataset to train/test where some rows are dependent?


I have a dataset of subjects, and each subject has a number of rows in my pandas dataframe (each measurement is a row, and a subject may have measured several times). I would like to split the data into a training and a test set, but I cannot split randomly because a subject's measurements are dependent: the same subject must not appear in both train and test. How would you resolve this? Note that each subject has a different number of measurements.

Edit: my data includes the subject number for each row, and I would like the split to be as close to 0.8/0.2 as possible.

1 Answer

piRSquared (best answer):

Consider the dataframe df with column user_id to identify users.

import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.randint(5, size=(100, 4)), columns=['user_id'] + list('ABC')
)

You want to identify the unique users and randomly assign some of them to the test set. Then filter the dataframe so that all of a test user's rows land in the test set and all of a train user's rows land in the training set.

unique_users = df['user_id'].unique()
train_users, test_users = np.split(
    np.random.permutation(unique_users), [int(.8 * len(unique_users))]
)

df_train = df[df['user_id'].isin(train_users)]
df_test = df[df['user_id'].isin(test_users)]

This splits the users roughly 80/20; the resulting row split is only approximate, since it depends on how many measurements each selected user happens to have.
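To convince yourself the split is leakage-free, you can assert that the two user sets are disjoint and that every row landed in exactly one split. Here is a self-contained sketch that repeats the setup above (with a fixed seed, which is my addition, so it is reproducible):

```python
import numpy as np
import pandas as pd

# Rebuild the example dataframe from the answer.
np.random.seed(0)
df = pd.DataFrame(
    np.random.randint(5, size=(100, 4)), columns=['user_id'] + list('ABC')
)

unique_users = df['user_id'].unique()
train_users, test_users = np.split(
    np.random.permutation(unique_users), [int(.8 * len(unique_users))]
)

df_train = df[df['user_id'].isin(train_users)]
df_test = df[df['user_id'].isin(test_users)]

# No user appears in both splits, and no row is lost.
assert set(train_users).isdisjoint(test_users)
assert len(df_train) + len(df_test) == len(df)
```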


However, if you want the row counts themselves to be as close to 80/20 as possible, add users incrementally until the training set reaches a target number of rows.

unique_users = df['user_id'].unique()
target_n = int(.8 * len(df))  # target number of training rows
shuffled_users = np.random.permutation(unique_users)

# Rows per user; reindexing orders the counts by the shuffled user sequence.
user_count = df['user_id'].value_counts()

# Take users greedily while the cumulative row count stays within the target,
# then map each row to its user's True/False assignment.
mapping = user_count.reindex(shuffled_users).cumsum() <= target_n
mask = df['user_id'].map(mapping)

df_train = df[mask]
df_test = df[~mask]
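To see how close the greedy version actually gets, here is a self-contained sketch (same setup as above, with a fixed seed of my choosing so it is reproducible) that repeats the incremental split and checks the row totals:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(
    np.random.randint(5, size=(100, 4)), columns=['user_id'] + list('ABC')
)

target_n = int(.8 * len(df))  # 80 training rows out of 100
shuffled_users = np.random.permutation(df['user_id'].unique())

user_count = df['user_id'].value_counts()

# Greedily take users while the cumulative row count stays within the target.
mapping = user_count.reindex(shuffled_users).cumsum() <= target_n
mask = df['user_id'].map(mapping)

df_train, df_test = df[mask], df[~mask]

# The training set never overshoots the target, and users stay disjoint.
assert len(df_train) <= target_n
assert set(df_train['user_id']).isdisjoint(set(df_test['user_id']))
```

As a side note, scikit-learn's GroupShuffleSplit implements the first (user-level) variant out of the box; its test_size is a fraction of groups, not rows, so it will not try to balance the row counts the way this greedy loop does.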