I want to divide my unbalanced dataset into three sets: training, validation, and test. I would like the class ratio to be preserved in each of them after the split.

An obvious solution is to use Scikit-learn's StratifiedShuffleSplit or StratifiedKFold twice, but that would result in two of the three sets not being exactly the same size.

I see two ways of doing this. The first is an 80/20 split, followed by another 80/20 split of the larger part, resulting in a 64/16/20 split overall. That would mean I'm validating my results on less data than I will test on later, which has its pros and cons.
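For illustration, here is a minimal sketch of that first option using Scikit-learn's train_test_split with its stratify argument (the data X and y below are made up, and the variable names are my own):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 1000 samples, roughly 10% positive class.
rng = np.random.RandomState(0)
X = rng.rand(1000, 5)
y = (rng.rand(1000) < 0.1).astype(int)

# First stratified 80/20 split: hold out the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Second stratified 80/20 split of the remaining 80% -> 64/16/20 overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.20, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 640 160 200
```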

The second possibility is to do an 80/20 split first and then a 75/25 split of the larger part, which results in the desired 60/20/20 split. But that leads to another question: am I violating something I'm not aware of by using two different split ratios?
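A sketch of that second option, differing from the one above only in the size of the second split, again with made-up data, plus a quick check that the class ratio is (nearly) identical in all three sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(1000, 5)
y = (rng.rand(1000) < 0.1).astype(int)

# First stratified split: 80% train+validation, 20% test.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Second stratified split: 75/25 of the remaining 80% -> 60/20/20 overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

# Positive-class fraction per set; stratification keeps these almost equal.
for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, len(labels), labels.mean())
```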

I tried to write the problem up with the intention of solving it as a linear program (LP), but that didn't turn out well - I'm not sure the problem even is an LP when one coefficient is a ratio of two of the coefficients one tries to optimize.

I have tried searching here, on Stats Stack Exchange, and on Data Science Stack Exchange, but without luck.

I'm keen to hear your thoughts on the above problem; perhaps I'm overthinking it and it actually has no practical implications?
