I'm working on an NLP project where I hope to use MaxEnt to categorize text into one of 20 different classes. I'm creating the training, validation, and test sets by hand from administrative data that is handwritten.
I would like to determine the sample size required for the classes in the training set and the appropriate size of the validation/testing set.
In the real world, the 20 outcomes are imbalanced. But I'm considering creating a balanced training set to help build the model.
So I have two questions:
How should I determine the appropriate sample size for each category in the training set?
Should the validation/testing sets be imbalanced to reflect the conditions the model would encounter when faced with real-world data?
To determine the sample size of your test set, you could use Hoeffding's inequality.
Let E be the positive tolerance value and N the sample size of the data set.
Hoeffding's inequality then gives a confidence of p = 1 - 2 * exp(-2 * E^2 * N).
With E = 0.05 (±5%) and N = 750, p ≈ 0.9530. In other words, with at least roughly 95.3% confidence, the error you measure on the test set will be within 5 percentage points of the true out-of-sample error.
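For convenience, here's a minimal Python sketch of that calculation; the function names are mine, and `required_test_size` simply inverts the bound to solve for N given a target confidence.

```python
import math

def hoeffding_confidence(epsilon: float, n: int) -> float:
    """Confidence that the test-set error is within +/- epsilon of the
    out-of-sample error, for a test set of size n (Hoeffding bound)."""
    return 1.0 - 2.0 * math.exp(-2.0 * epsilon**2 * n)

def required_test_size(epsilon: float, confidence: float) -> int:
    """Smallest n for which the Hoeffding bound reaches the desired confidence."""
    return math.ceil(math.log(2.0 / (1.0 - confidence)) / (2.0 * epsilon**2))

print(hoeffding_confidence(0.05, 750))   # ~0.9530, matching the example above
print(required_test_size(0.05, 0.95))    # ~738 examples for 95% confidence at +/-5%
```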
As for the sample sizes of the training and validation sets, an established convention is to split the data 50% for training and 25% each for validation and testing. The optimal sizes depend a lot on the training data and the amount of noise in it. For more detail, have a look at the chapter "Model Assessment and Selection" in "The Elements of Statistical Learning".
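If your documents and labels are in plain Python lists or arrays, a 50/25/25 split along those lines is straightforward with scikit-learn. This is just an illustrative sketch: `load_my_data` is a placeholder for however you read in your hand-labelled examples, and the stratified splits are one reasonable choice to keep all 20 classes represented in each set.

```python
from sklearn.model_selection import train_test_split

# Placeholder for your own loading code: lists of documents and their class labels.
texts, labels = load_my_data()

# First carve off 50% for training, stratified so every class appears in each split.
X_train, X_rest, y_train, y_rest = train_test_split(
    texts, labels, train_size=0.5, stratify=labels, random_state=42)

# Split the remaining 50% evenly into validation and test (25% of the total each).
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)
```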
As for your other question regarding imbalanced datasets, have a look at this thread: https://stats.stackexchange.com/questions/6254/balanced-sampling-for-network-training