How does one select a sample size and sample set (for training and testing) for a binary classification problem to be solved by applying supervised learning?
The current implementation is based on 15 binary features, which we may expand to 20 or possibly 24 binary features in order to improve the accuracy metrics. Classification is currently done by a lookup in a decision table, which we would like to replace with a machine learning classifier. Part of the goal is also to gauge our current accuracy metrics.
a) What is the minimal sample size to choose for supervised training so as to balance the desired accuracy against the cost? b) How do we select the actual samples to use for the training and test sets?
Computational learning theory defines the minimal sample size given the hypothesis space and the desired probability of keeping the error below a certain threshold. Please provide an explanation and possible examples applying the formulas.
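For reference, the bound I have in mind is (I believe) the standard PAC-learning sample-complexity result for a finite hypothesis space $H$ and a consistent learner:

$$ m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right), $$

where $m$ is the number of training samples, $\epsilon$ is the maximum acceptable error, and $\delta$ is the allowed probability of exceeding that error.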
The training/test set will be labelled by a human decision, so there is obviously a cost involved in collecting this sample set, and funding the project becomes harder when the cost and benefit cannot easily be put down on paper.
There is no easy way to determine a minimal sample size, since there are no hard and fast rules about sample sizes in machine learning. Many classifiers can be applied to binary classification (e.g. an SVM), and a number of sampling techniques can be used, depending on the structure of the data, the underlying system, and the aims of the analysis.

Your reference to selecting the set itself is somewhat confusing: are you asking how to determine the minimum amount of data required to build an accurate classifier? If so, the answer depends on which classifier you use and on its learning capacity. Also, models trained on smaller data sets may not generalize as well as those trained on larger ones, even if the measured error rates look adequate, so if you are primarily interested in accurately classifying previously unseen records, keep this in mind.

As for selecting the training samples themselves, this depends on the structure of the data and the sampling method used. You may also want to use cross-validation when training the model, to guard against over-fitting.
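As a rough sketch of how you could gauge the required sample size empirically rather than from theory, you can plot a learning curve: cross-validated accuracy as a function of training-set size, and stop labelling once the curve flattens. The example below assumes scikit-learn and uses synthetic data as a stand-in for your real 15 binary features and human-assigned labels.

```python
# Sketch: estimate how accuracy grows with training-set size via a learning
# curve, using an SVM and stratified 5-fold cross-validation.
# The data here is synthetic; substitute your own labelled records.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve, StratifiedKFold

rng = np.random.default_rng(0)

# Synthetic stand-in: 2000 records, 15 binary features,
# labels produced by a noisy rule on the first 5 features.
X = rng.integers(0, 2, size=(2000, 15))
y = (X[:, :5].sum(axis=1) + rng.normal(0, 0.5, size=2000) > 2.5).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    SVC(kernel="rbf"),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),  # fractions of the training fold
    cv=cv,
    scoring="accuracy",
)

# Print cross-validated accuracy at each training-set size; once the gain
# per additional batch of labels becomes negligible, more labelling mostly
# adds cost rather than accuracy.
for n, s in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:5d} training samples -> CV accuracy {s:.3f}")
```

Stratified folds keep the class balance of the binary labels roughly equal across splits, which matters if one class is rarer than the other in your data.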