I've built several GBM models to tune the parameters (number of trees, shrinkage, and depth) to my data, and the model performs well on the out-of-time sample. The data is credit card transactions (hundreds of millions of records), so I sampled 1% of the goods (non-events) and 100% of the bads.
However, when I increased the sample to 3% of the goods, there was a noticeable improvement in performance. My question is: how do I decide on the optimal sampling rate without running several iterations and picking whichever one fits best? Is there any theory around this?
The 1% sample has about 3 million transactions in total, of which ~380k are bads, and I have ~250 variables.
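For concreteness, the sampling setup described above looks roughly like this. This is only a minimal sketch, not my actual pipeline: the column names (`is_bad`) and the use of scikit-learn's `GradientBoostingClassifier` are placeholders for whatever GBM implementation is used, and the parameter values are illustrative.

```python
# Sketch of the undersampling scheme: keep 100% of bads, a fraction of goods,
# fit a GBM on the sample, and score an out-of-time holdout.
# "is_bad" is a placeholder label column; dev_df / oot_df are assumed to exist.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def build_training_sample(df, good_rate=0.01, label_col="is_bad", seed=42):
    """Undersample the goods (non-events) at good_rate, keep all bads."""
    bads = df[df[label_col] == 1]
    goods = df[df[label_col] == 0].sample(frac=good_rate, random_state=seed)
    # Concatenate and shuffle so the classes are interleaved
    return pd.concat([bads, goods]).sample(frac=1, random_state=seed)

def fit_and_score(train_df, oot_df, label_col="is_bad"):
    """Fit a GBM on the sampled training data and report out-of-time AUC."""
    features = [c for c in train_df.columns if c != label_col]
    model = GradientBoostingClassifier(
        n_estimators=500,    # trees
        learning_rate=0.05,  # shrinkage
        max_depth=4,         # depth
    )
    model.fit(train_df[features], train_df[label_col])
    oot_scores = model.predict_proba(oot_df[features])[:, 1]
    return roc_auc_score(oot_df[label_col], oot_scores)

# Example usage (with hypothetical development and out-of-time frames):
# auc_1pct = fit_and_score(build_training_sample(dev_df, good_rate=0.01), oot_df)
# auc_3pct = fit_and_score(build_training_sample(dev_df, good_rate=0.03), oot_df)
```

Repeating this loop over a grid of `good_rate` values is exactly the brute-force search I'd like to avoid, which is why I'm asking whether there is a principled way to choose the rate up front.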