How to Sub-Sample a Dataset

I'm going to implement SVM (support vector machines) and various other classification algorithms, but my training dataset is 10 GB. How can I sub-sample it? This is a very basic question, but I'm a beginner.

Thanks for the help.

There are 2 answers

0
Matthew Spencer On

It depends on your data.

Since you're just starting out, I'd say the best approach is to cut down the number of samples considerably first. Once that is done, reduce the number of features to a manageable size.

Once the dataset is small and simple enough, you can then consider adding back more attributes or samples as appropriate for the problem at hand.

Hope this helps!
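As a rough illustration of cutting down the sample size when the file is too big to load into memory, here is a minimal sketch of reservoir sampling, which draws a uniform random sample of k lines in a single pass. The function name and the file name in the usage note are hypothetical, and it assumes the data is stored one sample per line.

```python
import random


def reservoir_sample(lines, k, seed=0):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    Only k items are held in memory at any time, so `lines` can be a
    file object for a file that is far larger than RAM.
    """
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Replace a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                sample[j] = line
    return sample
```

It could be used along the lines of `with open("train.txt") as f: subset = reservoir_sample(f, 100_000)`. Note that this is plain random sampling, so rare classes may end up under-represented in the subset.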

0
Daniel Moraes On

The first thing you should do is reduce the number of samples (rows). LibSVM provides a very useful Python script for that. If your dataset has N samples and you want to downsample it to N - K samples, you can use that script to either (1) randomly remove K samples from your data, or (2) remove K samples from your data using stratified sampling. The latter is recommended, because it keeps the class proportions in the subset close to those in the full dataset.
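To make the stratified option concrete, here is a minimal sketch in plain Python. It is not the LibSVM script itself; the function names are hypothetical, and it assumes a text file with one sample per line whose first whitespace-separated token is the class label (as in the LibSVM data format).

```python
import random
from collections import defaultdict


def stratified_indices(labels, fraction, seed=0):
    """Pick roughly `fraction` of the row indices from every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    keep = []
    for indices in by_class.values():
        # Keep at least one row per class so rare classes do not vanish.
        n_keep = max(1, round(len(indices) * fraction))
        keep.extend(rng.sample(indices, n_keep))
    return sorted(keep)


def subsample_file(src, dst, fraction, seed=0):
    """Write a stratified subset of `src` to `dst`; return the number of rows kept."""
    # Pass 1: read only the labels, not the feature values.
    # A blank line gets the label "" instead of raising an error.
    with open(src) as f:
        labels = [(line.split(None, 1) or [""])[0] for line in f]
    keep = set(stratified_indices(labels, fraction, seed))
    # Pass 2: copy the selected rows in their original order.
    with open(src) as f, open(dst, "w") as out:
        for idx, line in enumerate(f):
            if idx in keep:
                out.write(line)
    return len(keep)
```

The two-pass design keeps only one label per row in memory, which should be manageable even when the full 10 GB file is not. A call such as `subsample_file("train.txt", "train_small.txt", 0.05)` (file names made up for the example) would keep about 5% of each class.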

Reducing the number of features (columns) is much more complicated. You can't (or at least shouldn't) remove them randomly. There are many algorithms for that, usually called dimensionality reduction (or data reduction) algorithms. The most widely used one is PCA, but it is not as simple to use as row sub-sampling.
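For a feel of what PCA does, here is a minimal sketch using NumPy's singular value decomposition. The function name `pca_reduce` is hypothetical, and it assumes the (already sub-sampled) data fits in memory as a dense array with one sample per row.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project X onto its first `n_components` principal components.

    Returns the reduced data, the component vectors (one per row),
    and the per-feature mean that was subtracted.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean  # PCA requires centered data
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    return Xc @ components.T, components, mean
```

New data has to be transformed with the same mean and components, i.e. `(X_test - mean) @ components.T`. In practice, a library implementation such as scikit-learn's `PCA` is probably the more convenient choice; features on very different scales are usually standardized first.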