I would like to train a classifier on an RDD[LabeledPoint] using only a subset of the features in each LabeledPoint (both to adjust the model quickly, and to carry items in each LabeledPoint, such as IDs or evaluation metrics, that are not features). I have searched the documentation and cannot find a way to specify which columns should be included or ignored. My code is below; I am using Spark and MLlib 1.3.1 with Scala 2.10.4.
If excluding specific features is not possible, it would still be helpful to attach an ID to each data point that is ignored during training. Any help is appreciated!
import org.apache.spark.mllib.tree.RandomForest
// trainingData: RDD[LabeledPoint]
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int](5 -> 2)
val numTrees = 100
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 6
val maxBins = 20
val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
Do you want to select a subset of features once before building the model, or do you want some custom strategy for the RandomForest classifier to use between iterations?
If it is the first case, you can transform trainingData with a map transformation before building the model.
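For example, here is a rough sketch of that map transformation; the feature indices, the ID position, and the reindexed categoricalFeaturesInfo below are made-up placeholders, so adjust them to your own schema:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

// Hypothetical: keep only these original feature positions for training
val keepIndices = Array(0, 1, 2, 5)
val prunedData = trainingData.map { lp =>
  LabeledPoint(lp.label, Vectors.dense(keepIndices.map(i => lp.features(i))))
}
// categoricalFeaturesInfo must refer to the new positions: original index 5
// lands at position 3 after the selection above, so Map(3 -> 2)
val prunedCategoricalInfo = Map[Int, Int](3 -> 2)
// If an ID sits at, say, original index 6 (assumed), keep it in a pair RDD
// keyed by ID so it never reaches the classifier
val idToPoint = trainingData.map { lp =>
  (lp.features(6), LabeledPoint(lp.label, Vectors.dense(keepIndices.map(i => lp.features(i)))))
}
val model = RandomForest.trainClassifier(prunedData, numClasses, prunedCategoricalInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)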
See the feature selection section of the MLlib Feature Extraction and Transformation guide for examples of feature selection.
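In 1.3 that guide covers ChiSqSelector, which picks the top-N features by a chi-squared test. A minimal sketch, assuming your features are categorical (which the test requires) and treating the count of 20 as a placeholder:

import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.regression.LabeledPoint

// Keep the 20 most predictive features (placeholder count)
val selector = new ChiSqSelector(20)
val selectorModel = selector.fit(trainingData)
val filteredData = trainingData.map { lp =>
  LabeledPoint(lp.label, selectorModel.transform(lp.features))
}
// filteredData can then be fed to RandomForest.trainClassifier as before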