Spark MLlib: How to ignore features when training a classifier

Question

583 views · Asked by user1109152 on 09 June 2015 at 17:49

I would like to train a classifier on an RDD[LabeledPoint] using only a subset of the features in each LabeledPoint. The goal is both to adjust the model quickly and to carry items in each LabeledPoint, such as IDs or evaluation metrics, that are not features. I have searched the documentation and cannot find a way to specify which columns should be included or ignored. My code is below; I am using Spark and MLlib 1.3.1 with Scala 2.10.4.

If excluding specific features is not possible, it would still be helpful to attach an ID to each data point and have that ID ignored during training. Any help is appreciated!

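import org.apache.spark.mllib.tree.RandomForest

// trainingData is an RDD[LabeledPoint] prepared elsewhere
val numClasses = 2
// Feature 5 is categorical with 2 categories; all other features are treated as continuous
val categoricalFeaturesInfo = Map[Int, Int](5 -> 2)
val numTrees = 100
val featureSubsetStrategy = "auto" // let the algorithm pick how many features to consider per split
val impurity = "gini"
val maxDepth = 6
val maxBins = 20
val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)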

There is 1 answer

**vvladymyrov** · Answer 1 · 2015-06-10T02:44:07+00:00

Do you want to select a subset of features once, before building the model, or do you want a custom strategy that the RandomForest classifier applies between iterations?

In the first case, you can transform trainingData with a map transformation before building the model.
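The map transformation described above could look roughly like the sketch below. It is only an illustration, not an MLlib facility: `FeatureSubset`, `selectFeatures`, `remapCategorical`, and the `keep` index array are hypothetical names introduced here. The sketch assumes dense feature vectors and relies only on `LabeledPoint`, `Vectors.dense`, and `Vector.toArray` from MLlib.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical helper object; the names are illustrative and not part of MLlib.
object FeatureSubset {

  // Build a new LabeledPoint that keeps only the features at the given
  // indices, in the order listed in `keep`. The label is left unchanged.
  def selectFeatures(point: LabeledPoint, keep: Array[Int]): LabeledPoint = {
    val all = point.features.toArray
    LabeledPoint(point.label, Vectors.dense(keep.map(i => all(i))))
  }

  // Dropping features shifts the remaining indices, so the keys of
  // categoricalFeaturesInfo need to be translated to the new positions.
  // Entries for features that were dropped are discarded.
  def remapCategorical(info: Map[Int, Int], keep: Array[Int]): Map[Int, Int] = {
    val newIndex = keep.zipWithIndex.toMap // old index -> new index
    info.collect { case (old, arity) if newIndex.contains(old) => newIndex(old) -> arity }
  }
}
```

With `val keep = Array(0, 5)`, for example, the training set would be `trainingData.map(p => FeatureSubset.selectFeatures(p, keep))`, and the question's `Map(5 -> 2)` would have to become `Map(1 -> 2)`, since feature 5 moves to position 1. To carry an ID that is ignored during training, one option (again an assumption, not something the answer states) is to keep an `RDD[(Long, LabeledPoint)]` and pass only `.map(_._2)` to `RandomForest.trainClassifier`.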

See the feature selection section of the "MLlib - Feature Extraction and Transformation" guide for examples of feature selection.
