Is it necessary to convert categorical attributes to numerical attributes to use LabeledPoint in PySpark?


I am new to PySpark. I have a dataset that contains categorical features, and I want to use PySpark regression models to predict continuous values. I am stuck on the data pre-processing that the MLlib models require.

There is 1 answer

user7337271 (best answer)

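Yes, it is necessary. LabeledPoint takes a numeric label and a numeric feature vector, so categorical values cannot be passed as strings. Converting them to numbers is not enough on its own: a plain numeric index implies an ordering between categories, so you also have to encode the values to make them useful for linear models. Both steps are implemented in pyspark.ml (not pyspark.mllib):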

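  • pyspark.ml.feature.StringIndexer - indexing (maps each distinct string value to a numeric index).
  • pyspark.ml.feature.OneHotEncoder - encoding (turns each index into a sparse binary vector).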