I am new to PySpark. I have a dataset that contains categorical features, and I want to use PySpark's regression models to predict continuous values. I am stuck on the data preprocessing that is required before the MLlib models can be used.
Is it necessary to convert categorical attributes to numerical attributes to use the LabeledPoint function in PySpark?
Asked by jdatastic17
Yes, it is necessary. You not only have to convert the categorical values to numbers, you also have to encode them so that they are useful for linear models. Both steps are implemented in pyspark.ml (not mllib) with:
- pyspark.ml.feature.StringIndexer - indexing.
- pyspark.ml.feature.OneHotEncoder - encoding.