Pyspark one-hot encoding with grouping same id


Is there a way to perform OHE in Spark and 'flatten' the dataset so that each id has only one row?

For example if input is like this:

+---+--------+
| id|category|
+---+--------+
|  0|       a|
|  1|       b|
|  2|       c|
|  1|       a|
|  2|       a|
|  0|       c|
+---+--------+

Output should be like this (id0 has categories a and c, id1 has a and b, etc.):

+---+----------+----------+----------+
| id|category_a|category_c|category_b|
+---+----------+----------+----------+
|  0|         1|         1|         0|
|  1|         1|         0|         1|
|  2|         1|         1|         0|
+---+----------+----------+----------+

I can do this in pandas with OHE + groupby (aggregating with 'max'), but I can't find a way to do it in PySpark that produces this specific output format.
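For reference, the pandas version I mean is a sketch like this (`dtype=int` is my addition so the dummies come out as 0/1 rather than booleans in newer pandas):

```python
import pandas as pd

df = pd.DataFrame({"id": [0, 1, 2, 1, 2, 0],
                   "category": ["a", "b", "c", "a", "a", "c"]})

# One-hot encode, then collapse duplicate ids with max so any
# occurrence of a category yields a 1 for that id.
out = (pd.get_dummies(df, columns=["category"], dtype=int)
         .groupby("id").max()
         .reset_index())
```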

Thank you, appreciate any help.


There are 2 answers

Arunbh Yashaswi On BEST ANSWER
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.sql.functions import count

# Map each category string to a numeric index, then one-hot encode it
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
encoder = OneHotEncoder(inputCols=["categoryIndex"], outputCols=["categoryVec"])

pipeline = Pipeline(stages=[indexer, encoder])
model = pipeline.fit(df)
transformed_df = model.transform(df)

# Pivot on category so each id collapses to a single row
result = transformed_df.groupBy("id").pivot("category").agg(count("categoryVec"))

result.show()

This converts the category values to indices with StringIndexer, applies the OneHotEncoder, then groups by id, pivots on category, and aggregates everything together.

Changed from max to count as requested.

Alex_Y On

OK, it seems this can be done with groupBy + pivot:

df.groupBy('id').pivot('category').count().show()

A small follow-up question: how should this 'encoder' be stored for use at inference time?
Just save the list of columns, compare it against the inference data, and create any missing columns filled with 0?
Any better ideas?