Is there a way to perform one-hot encoding (OHE) in Spark and 'flatten' the dataset so that each id has only one row?
For example if input is like this:
+---+--------+
| id|category|
+---+--------+
| 0| a|
| 1| b|
| 2| c|
| 1| a|
| 2| a|
| 0| c|
+---+--------+
Output should be like this (id0 has categories a and c, id1 has a and b, etc.):
+---+----------+----------+----------+
| id|category_a|category_c|category_b|
+---+----------+----------+----------+
| 0| 1| 1| 0|
| 1| 1| 0| 1|
| 2| 1| 1| 0|
+---+----------+----------+----------+
I can do this in pandas with OHE followed by a groupby (aggregating with 'max'), but I can't find a way to do it in PySpark that produces this specific output format.
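For reference, this is a sketch of the pandas approach I have in mind (sample data mirrors the table above):

```python
import pandas as pd

# Sample data matching the input table above.
df = pd.DataFrame({"id": [0, 1, 2, 1, 2, 0],
                   "category": ["a", "b", "c", "a", "a", "c"]})

# One-hot encode the category column, then take the per-id max
# so each id collapses to a single row of 0/1 indicators.
out = (pd.get_dummies(df, columns=["category"])
         .groupby("id").max()
         .astype(int)
         .reset_index())
print(out)
```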
Thank you, appreciate any help.
You can do this by converting the values to indices with StringIndexer, applying the OneHotEncoder, pivoting around id, and finally aggregating everything together.
Changed the aggregation from max to count, as you asked.