Pyspark one-hot encoding with grouping same id


Is there a way to perform OHE in Spark and 'flatten' the dataset so that each id has only one row?

For example if input is like this:

+---+--------+
| id|category|
+---+--------+
|  0|       a|
|  1|       b|
|  2|       c|
|  1|       a|
|  2|       a|
|  0|       c|
+---+--------+

Output should be like this (id0 has categories a and c, id1 has a and b, etc.):

+---+----------+----------+----------+
| id|category_a|category_c|category_b|
+---+----------+----------+----------+
|  0|         1|         1|         0|
|  1|         1|         0|         1|
|  2|         1|         1|         0|
+---+----------+----------+----------+

I can do this in pandas with OHE + groupby (aggregating with 'max'), but I can't find a way to do it in PySpark that produces this specific output format.
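For reference, the pandas version I mean is a sketch like this (`dtype=int` is my addition so the dummies come out as 0/1 rather than booleans in newer pandas):

```python
import pandas as pd

df = pd.DataFrame({"id": [0, 1, 2, 1, 2, 0],
                   "category": ["a", "b", "c", "a", "a", "c"]})

# One-hot encode, then collapse duplicate ids with max so any
# occurrence of a category yields a 1 for that id.
out = (pd.get_dummies(df, columns=["category"], dtype=int)
         .groupby("id").max()
         .reset_index())
```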

Thank you, appreciate any help.


There are 2 answers

Arunbh Yashaswi On BEST ANSWER
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.sql.functions import count

# Map each category string to a numeric index, then one-hot encode it
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
encoder = OneHotEncoder(inputCols=["categoryIndex"], outputCols=["categoryVec"])

pipeline = Pipeline(stages=[indexer, encoder])
model = pipeline.fit(df)
transformed_df = model.transform(df)

# Pivot on category so each id collapses to a single row
result = transformed_df.groupBy("id").pivot("category").agg(count("categoryVec"))

result.show()

This converts the category values to indices with StringIndexer, applies the OneHotEncoder, then groups by id, pivots on category, and aggregates everything together.

Changed from max to count as requested.

Alex_Y On

OK, it seems this can be done with groupBy + pivot:

df.groupBy('id').pivot('category').count().show()

A small follow-up question: how should this 'encoder' be stored for use at inference time?
Just save the list of columns, compare it against the inference data, and create any missing columns filled with 0?
Any better ideas?