pyspark 1.4 how to get list in aggregated function


I want to get a list of a column's values in an aggregated function in PySpark 1.4, where collect_list is not available. Does anyone have a suggestion for how to do it?

Original columns:

ID, date, hour, cell
1, 1030, 01, cell1
1, 1030, 01, cell2
2, 1030, 01, cell3
2, 1030, 02, cell4

I want output like below, grouped by (ID, date, hour):

ID, date, hour, cell_list
1, 1030, 01, cell1, cell2
2, 1030, 01, cell3
2, 1030, 02, cell4

But my PySpark is 1.4.0, where collect_list is not available, so I can't do: df.groupBy("ID", "date", "hour").agg(collect_list("cell")).
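For clarity, here is the intended aggregation sketched in plain Python (using the sample rows from the tables above), independent of Spark — this is the semantics collect_list provides:

```python
from collections import defaultdict

# Sample rows from the question: (ID, date, hour, cell)
rows = [
    ("1", "1030", "01", "cell1"),
    ("1", "1030", "01", "cell2"),
    ("2", "1030", "01", "cell3"),
    ("2", "1030", "02", "cell4"),
]

# Group by (ID, date, hour) and collect the cell values,
# mirroring what collect_list does inside a groupBy
groups = defaultdict(list)
for id_, date, hour, cell in rows:
    groups[(id_, date, hour)].append(cell)

print(groups[("1", "1030", "01")])  # ['cell1', 'cell2']
```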

There is 1 answer

Answered by Alper t. Turker:

Spark 1.4 is old, unsupported, slow, buggy, and incompatible with current versions. You should really consider upgrading your Spark installation.

Enable Hive support, register the DataFrame as a temporary table, and use SQL:

sqlContext = HiveContext(sc)

df = ...  # create the DataFrame using the HiveContext
df.registerTempTable("df")

sqlContext.sql(
    "SELECT id, date, hour, collect_list(cell) AS cell_list "
    "FROM df GROUP BY id, date, hour"
)

Since you use YARN, you should be able to submit code from any Spark version, but it might require placing a custom PySpark version on the PYTHONPATH.