pyspark 1.4 how to get list in aggregated function


I want to get a list of a column's values in an aggregated function in PySpark 1.4, where collect_list is not available. Does anyone have a suggestion for how to do it?

Original columns:

ID, date, hour, cell
1, 1030, 01, cell1
1, 1030, 01, cell2
2, 1030, 01, cell3
2, 1030, 02, cell4

I want output like below, grouped by (ID, date, hour):

ID, date, hour, cell_list
1, 1030, 01, cell1, cell2
2, 1030, 01, cell3
2, 1030, 02, cell4

But my PySpark is 1.4.0, where collect_list is not available, so I can't do: df.groupBy("ID", "date", "hour").agg(collect_list("cell")).
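For clarity, here is the intended aggregation sketched in plain Python (using the sample rows from the tables above), independent of Spark — this is the semantics collect_list provides:

```python
from collections import defaultdict

# Sample rows from the question: (ID, date, hour, cell)
rows = [
    ("1", "1030", "01", "cell1"),
    ("1", "1030", "01", "cell2"),
    ("2", "1030", "01", "cell3"),
    ("2", "1030", "02", "cell4"),
]

# Group by (ID, date, hour) and collect the cell values,
# mirroring what collect_list does inside a groupBy
groups = defaultdict(list)
for id_, date, hour, cell in rows:
    groups[(id_, date, hour)].append(cell)

print(groups[("1", "1030", "01")])  # ['cell1', 'cell2']
```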

There is 1 answer

Answered by Alper t. Turker:

Spark 1.4 is old, unsupported, slow, buggy, and incompatible with current versions. You should really consider upgrading your Spark installation.

Enable Hive support, register the DataFrame as a temporary table, and use SQL:

sqlContext = HiveContext(sc)

df = ...  # create the DataFrame using the HiveContext
df.registerTempTable("df")

sqlContext.sql(
    "SELECT id, date, hour, collect_list(cell) AS cell_list "
    "FROM df GROUP BY id, date, hour"
)

Since you use YARN, you should be able to submit code from any Spark version, but it might require placing a custom PySpark version on the PYTHONPATH.