I want to get a list of a column's values in an aggregate function, in PySpark 1.4, where collect_list
is not available. Does anyone have a suggestion for how to do it?
Original columns:
ID, date, hour, cell
1, 1030, 01, cell1
1, 1030, 01, cell2
2, 1030, 01, cell3
2, 1030, 02, cell4
I want output like below, grouped by (ID, date, hour):
ID, date, hour, cell_list
1, 1030, 01, cell1, cell2
2, 1030, 01, cell3
2, 1030, 02, cell4
But my PySpark is 1.4.0, where collect_list
is not available, so I can't do:
df.groupBy("ID", "date", "hour").agg(collect_list("cell"))
Spark 1.4 is old, unsupported, slow, buggy, and incompatible with current versions. You should really consider upgrading your Spark installation.
Enable Hive support, register the
DataFrame
as a temporary table, and use Hive's collect_list UDAF through SQL. Since you use YARN, you should be able to submit code from any Spark version, but it might require placing a custom PySpark version on the
PYTHONPATH
.
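A minimal sketch of the HiveContext route, assuming your Spark 1.4 build includes Hive support; the app name and the temp-table name "records" are my own placeholders:

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="collect_list_workaround")
# HiveContext makes Hive UDAFs such as collect_list available via SQL,
# even though pyspark.sql.functions.collect_list doesn't exist in 1.4.
sqlContext = HiveContext(sc)

df = sqlContext.createDataFrame(
    [(1, 1030, "01", "cell1"),
     (1, 1030, "01", "cell2"),
     (2, 1030, "01", "cell3"),
     (2, 1030, "02", "cell4")],
    ["ID", "date", "hour", "cell"])

# Register the DataFrame as a temporary table so SQL can see it.
df.registerTempTable("records")

grouped = sqlContext.sql(
    "SELECT ID, date, hour, collect_list(cell) AS cell_list "
    "FROM records GROUP BY ID, date, hour")
grouped.show()
```

This needs a running Spark cluster (or local mode) to execute; the SQL path sidesteps the missing DataFrame function entirely.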