I have a DataFrame with a column type, and I have two lists:

women = ['0980981', '0987098']
men = ['1234567', '4567854']

Now I want to add another column based on the value of the type column, like this:

from pyspark.sql import functions as psf
df_ = df.withColumn('new_col', psf.when(psf.col('type') == 'men', men).when(psf.col('type') == 'women', women))

But it seems we cannot pass a Python list directly, the way we can pass Array('1234567', '4567854') in Scala. I have tried psf.lit(men) as well, but no luck.

Any idea on how to do it?

1 Answer

Marcus Lim

Use pyspark.sql.functions.array, which takes a list of column expressions and returns a single column expression of Array type, in conjunction with a list comprehension over men:

from pyspark.sql import functions as F

men = ['1234567', '4567854']

df = spark.createDataFrame([['women'], ['men']], 'type: string')
df.withColumn('new_col', F.when(F.col('type') == 'men', F.array([F.lit(string) for string in men]))).show()

Output:

+-----+------------------+
| type|           new_col|
+-----+------------------+
|women|              null|
|  men|[1234567, 4567854]|
+-----+------------------+