pyspark arrays_zip exclude index

I'm using the new pyspark arrays_zip function in v2.4 to zip the following arrays:

["ABK","APR","ABF"]
["R0789","R0602","E039"]

The result is:

[{"0":"ABK","1":"R0789"},{"0":"APR","1":"R0602"},{"0":"ABF","1":"E039"}]

How do I get the following result instead?

[{"ABK":"R0789"},{"APR":"R0602"},{"ABF":"E039"}]

I'm not zipping columns directly. The columns contain JSON, so I use get_json_object to extract what looks like an array but is actually a string. I then convert that string to an actual array in a custom function using the split function.

arrays_zip(myStringArrayToArray(get_json_object(...
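For illustration, here is a pure-Python sketch of what a helper like `myStringArrayToArray` presumably does (the name and the exact parsing rules are assumptions; the real version in the question works on a Spark column with split):

```python
def my_string_array_to_array(s):
    """Hypothetical sketch: turn a JSON-array-looking string such as
    '["ABK","APR","ABF"]' into a Python list of strings."""
    # Strip the surrounding brackets, then split on commas and
    # remove the quotes around each element.
    inner = s.strip().lstrip('[').rstrip(']')
    if not inner:
        return []
    return [item.strip().strip('"') for item in inner.split(',')]

print(my_string_array_to_array('["ABK","APR","ABF"]'))
# → ['ABK', 'APR', 'ABF']
```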

The pyspark documentation shows this example and does not show/mention index values being included in the result:

from pyspark.sql.functions import arrays_zip
df = spark.createDataFrame([(([1, 2, 3], [2, 3, 4]))], ['vals1', 'vals2'])
df.select(arrays_zip(df.vals1, df.vals2).alias('zipped')).collect()
[Row(zipped=[Row(vals1=1, vals2=2), Row(vals1=2, vals2=3), Row(vals1=3, vals2=4)])]

Update: I've confirmed that my array example matches the behaviour shown in the documentation sample. Because I'm applying arrays_zip to anonymous arrays, the result fields are indexes; if they were columns (as in the documentation) the fields would be column names instead. So the string-to-array conversion is not the issue here.

I was expecting arrays_zip to behave more like the Python zip function e.g.

a1 = [1, 2, 3]
a2 = ['one', 'two', 'three']

list(zip(a1, a2))
[(1, 'one'), (2, 'two'), (3, 'three')]

Maybe a UDF is the only option here.
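One way to sketch such a UDF (the function name is my own; the core logic is plain Python, and the Spark registration is shown only as a commented, untested sketch):

```python
def zip_to_maps(keys, values):
    """Pair each key with its value, yielding a list of single-entry
    dicts, e.g. (["ABK"], ["R0789"]) -> [{"ABK": "R0789"}]."""
    return [{k: v} for k, v in zip(keys, values)]

print(zip_to_maps(["ABK", "APR", "ABF"], ["R0789", "R0602", "E039"]))
# → [{'ABK': 'R0789'}, {'APR': 'R0602'}, {'ABF': 'E039'}]

# In Spark this could be registered as a UDF returning an
# array<map<string,string>> column (untested sketch):
# from pyspark.sql import functions as F
# from pyspark.sql.types import ArrayType, MapType, StringType
# zip_to_maps_udf = F.udf(zip_to_maps,
#                         ArrayType(MapType(StringType(), StringType())))
```

Note that Spark 2.4 also added map_from_arrays, which builds a single map from two arrays rather than an array of one-entry maps; whether that shape is acceptable depends on the downstream consumer.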
