I have a PySpark DataFrame that contains two columns, each of which is an array of strings. How can I make a new column that is the Cartesian product of the two, without splitting them into two DataFrames and joining them, and without a UDF?
Example:
In df:
+------+---------+
|    a1|       a2|
+------+---------+
|[1, 2]|[3, 4, 5]|
|[1, 2]|   [7, 8]|
+------+---------+
Out df:
+------+---------+------------------------------------------------+
|    a1|       a2|                                              a3|
+------+---------+------------------------------------------------+
|[1, 2]|[3, 4, 5]|[{1, 3}, {1, 4}, {1, 5}, {2, 3}, {2, 4}, {2, 5}]|
|[1, 2]|   [7, 8]|                [{1, 7}, {1, 8}, {2, 7}, {2, 8}]|
+------+---------+------------------------------------------------+
You can try nesting `transform` to create the Cartesian product: the outer `transform` iterates over `a1`, and for each element the inner `transform` pairs it with every element of `a2`. This produces a nested array (an array of arrays of structs), which you can pass to `flatten` to get the final single array.