How to concatenate complex data type columns with other type columns in a PySpark DataFrame?


I am trying to concatenate columns of type string, int, array&lt;string&gt;, array&lt;array&lt;string&gt;&gt;, and a column with this schema:

|-- components: array (nullable = true)
 |    |-- element: struct (containsNull = true)
but trying concat_ws throws an error like the one below, because one of the columns is of array&lt;array&lt;string&gt;&gt; type: argument 28 requires (array&lt;string&gt; or string) type

It seems concat_ws() does not work on complex data types. Is there an alternative to concat_ws() that achieves the above? It should also work dynamically, meaning the column names should not be hardcoded and it should work for any columns.


1 Answer

mck On

I combined the whole row into an array and then serialized it to JSON. After that, md5 works.

import pyspark.sql.functions as F

df = spark.createDataFrame([[1]]).selectExpr("array(array('1','2'),array('3','4')) as col", "array(array('1','2'),array('3','4')) as col2")

df.show()
+----------------+----------------+
|             col|            col2|
+----------------+----------------+
|[[1, 2], [3, 4]]|[[1, 2], [3, 4]]|
+----------------+----------------+

df.select(F.md5(F.to_json(F.array(df.columns))).alias('md5')).show(truncate=False)
+--------------------------------+
|md5                             |
+--------------------------------+
|ae5cf1132240349bdc100d9f6ff4dd8b|
+--------------------------------+
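To see why serializing first helps, here is a pure-Python sketch of the same idea, without a Spark session. The helper name `row_md5` is hypothetical (not part of any Spark API), and the resulting hex digest will not necessarily match Spark's, since JSON formatting details can differ; the point is only that nested arrays become hashable once reduced to a single string.

```python
import hashlib
import json

def row_md5(*values):
    # Hypothetical helper: serialize the row's values to JSON
    # (this handles arbitrarily nested lists, mirroring what
    # to_json does for Spark arrays), then md5 the bytes.
    payload = json.dumps(list(values), separators=(",", ":"))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# Two nested-array "columns", as in the DataFrame above.
col = [["1", "2"], ["3", "4"]]
col2 = [["1", "2"], ["3", "4"]]
print(row_md5(col, col2))
```

Because the hash is computed over all of `df.columns` at once, the same expression works for any schema without hardcoding column names.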