I have a dataframe as
+---------------------------------------------------------------+---+
|family_name                                                    |id |
+---------------------------------------------------------------+---+
|[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]|id1|
|[[Tom, Riddle, Single, 888-888-8888]]                          |id2|
+---------------------------------------------------------------+---+
root
 |-- family_name: string (nullable = true)
 |-- id: string (nullable = true)
I wish to convert the column `family_name` to an array of named structs as
`family_name` array<struct<f_name:string,l_name:string,status:string,ph_no:string>>
I'm able to convert `family_name` to an array as shown below
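import org.apache.spark.sql.functions.regexp_replace
import org.apache.spark.sql.types.{ArrayType, StringType}

val sch = ArrayType(ArrayType(StringType))
val fam_array = data
  // collapse the doubled outer brackets so the string reads "[...],[...]"
  .withColumn("family_name_clean", regexp_replace($"family_name", "\\[\\[", "["))
  .withColumn("family_name_clean_clean1", regexp_replace($"family_name_clean", "\\]\\]", "]"))
  // toArray is not defined in this snippet (presumably a user-defined function)
  .withColumn("ar", toArray($"family_name_clean_clean1"))
  //.withColumn("ar1", from_json($"ar", sch))
fam_array.show(false)
fam_array.printSchema()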
+---------------------------------------------------------------+---+--------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
|family_name |id |family_name_clean |family_name_clean_clean1 |ar |
+---------------------------------------------------------------+---+--------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
|[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]|id1|[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]|[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]|[[John, Doe, Married, 999-999-9999], [Jane, Doe, Married, Wife, ]]|
|[[Tom, Riddle, Single, 888-888-8888]] |id2|[Tom, Riddle, Single, 888-888-8888]] |[Tom, Riddle, Single, 888-888-8888] |[[Tom, Riddle, Single, 888-888-8888]] |
+---------------------------------------------------------------+---+--------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
root
|-- family_name: string (nullable = true)
|-- id: string (nullable = true)
|-- family_name_clean: string (nullable = true)
|-- family_name_clean_clean1: string (nullable = true)
|-- ar: array (nullable = true)
| |-- element: string (containsNull = true)
`sch` is a schema variable of the desired type.
How do I convert column `ar` to array<struct<>>?
EDIT:
I'm using Spark 2.3.2
To create an array of structs given an array of arrays of strings, you can use the `struct` function to build a struct from a list of columns, combined with the `element_at` function to extract the element at a specific index of an array.
To solve your specific problem, as you correctly stated, you need to do two things:
1. Convert the `family_name` string to an array of arrays of strings
2. Build an array of structs from that array of arrays
In Spark 3.0 and greater
Using Spark 3.0, we can perform all those steps using Spark built-in functions.
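A minimal sketch of how those built-in functions could be chained, meant to be pasted into `spark-shell` or run as a Scala script. It assumes Spark 3.0+ (the Scala `transform` function is not available earlier), and the local `SparkSession`, the hard-coded `data` rows copied from the question, and the extra `trim` call are my own assumptions rather than part of the original answer. Each `withColumn` corresponds to one of the steps described in this section.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, element_at, regexp_replace, split, struct, transform, trim}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("family-structs").getOrCreate()
import spark.implicits._

// Input rows copied from the question.
val data = Seq(
  ("[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]", "id1"),
  ("[[Tom, Riddle, Single, 888-888-8888]]", "id2")
).toDF("family_name", "id")

val result = data
  // Step 1a: strip the leading "[[" and the trailing "]]".
  .withColumn("family_name", regexp_replace(col("family_name"), "^\\[\\[|\\]\\]$", ""))
  // Step 1b: first array level, one string per person.
  .withColumn("family_name", split(col("family_name"), "\\],\\s*\\["))
  // Step 1c: second array level, one string per field.
  .withColumn("family_name", transform(col("family_name"), person => split(person, ",")))
  // Step 2: build a named struct from each inner array (element_at is 1-based).
  // trim is an addition of this sketch to drop the space after each comma.
  .withColumn("family_name", transform(col("family_name"), fields => struct(
    trim(element_at(fields, 1)).as("f_name"),
    trim(element_at(fields, 2)).as("l_name"),
    trim(element_at(fields, 3)).as("status"),
    trim(element_at(fields, 4)).as("ph_no")
  )))

result.printSchema()
```

Note that any fifth field (as in the `Jane` row) is simply ignored here, and that `element_at` returns null for a missing index only under the default non-ANSI settings of Spark 3.x.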
For the first step, I would do as follows:
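1. Remove `[[` and `]]` from the `family_name` string using the `regexp_replace` function
2. Create the first array level by splitting that string using the `split` function
3. Create the second array level by splitting each element of the first array using the `transform` and `split` functions
And for the second step, use the `struct` function to build a `struct`, picking elements in the arrays using the `element_at` function.
Thus, the complete solution for Spark 3.0 and greater chains these functions, with `data` as the input dataframe.
In Spark 2.X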
Using Spark 2.X, we have to rely on a user-defined function. First, we need to define a `case class` that represents our `struct`. Then, we define our user-defined function and apply it to our input dataframe.
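A minimal sketch of that approach, again written for `spark-shell` or a Scala script. The names `FamilyMember` and `toFamily`, the local `SparkSession`, and the hard-coded `data` rows copied from the question are assumptions of this sketch, not taken from the original answer. In a compiled application the case class would have to be declared at the top level rather than inside a method.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical case class mirroring the desired struct fields.
case class FamilyMember(f_name: String, l_name: String, status: String, ph_no: String)

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("family-structs-udf").getOrCreate()
import spark.implicits._

// Input rows copied from the question.
val data = Seq(
  ("[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]", "id1"),
  ("[[Tom, Riddle, Single, 888-888-8888]]", "id2")
).toDF("family_name", "id")

// Parses the raw string into one FamilyMember per inner bracket group.
val toFamily = udf((raw: String) =>
  Option(raw).map { s =>
    s.replaceAll("^\\[\\[|\\]\\]$", "")   // drop the outer "[[" and "]]"
      .split("\\],\\s*\\[")               // one string per person
      .toSeq
      .map { person =>
        val f = person.split(",", -1).map(_.trim)
        // Missing fields become null instead of throwing.
        def field(i: Int): String = if (i < f.length) f(i) else null
        FamilyMember(field(0), field(1), field(2), field(3))
      }
  }.orNull                                // a null input stays null
)

val result = data.withColumn("family_name", toFamily(col("family_name")))

result.printSchema()
```

Spark derives the `array<struct<...>>` return type from the case class, so the struct field names follow the case class field names.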
Result
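If you start from the `data` dataframe shown in the question, you get a `result` dataframe whose `family_name` column has the requested schema, `array<struct<f_name:string,l_name:string,status:string,ph_no:string>>`.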