I have a JSON file where a struct column is empty for one record and populated for another. I want to read it in Scala and create a DataFrame


I have a JSON file where a struct column is empty for one record and populated for another. I want to read it in Scala and create a DataFrame. Code I am using:

var df: DataFrame = spark.read.option("multiline", "true").option("mode", "PERMISSIVE").json(path)
var updatedDf: DataFrame = {
  if (null != parentNode && parentNode.trim.nonEmpty)
    df.select(explode(col(parentNode)).as(parentNode)).select(s"$parentNode.*")
  else df
}
println(parentNode)

It works when the records are consistent, but fails with the JSON file below.

{
    "result": [
        {
            "parent": "",
            "roles": "",
            "sys_created_by": "s466892"
        },
        {
            "parent": {
                "display_value": "AAT 01 Revenue Management and SkyCargo",
                "link": "https://emiratesgroup.service-now.com/api/now/"
            },
            "roles": "",
            "sys_created_by": "S331704"
        }
    ]
}

expecting to load the data into a dataframe

1 Answer

Answered by Lingesh.K

Because `parent` is an empty string in one record and an object in the other, schema inference cannot settle on a single type for that column. Define the schema explicitly, then use Spark's built-in `to_json` and `from_json` functions to serialize each row back to JSON and re-parse it against that schema:

import org.apache.spark.sql.types._

// Define the schema for the "parent" field
val parentSchema = new StructType()
  .add("display_value", StringType, nullable = true)
  .add("link", StringType, nullable = true)

// Define the schema for the "result" field
val resultSchema = new StructType()
  .add("parent", parentSchema, nullable = true)
  .add("roles", StringType, nullable = true)
  .add("sys_created_by", StringType, nullable = true)

// Define the overall schema
val schema = new StructType()
  .add("result", ArrayType(resultSchema), nullable = true)

import org.apache.spark.sql.functions.{col, explode, from_json, struct, to_json}

val df = spark.read.option("multiline", "true").json(path)

// Parse the JSON string into a DataFrame
val convertedDf = df.select(from_json(to_json(struct(col("*"))), schema).as("data"))

// Explode the "result" array and select the fields
val finalDF = convertedDf.select(explode(col("data.result")).as("result"))
  .select("result.*")

The result of the finalDF is as follows:

+----------------------------------------------------------------------------------------+-----+--------------+
|parent                                                                                  |roles|sys_created_by|
+----------------------------------------------------------------------------------------+-----+--------------+
|NULL                                                                                    |     |s466892       |
|{AAT 01 Revenue Management and SkyCargo, https://emiratesgroup.service-now.com/api/now/}|     |S331704       |
+----------------------------------------------------------------------------------------+-----+--------------+
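For reference, you can see why the original read fails by inspecting what schema inference produces for the mixed file. This is a sketch: it assumes a local SparkSession and that `path` points at the JSON file from the question.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: inspect the inferred schema for the file from the question.
val spark = SparkSession.builder().master("local[*]").appName("inspect").getOrCreate()
val raw = spark.read.option("multiline", "true").json(path)
raw.printSchema()
// Because one record has "parent": "" and the other has an object,
// Spark's inference cannot reconcile the types and falls back to
// StringType for `parent` — which is why selecting `parent.*`
// fails without the explicit schema shown above.
```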