I am trying to execute a Spark notebook that processes a JSON file which has a JSON array in one of its columns, named "data":

[{"payload":"here to check 1.\n checking again 1", "key1": " value1"},
 {"payload":"here to check 2.\n checking again 2", "key1": " value2"}]

I use explode_outer() to split the "data" column into separate rows, one row per JSON object in the array, so I expect 2 rows as the result.
The problem is that instead of 2 rows, I get 4. It seems to split on the "\n" inside the "payload" field of each object. Could anybody guide me on how to fix this?
e.g. the JSON:

{"alldata" : [{"payload":"here to check 1.\n checking again 1", "key1": " value1"},
{"payload":"here to check 2.\n checking again 2", "key1": " value2"}], "alldata1": "any"}

Code:

    val load = spark.read.schema(schema).json(path)
    val queryValTemp = load.select(
      col("alldata1"),
      explode_outer(col("alldata")).as("explode_data")
    )
    queryValTemp.show()

1 Answer

jishamenon

From what I discovered, if a JSON column contains \n characters, calling explode_outer() splits each element into 2 rows instead of 1. The workaround is to clean the column before calling explode_outer():

  • call to_json() on the column to serialize it to a JSON string,
  • remove the \n characters (e.g. with replaceAll / regexp_replace),
  • call from_json() on the result to parse it back with the original schema.
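The steps above can be sketched as follows. This is a minimal, untested sketch assuming the question's `schema`, `path`, and `alldata` column; note it uses Spark SQL's column-level `regexp_replace` in place of the string method `replaceAll`, since the cleanup happens inside a DataFrame column:

```scala
import org.apache.spark.sql.functions.{col, explode_outer, from_json, regexp_replace, to_json}

val load = spark.read.schema(schema).json(path)

// 1. to_json: serialize the array column to a JSON string
// 2. regexp_replace: strip the embedded newlines
// 3. from_json: parse it back using the original array schema
val cleaned = load.withColumn(
  "alldata",
  from_json(
    regexp_replace(to_json(col("alldata")), "\\n", " "),
    schema("alldata").dataType
  )
)

val queryValTemp = cleaned.select(
  col("alldata1"),
  explode_outer(col("alldata")).as("explode_data")
)
queryValTemp.show()
```

Each element of `alldata` should now produce exactly one exploded row, since the payload text no longer contains literal newlines.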