How to parse relaxed JSON in PySpark with a "dynamic" schema


I have a dataset and one of the columns contains a string that looks like a "relaxed" JSON format (no double quotes around keys).

#1 I'm looking for a way to parse it in PySpark. I tried from_json with a schema, but because the string isn't valid JSON this didn't work: all nested columns came back NULL.
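
For reference, my attempt looked roughly like this (the schema is abbreviated here; the real one spelled out the full nesting of the sample data below, and `df` is the dataset shown below):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Abbreviated illustration of what I tried -- the real schema mirrored
# the full nesting shown in the sample data below.
schema = StructType([
    StructField("metadata", StructType([
        StructField("version", StringType()),
    ])),
])

# from_json returns NULL for any row it cannot parse, and the unquoted
# keys make every row unparseable, so "parsed" ends up all NULL.
df = df.withColumn("parsed", F.from_json("json_data", schema))
```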

#2 The keys in the JSON column change depending on which questions a user answered. In the example below they are question5 and question7, but in reality both the keys and the structure of their values can vary. How can I handle this "dynamic" format? Could someone share an example? (One idea I'm considering is sketched after the sample data.)

My dataset looks like this:

customer_id: 1
json_data (formatted for readability):

{
  metadata: { version: 1. },
  attributes: {
    question5: [
      {
        provenance: {
          type: "confirmed",
          value: { origin: "ABC", timestamp: "2010-07-12T23:00:51Z" }
        },
        value: {
          type: "struct",
          value: {
            response: {
              type: "list",
              value: [ { type: "stringValue", value: "import value 1" } ]
            },
            category: { type: "stringValue", value: "job" }
          }
        }
      },
      {
        provenance: {
          type: "confirmed",
          value: { origin: "ABC", timestamp: "2010-07-12T23:00:51Z" }
        },
        value: {
          type: "struct",
          value: {
            response: {
              type: "list",
              value: [
                { type: "stringValue", value: "address 1" },
                { type: "stringValue", value: "address 2" }
              ]
            },
            category: { type: "stringValue", value: "address" }
          }
        }
      }
    ],
    question7: [
      {
        provenance: {
          type: "confirmed",
          value: { origin: "XYZ", timestamp: "2010-07-12T23:00:51Z" }
        },
        value: { type: "stringValue", value: "yes" }
      }
    ]
  }
}
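
The idea I'm considering (an untested sketch, field names taken from the sample above): first turn the relaxed string into strict JSON by quoting the bare keys, then parse attributes as a map so the question keys can stay dynamic, keeping each question's payload as a raw JSON string to unpack in a second pass. I've also seen that from_json accepts JSON reader options such as {"allowUnquotedFieldNames": "true"}, which might make the regex step unnecessary, but I haven't verified that it copes with this data (e.g. the trailing dot in version: 1.).

```python
import re
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Naive key-quoting: turns { metadata: ... } into { "metadata": ... }.
# Assumes keys are plain identifiers and that no quoted value contains
# a "{key:" or ", key:" pattern. The sample's `version: 1.` (trailing
# dot) would still need its own cleanup before strict parsing.
@F.udf(StringType())
def quote_keys(s):
    return re.sub(r'([{,]\s*)(\w+)(\s*:)', r'\1"\2"\3', s) if s else None

fixed = df.withColumn("json_strict", quote_keys("json_data"))

# attributes is a map<string, string>: the keys (question5, question7,
# ...) stay dynamic, and each value comes back as the raw JSON text of
# that question's answer list, ready for a second from_json pass.
parsed = fixed.withColumn(
    "parsed",
    F.from_json("json_strict", "metadata string, attributes map<string, string>"),
)
```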

I was able to parse the dataset in plain Python using a series of json_normalize calls with for loops. However, the tool I'm using requires PySpark, hence the questions above (a rough sketch of how I'm hoping to reuse that logic is below). Much appreciate your time and help!
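
Alternatively, could I just wrap the row-by-row Python logic in mapInPandas? A rough, untested sketch (the flattening is simplified compared with my real json_normalize loops, and the same relaxed-syntax caveats as above apply):

```python
import json
import re
import pandas as pd

def parse_partition(pdfs):
    # Runs once per batch of rows; reuses the same key-quoting trick
    # as above, then flattens with plain json instead of json_normalize.
    for pdf in pdfs:
        rows = []
        for cid, raw in zip(pdf["customer_id"], pdf["json_data"]):
            strict = re.sub(r'([{,]\s*)(\w+)(\s*:)', r'\1"\2"\3', raw)
            data = json.loads(strict)
            for question, answers in data.get("attributes", {}).items():
                for answer in answers:
                    rows.append({
                        "customer_id": str(cid),
                        "question": question,          # e.g. question5
                        "answer": json.dumps(answer),  # raw payload to unpack later
                    })
        yield pd.DataFrame(rows, columns=["customer_id", "question", "answer"])

out = df.mapInPandas(
    parse_partition,
    schema="customer_id string, question string, answer string",
)
```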
