How to parse relaxed JSON in PySpark with a "dynamic" schema


I have a dataset and one of the columns contains a string that looks like a "relaxed" JSON format (no double quotes around keys).

#1 I'm looking for a way to parse it in PySpark. I tried from_json with a schema, but because the string isn't valid JSON this didn't work: all nested columns came back NULL.
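
For reference, my attempt looked roughly like this (the schema is abbreviated here; the real one spelled out the full nesting of the sample data below, and `df` is the dataset shown below):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Abbreviated illustration of what I tried -- the real schema mirrored
# the full nesting shown in the sample data below.
schema = StructType([
    StructField("metadata", StructType([
        StructField("version", StringType()),
    ])),
])

# from_json returns NULL for any row it cannot parse, and the unquoted
# keys make every row unparseable, so "parsed" ends up all NULL.
df = df.withColumn("parsed", F.from_json("json_data", schema))
```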

#2 The keys in the JSON column change depending on which questions a user answered. In the example below they are question5 and question7, but in reality both the keys and the structure of their values can vary. How can I handle this "dynamic" format? Could someone share an example? (One idea I'm considering is sketched after the sample data.)

My dataset looks like this:

customer_id: 1
json_data (formatted for readability):

{
  metadata: { version: 1. },
  attributes: {
    question5: [
      {
        provenance: {
          type: "confirmed",
          value: { origin: "ABC", timestamp: "2010-07-12T23:00:51Z" }
        },
        value: {
          type: "struct",
          value: {
            response: {
              type: "list",
              value: [ { type: "stringValue", value: "import value 1" } ]
            },
            category: { type: "stringValue", value: "job" }
          }
        }
      },
      {
        provenance: {
          type: "confirmed",
          value: { origin: "ABC", timestamp: "2010-07-12T23:00:51Z" }
        },
        value: {
          type: "struct",
          value: {
            response: {
              type: "list",
              value: [
                { type: "stringValue", value: "address 1" },
                { type: "stringValue", value: "address 2" }
              ]
            },
            category: { type: "stringValue", value: "address" }
          }
        }
      }
    ],
    question7: [
      {
        provenance: {
          type: "confirmed",
          value: { origin: "XYZ", timestamp: "2010-07-12T23:00:51Z" }
        },
        value: { type: "stringValue", value: "yes" }
      }
    ]
  }
}
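
The idea I'm considering (an untested sketch, field names taken from the sample above): first turn the relaxed string into strict JSON by quoting the bare keys, then parse attributes as a map so the question keys can stay dynamic, keeping each question's payload as a raw JSON string to unpack in a second pass. I've also seen that from_json accepts JSON reader options such as {"allowUnquotedFieldNames": "true"}, which might make the regex step unnecessary, but I haven't verified that it copes with this data (e.g. the trailing dot in version: 1.).

```python
import re
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Naive key-quoting: turns { metadata: ... } into { "metadata": ... }.
# Assumes keys are plain identifiers and that no quoted value contains
# a "{key:" or ", key:" pattern. The sample's `version: 1.` (trailing
# dot) would still need its own cleanup before strict parsing.
@F.udf(StringType())
def quote_keys(s):
    return re.sub(r'([{,]\s*)(\w+)(\s*:)', r'\1"\2"\3', s) if s else None

fixed = df.withColumn("json_strict", quote_keys("json_data"))

# attributes is a map<string, string>: the keys (question5, question7,
# ...) stay dynamic, and each value comes back as the raw JSON text of
# that question's answer list, ready for a second from_json pass.
parsed = fixed.withColumn(
    "parsed",
    F.from_json("json_strict", "metadata string, attributes map<string, string>"),
)
```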

I was able to parse the dataset in plain Python using a series of json_normalize calls with for loops. However, the tool I'm using requires PySpark, hence the questions above (a rough sketch of how I'm hoping to reuse that logic is below). Much appreciate your time and help!
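
Alternatively, could I just wrap the row-by-row Python logic in mapInPandas? A rough, untested sketch (the flattening is simplified compared with my real json_normalize loops, and the same relaxed-syntax caveats as above apply):

```python
import json
import re
import pandas as pd

def parse_partition(pdfs):
    # Runs once per batch of rows; reuses the same key-quoting trick
    # as above, then flattens with plain json instead of json_normalize.
    for pdf in pdfs:
        rows = []
        for cid, raw in zip(pdf["customer_id"], pdf["json_data"]):
            strict = re.sub(r'([{,]\s*)(\w+)(\s*:)', r'\1"\2"\3', raw)
            data = json.loads(strict)
            for question, answers in data.get("attributes", {}).items():
                for answer in answers:
                    rows.append({
                        "customer_id": str(cid),
                        "question": question,          # e.g. question5
                        "answer": json.dumps(answer),  # raw payload to unpack later
                    })
        yield pd.DataFrame(rows, columns=["customer_id", "question", "answer"])

out = df.mapInPandas(
    parse_partition,
    schema="customer_id string, question string, answer string",
)
```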
