I am trying to read nested JSON into a Dask DataFrame, preferably with code that'll do the heavy lifting.
Here's the JSON file I am reading:
{
  "data": [
    {
      "name": "george",
      "age": 16,
      "exams": [
        {"subject": "geometry", "score": 56},
        {"subject": "poetry", "score": 88}
      ]
    },
    {
      "name": "nora",
      "age": 7,
      "exams": [
        {"subject": "geometry", "score": 87},
        {"subject": "poetry", "score": 94}
      ]
    }
  ]
}
Here is the resulting DataFrame I would like:
name | age | exam_subject | exam_score |
---|---|---|---|
george | 16 | geometry | 56 |
george | 16 | poetry | 88 |
nora | 7 | geometry | 87 |
nora | 7 | poetry | 94 |
Here's how I'd accomplish this with pandas:
import pandas as pd

df = pd.read_json("students3.json", orient="split")
exploded = df.explode("exams")  # one row per exam dict
pd.concat([exploded[["name", "age"]].reset_index(drop=True),
           pd.json_normalize(exploded["exams"]).add_prefix("exam_")], axis=1)
Dask doesn't have json_normalize, so what's the best way to accomplish this task?
Pydantic offers excellent JSON validation and ingestion. You can define a Pydantic model for each 'top level' JSON entry, convert the validated models to Python dictionaries in a loop to build a List[Dict], and then construct a DataFrame from that list.
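Here's a minimal sketch of that approach, assuming Pydantic v2 (model_validate; on v1 you'd use parse_obj instead) and the students3.json layout shown above. The Exam and Student model names are illustrative, and the final dd.from_pandas call hands the flattened frame to Dask:

import json
from typing import List

import dask.dataframe as dd
import pandas as pd
from pydantic import BaseModel

class Exam(BaseModel):
    subject: str
    score: int

class Student(BaseModel):
    name: str
    age: int
    exams: List[Exam]

with open("students3.json") as f:
    raw = json.load(f)

# Validate each top-level entry, then flatten to one dict per (student, exam) pair
students = [Student.model_validate(item) for item in raw["data"]]
rows = [
    {"name": s.name, "age": s.age, "exam_subject": e.subject, "exam_score": e.score}
    for s in students
    for e in s.exams
]

ddf = dd.from_pandas(pd.DataFrame(rows), npartitions=1)

Calling ddf.compute() yields the four-row frame shown above. For data too large for a single file, the same per-record flattening could instead run inside a dask.bag map over many JSON files before converting to a DataFrame.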