Read nested JSON into Dask DataFrame


I am trying to read nested JSON into a Dask DataFrame, preferably with code that'll do the heavy lifting.

Here's the JSON file I am reading:

{
    "data": [{
        "name": "george",
        "age": 16,
        "exams": [{
                "subject": "geometry",
                "score": 56
            },
            {
                "subject": "poetry",
                "score": 88
            }
        ]

    }, {
        "name": "nora",
        "age": 7,
        "exams": [{
                "subject": "geometry",
                "score": 87
            },
            {
                "subject": "poetry",
                "score": 94
            }
        ]
    }]
}

Here is the resulting DataFrame I would like.

name    age  exam_subject  exam_score
george  16   geometry      56
george  16   poetry        88
nora    7    geometry      87
nora    7    poetry        94

Here's how I'd accomplish this with pandas:

df = pd.read_json("students3.json", orient="split")
exploded = df.explode("exams")
pd.concat([exploded[["name", "age"]].reset_index(drop=True), pd.json_normalize(exploded["exams"])], axis=1)

Dask doesn't have json_normalize, so what's the best way to accomplish this task?

There are 4 answers

Answered by Brent

Pydantic offers excellent JSON validation and ingestion. You can define a Pydantic model for each "top level" JSON entry, convert the validated models to Python dictionaries in a loop to build a List[Dict], and then construct a DataFrame from that list.
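
A minimal sketch of that idea (my addition, not from the original answer), assuming Pydantic is installed; the model names Exam and Student and the flattening loop are made up for this example:

import json
from typing import Dict, List

import pandas as pd
from pydantic import BaseModel

# Hypothetical models mirroring the question's JSON layout
class Exam(BaseModel):
    subject: str
    score: int

class Student(BaseModel):
    name: str
    age: int
    exams: List[Exam]

with open("students3.json", "r", encoding="utf-8") as f:
    raw = json.load(f)

# Validate each top-level entry, then flatten into a List[Dict]
rows: List[Dict] = []
for entry in raw["data"]:
    student = Student(**entry)
    for exam in student.exams:
        rows.append(
            {
                "name": student.name,
                "age": student.age,
                "exam_subject": exam.subject,
                "exam_score": exam.score,
            }
        )

df = pd.DataFrame(rows)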

Answered by Powers

I was inspired by the other answers to come up with this solution.

import dask.dataframe as dd
import pandas as pd

ddf = dd.read_json("students3.json", orient="split")

def pandas_fn(df):
    # Explode the list column so each exam gets its own row, then
    # flatten the exam dicts into columns with json_normalize.
    exploded = df.explode("exams")
    return pd.concat(
        [
            exploded[["name", "age"]].reset_index(drop=True),
            pd.json_normalize(exploded["exams"]),
        ],
        axis=1,
    )

res = ddf.map_partitions(
    pandas_fn,  # no lambda wrapper needed; the function already takes a DataFrame
    meta=(
        ("name", "object"),
        ("age", "int64"),
        ("subject", "object"),
        ("score", "int64"),
    ),
)

print(res.compute()) gives this output:

     name  age   subject  score
0  george   16  geometry     56
1  george   16    poetry     88
2    nora    7  geometry     87
3    nora    7    poetry     94
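
As a side note on the meta argument (my addition, not from the answer): the tuple-of-tuples form works, but Dask also accepts an empty pandas DataFrame as meta, which makes the expected output schema explicit:

meta = pd.DataFrame(
    {
        "name": pd.Series(dtype="object"),
        "age": pd.Series(dtype="int64"),
        "subject": pd.Series(dtype="object"),
        "score": pd.Series(dtype="int64"),
    }
)
res = ddf.map_partitions(pandas_fn, meta=meta)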

Answered by SultanOrazbayev

If the file contains JSON Lines, the most scalable approach is to use dask.bag and map the pandas snippet across each bag partition.

If the file is one large JSON document, the opening/closing brackets will cause problems, so an additional function is needed to strip them before parsing the text as JSON.

Rough pseudo-code:

import json
import dask.bag as db

bag = db.read_text("students3.json")

# if the file contains JSON Lines (one JSON object per line)
option1 = bag.map(json.loads).map(pandas_fn)

# if the file is a single JSON document, strip the enclosing
# brackets first; pandas_fn is the flattening snippet from the question
option2 = bag.map(convert_to_jsonlines).map(json.loads).map(pandas_fn)
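
For the specific single-document file in the question, here is one concrete way to fill in that sketch (my own, not from the answer). Since db.read_text yields one element per line, this version loads the parsed records through dask.delayed instead; load_records and flatten_record are hypothetical helper names:

import json

import dask
import dask.bag as db

@dask.delayed
def load_records(path):
    # Parse the whole document and return the list under "data"
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)["data"]

def flatten_record(record):
    # One flat dict per exam, repeating the student fields
    return [
        {
            "name": record["name"],
            "age": record["age"],
            "exam_subject": exam["subject"],
            "exam_score": exam["score"],
        }
        for exam in record["exams"]
    ]

bag = db.from_delayed([load_records("students3.json")])
ddf = bag.map(flatten_record).flatten().to_dataframe()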

Answered by Ynjxsjmh

Use pd.json_normalize with record_path and meta:

import json
import pandas as pd

with open('students3.json', 'r', encoding='utf-8') as f:
    data = json.loads(f.read())

df = pd.json_normalize(data['data'], record_path='exams', meta=['name', 'age'])

which gives:

    subject  score    name age
0  geometry     56  george  16
1    poetry     88  george  16
2  geometry     87    nora   7
3    poetry     94    nora   7
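
Since the question asks for a Dask DataFrame, the normalized pandas frame can then be handed to Dask (my addition; this assumes the flattened data fits in memory on one machine):

import dask.dataframe as dd

# Wrap the flattened pandas frame in a single-partition Dask DataFrame
ddf = dd.from_pandas(df, npartitions=1)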