How to normalise complex JSON with multiple levels of nested information in Python

153 views Asked by At

I get the data in a predefined format I cannot change:

data = 
{
  "id": 12345,
  "name": "Doe",
  "gender": {
    "textEn": "Masculin"
  },
  "professions": [
    {
      "job_description": {
        "textEn": "Job description"
      },
      "cetTitles": [
        {
          "cetTitleType": {
            "textEn": "Recognition"
          },
          "issuanceDate": "1992-04-14T00:00:00Z",
          "phoneNumbers": [
            "123 221 00 70"
          ]
        }
      ]
    }
  ]
}

While I can normalise data one level down ('gender') through the pd.json_normalize function, I struggle to access info further down in the hierarchy.

I tried e.g. getting the job description through data = pd.json_normalize(data,record_path=['professions','job_description'],meta='id') but I get a TypeError. How can I address this and extract all data into a table so the data above is represented as one row with all content as fields?

The output I would like to ultimately achieve looks as follows:

id name gender job_description cetTitleType issuanceDate phoneNumbers
12345 Doe Masculin Job description Recognition 1992-04-14T00:00:00Z 123 221 00 70
1

There are 1 answers

1
Jason Baker On BEST ANSWER

Using json_normalize():

Code:

df = pd.json_normalize(
    data=data,
    record_path=["professions", "cetTitles"],
    meta=["id", "name", "gender", ["professions", "job_description"]]
).explode(column="phoneNumbers")

df = df.join(pd.DataFrame(df.pop("gender").values.tolist()))
df.columns = df.columns.str.split(".").str[-1]
df = df.rename(columns={"textEn": "gender"})
df = df.join(pd.DataFrame(df.pop("job_description").values.tolist()))
df = df.rename(columns={"textEn": "job_description"})
print(df)

Output:

           issuanceDate   phoneNumbers       gender     id name    gender  job_description
0  1992-04-14T00:00:00Z  123 221 00 70  Recognition  12345  Doe  Masculin  Job description