Python/PySpark parse JSON string with numbered attributes

118 views Asked by At

I need to store JSON strings like the one below in some file format different from plaintext (e.g: parquet):

{
  "vidName": "Foo",
  "vidInfo.0.size.length": 10,
  "vidInfo.0.size.width": 10,
  "vidInfo.0.quality": "Good",
  "vidInfo.1.size.length": 7,
  "vidInfo.1.size.width": 3,
  "vidInfo.1.quality": "Bad",
  "vidInfo.2.size.length": 10,
  "vidInfo.2.size.width": 2,
  "vidInfo.2.quality": "Excelent"
}

There's no known bound for the index of vidInfo (can be 10, 20). Thus I want either to have vidInfos in an array, or explode such JSON object into multiple smaller objects.

I found this question: PHP JSON parsing (number attributes?) But it is in PHP which I do not quite understand. And I am not sure whether it is same as what I need.

The intermediate data should be something like this:

{
  "vidName": "Foo",
  "vidInfo": [
    {
      "id": 0,
      "size": {
        "length": 10,
        "width": 10
      },
      "quality": "Good"
    },
    {
      "id": 1,
      "size": {
        "length": 7,
        "width": 3
      },
      "quality": "Bad"
    },
    {
      "id": 2,
      "size": {
        "length": 10,
        "width": 2
      },
      "quality": "Excelent"
    }
  ]
}

or like this:

{
  "vidName": "Foo",
  "vidInfo": [
    {
      "size": {
        "length": 10,
        "width": 10
      },
      "quality": "Good"
    },
    {
      "size": {
        "length": 7,
        "width": 3
      },
      "quality": "Bad"
    },
    {
      "size": {
        "length": 10,
        "width": 2
      },
      "quality": "Excelent"
    }
  ]
}

I am stuck, and would need some hints to move on. Could you please help? Thanks a lot for your help.

1

There are 1 answers

0
Averell On

I found this library https://github.com/amirziai/flatten which does the trick.

In [154]: some_json = {
 ...:   "vidName": "Foo",
 ...:   "vidInfo.0.size.length": 10,
 ...:   "vidInfo.0.size.width": 10,
 ...:   "vidInfo.0.quality": "Good",
 ...:   "vidInfo.1.size.length": 7,
 ...:   "vidInfo.1.size.width": 3,
 ...:   "vidInfo.1.quality": "Bad",
 ...:   "vidInfo.2.size.length": 10,
 ...:   "vidInfo.2.size.width": 2,
 ...:   "vidInfo.2.quality": "Excelent"
 ...: }

In [155]: some_json
Out[155]:
{'vidName': 'Foo',
 'vidInfo.0.size.length': 10,
 'vidInfo.0.size.width': 10,
 'vidInfo.0.quality': 'Good',
 'vidInfo.1.size.length': 7,
 'vidInfo.1.size.width': 3,
 'vidInfo.1.quality': 'Bad',
 'vidInfo.2.size.length': 10,
 'vidInfo.2.size.width': 2,
 'vidInfo.2.quality': 'Excelent'}

In [156]: from flatten_json import unflatten_list
     ...: import json 
     ...: nested_json = unflatten_list(json.loads(json.dumps(some_json)), '.')

In [157]: nested_json
Out[157]:
{'vidInfo': [{'quality': 'Good', 'size': {'length': 10, 'width': 10}},
  {'quality': 'Bad', 'size': {'length': 7, 'width': 3}},
  {'quality': 'Excelent', 'size': {'length': 10, 'width': 2}}],
 'vidName': 'Foo'}