How to quickly un-nest a Pandas dataframe


A JSON file I need to work with loads as a list of nested dicts, and converting it to a dataframe leaves lists nested inside the columns. The file itself is nested.

Sample JSON:

{
  "State": [
    {
      "ts": "2018-04-11T21:37:05.401Z",
      "sensor": [
        "accBodyX_ftPerSec2"
      ],
      "value": null
    },
    {
      "ts": "2018-04-11T21:37:05.901Z",
      "sensor": [
        "accBodyX_ftPerSec2"
      ],
      "value": [
        -3.38919
      ]
    },
    {
      "ts": "2018-04-11T21:37:05.901Z",
      "sensor": [
        "accBodyY_ftPerSec2"
      ],
      "value": [
        -2.004781
      ]
    },
    {
      "ts": "2018-04-11T21:37:05.901Z",
      "sensor": [
        "accBodyZ_ftPerSec2"
      ],
      "value": [
        -34.77694
      ]
    }
  ]
}

The dataframe looks like:

    sensor                  ts                          value
0   [accBodyX_ftPerSec2]    2018-04-11T21:37:05.901Z    [-3.38919]
1   [accBodyY_ftPerSec2]    2018-04-11T21:37:05.901Z    [-2.004781]
2   [accBodyZ_ftPerSec2]    2018-04-11T21:37:05.901Z    [-34.77694]

Ultimately, I'd like to remove the nesting or find a way to work with it. The goal is to extract the values for a given sensor name, with accompanying timestamps, into another dataframe for processing/plotting, something like this:

    ts                         value
0   2018-04-11T21:37:05.901Z   -3.38919
1   2018-04-11T21:37:06.401Z   -3.00241
2   2018-04-11T21:37:06.901Z   -3.87694
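Once the lists are unwrapped into scalar columns, a plain boolean filter would give that per-sensor view. A minimal sketch, assuming an already-flattened frame (the data here is made up to match the shape above):

```python
import pandas as pd

# Hypothetical flattened frame: scalar sensor/value columns
df = pd.DataFrame({
    "sensor": ["accBodyX_ftPerSec2", "accBodyY_ftPerSec2", "accBodyX_ftPerSec2"],
    "ts": ["2018-04-11T21:37:05.901Z",
           "2018-04-11T21:37:05.901Z",
           "2018-04-11T21:37:06.401Z"],
    "value": [-3.38919, -2.004781, -3.00241],
})

# Keep only one sensor's rows, and only the ts/value columns
x = df.loc[df["sensor"] == "accBodyX_ftPerSec2", ["ts", "value"]]
x = x.reset_index(drop=True)
print(x)
```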

To remove the nesting I've done the following. It is slow even on just 100,000 rows, though thankfully much faster than a for loop (made possible thanks to this post: python pandas operations on columns):

def func(row):
    row.sensor = row.sensor[0]
    if type(row.value) is list:
        row.value = row.value[0]
    return row

df = df.apply(func, axis=1)  # apply returns a new frame; reassign to keep the result

For working with the nesting I'm able to extract individual values. For example this:

print( df.iloc[:,2].iloc[1][0] )
-2.004781

However, trying to return a list of the index-0 values from each row's list returns just the first value:

print( df.iloc[:,2].iloc[:][0] )
-3.38919

Of course I could do this with a for loop but I know there's a way to do it with Pandas functions that I'm not able to discover yet.
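For what it's worth, pandas can index into list elements in a vectorized way via the `.str` accessor, which works on any sequence, not just strings. A sketch with made-up data shaped like the question's frame; `.str[0]` also yields NaN for `None` entries rather than raising:

```python
import pandas as pd

df = pd.DataFrame({
    "sensor": [["accBodyX_ftPerSec2"], ["accBodyY_ftPerSec2"]],
    "value": [[-3.38919], [-2.004781]],
})

# .str[0] pulls the first element out of each list in the column
df["sensor"] = df["sensor"].str[0]
df["value"] = df["value"].str[0]
print(df)
```

This avoids the row-wise `apply` entirely.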

There are 2 answers

Brad Solomon (BEST ANSWER)

You may need to just do some manual cleaning-up before reading into a DataFrame:

>>> import json
>>> import pandas as pd


>>> def collapse_lists(data):
...     return [{k: v[0] if (isinstance(v, list) and len(v) == 1)
...             else v for k, v in d.items()} for d in data]


>>> with open('state.json') as f:
...     data = pd.DataFrame(collapse_lists(json.load(f)['State']))

>>> data
               sensor                        ts      value
0  accBodyX_ftPerSec2  2018-04-11T21:37:05.401Z        NaN
1  accBodyX_ftPerSec2  2018-04-11T21:37:05.901Z  -3.389190
2  accBodyY_ftPerSec2  2018-04-11T21:37:05.901Z  -2.004781
3  accBodyZ_ftPerSec2  2018-04-11T21:37:05.901Z -34.776940

This loads the JSON file into a Python list of dictionaries, converts any length-1 lists into scalar values, and then loads that result into a DataFrame. That is admittedly not the most efficient means, but your other option of parsing the JSON itself is probably overkill unless the file is massive.

Finally, to convert to datetime:

>>> data['ts'] = pd.to_datetime(data['ts'])

>>> data.dtypes
sensor            object
ts        datetime64[ns]
value            float64
dtype: object

You may also want to consider converting sensor to a categorical data type to save a possibly significant amount of memory:

The memory usage of a Categorical is proportional to the number of categories plus the length of the data. In contrast, an object dtype is a constant times the length of the data. (source)
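A quick sketch of that conversion, using a hypothetical frame with repeated sensor names (the memory numbers will vary by pandas version):

```python
import pandas as pd

# Hypothetical data: many rows, few distinct sensor names
data = pd.DataFrame({
    "sensor": ["accBodyX_ftPerSec2"] * 500 + ["accBodyY_ftPerSec2"] * 500,
})

before = data["sensor"].memory_usage(deep=True)
data["sensor"] = data["sensor"].astype("category")
after = data["sensor"].memory_usage(deep=True)
print(before, after)  # categorical stores each name once plus small integer codes
```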


In explicit-loop form, this would look like:

def collapse_lists(data):
    result = []
    for d in data:
        entry = {}
        for k, v in d.items():
            # Collapse length-1 lists to their single element
            if isinstance(v, list) and len(v) == 1:
                entry[k] = v[0]
            else:
                entry[k] = v
        result.append(entry)
    return result
Matt Messersmith

In case you ever have entries with multiple values/sensors, here's some code that might help.

The test JSON (modified to have multiple values/sensors):

{
    "State": [
        {
            "ts": "2018-04-11T21:37:05.401Z",
            "sensor": [
                "accBodyX_ftPerSec2"
            ],
            "value": null
        },
        {
            "ts": "2018-04-11T21:37:05.100Z",
            "sensor": [
                "accBodyX_ftPerSec2",
                "accBodyY_ftPerSec2"
            ],
            "value": null
        },
        {
            "ts": "2018-04-11T21:37:05.901Z",
            "sensor": [
                "accBodyX_ftPerSec2"
            ],
            "value": [
                -3.38919
            ]
        },
        {
            "ts": "2018-04-11T21:37:05.901Z",
            "sensor": [
                "accBodyY_ftPerSec2"
            ],
            "value": [
                 -2.004781
            ]
        },
        {
            "ts": "2018-04-11T21:37:05.901Z",
            "sensor": [
                "accBodyX_ftPerSec2",
                "accBodyY_ftPerSec2",
                "accBodyZ_ftPerSec2"
            ],
            "value": [
                -1.234567,
                4.56789,
                -34.77694
            ]
        }
    ]
}

Some code to beat it into a df such that each timestamp/sensor combo is a new row:

import json
import pandas as pd

def grab_json(json_filename):
    with open(json_filename, 'r') as f:
        json_str = f.read()
    json_dict = json.loads(json_str)
    return json_dict

def create_row_per_timestamp_and_sensor(data):
    result = []
    for sub_dict in data:
        # Make sure we have an equal number of sensors/values
        values = [None]*len(sub_dict['sensor']) if sub_dict['value'] is None else sub_dict['value']

        # Zip and iterate over each sensor/value respectively
        for sensor, value in zip(sub_dict['sensor'], values):
            result.append({'ts': sub_dict['ts'],
                           'sensor': sensor,
                           'value': value})
    return result


json_dict = grab_json("df.json")  # replace "df.json" with your filename
df_list = create_row_per_timestamp_and_sensor(json_dict['State'])
new_df = pd.DataFrame(df_list)
print(new_df)

outputs:

               sensor                        ts      value
0  accBodyX_ftPerSec2  2018-04-11T21:37:05.401Z        NaN
1  accBodyX_ftPerSec2  2018-04-11T21:37:05.100Z        NaN
2  accBodyY_ftPerSec2  2018-04-11T21:37:05.100Z        NaN
3  accBodyX_ftPerSec2  2018-04-11T21:37:05.901Z  -3.389190
4  accBodyY_ftPerSec2  2018-04-11T21:37:05.901Z  -2.004781
5  accBodyX_ftPerSec2  2018-04-11T21:37:05.901Z  -1.234567
6  accBodyY_ftPerSec2  2018-04-11T21:37:05.901Z   4.567890
7  accBodyZ_ftPerSec2  2018-04-11T21:37:05.901Z -34.776940
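From there, a wide one-column-per-sensor layout can be handy for plotting. A sketch on the same shape of data; note that duplicate ts/sensor pairs (like rows 3 and 5 above) get aggregated, here with the mean:

```python
import pandas as pd

new_df = pd.DataFrame({
    "sensor": ["accBodyX_ftPerSec2", "accBodyY_ftPerSec2", "accBodyZ_ftPerSec2"],
    "ts": ["2018-04-11T21:37:05.901Z"] * 3,
    "value": [-3.38919, -2.004781, -34.77694],
})

# One row per timestamp, one column per sensor
wide = new_df.pivot_table(index="ts", columns="sensor", values="value",
                          aggfunc="mean")
print(wide)
```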