Flattening a nested JSON file into a DataFrame using pandas json_normalize


I have a big JSON data file and I want to convert it into tabular form. I am trying to flatten the data into a dataframe using json_normalize. So far I have this:

code so far

I want to further flatten the submissions and products data into columns. I tried this:

submission_data = pd.json_normalize(data=rawData['results'], record_path=rawData['results']['submissions'], meta=['application_number', 'sponsor_name'], errors='ignore')
submission_data.head(3)

But I am getting this error: TypeError: list indices must be integers or slices, not str

Any input on this would be helpful.


There is 1 answer

Lourenço Monteiro Rodrigues On

Since submissions and products are lists (and not objects with a regular structure), json_normalize will leave them untouched. Also, given that they are lists, can you make sure that every record always has the same number of them? If not, spreading them across columns makes no sense. If submissions and products come in pairs (i.e. every submission corresponds to one product), you can consider spreading them across rows instead (a melt-style strategy).
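One way to check whether the lists line up is to compare their lengths per record. This is a minimal sketch, assuming the records are already in a dataframe with list-valued submissions and products columns; the sample values and the field names inside the lists are invented for illustration:

```python
import pandas as pd

# Hypothetical sample shaped like the question's data (all values are invented).
df = pd.DataFrame({
    "application_number": ["A1", "A2"],
    "submissions": [
        [{"submission_type": "ORIG"}, {"submission_type": "SUPPL"}],
        [{"submission_type": "ORIG"}],
    ],
    "products": [
        [{"product_number": "001"}, {"product_number": "002"}],
        [{"product_number": "001"}, {"product_number": "002"}],
    ],
})

# Series.str.len() also works on list-valued cells: it returns each list's length.
n_sub = df["submissions"].str.len()
n_prod = df["products"].str.len()

# True only for records that have as many submissions as products.
same_length = n_sub == n_prod
print(df.loc[~same_length, "application_number"].tolist())
```

If the last line prints any application numbers, those records cannot be paired one-to-one by position.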

Finally, regarding the error: raw_data seems to be a list of objects that each contain a 'results' field. If so, you cannot access raw_data['results'] directly; you need raw_data[0]['results'] to get the results from the first object.
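It is also worth knowing that record_path expects the name of the key holding the nested list (a string or a list of strings), not the nested data itself. Here is a minimal sketch with made-up records, assuming every record has a submissions list; the results variable and the fields inside each submission are hypothetical, and record_prefix is used only to keep the nested column names apart from the meta columns:

```python
import pandas as pd

# Hypothetical records standing in for the list of applications (values invented).
results = [
    {
        "application_number": "A1",
        "sponsor_name": "X",
        "submissions": [
            {"submission_type": "ORIG", "submission_status": "AP"},
            {"submission_type": "SUPPL", "submission_status": "AP"},
        ],
    },
    {
        "application_number": "A2",
        "sponsor_name": "Y",
        "submissions": [{"submission_type": "ORIG", "submission_status": "TA"}],
    },
]

# record_path names the nested list; meta names the parent fields to repeat per row.
submission_data = pd.json_normalize(
    results,
    record_path="submissions",
    meta=["application_number", "sponsor_name"],
    record_prefix="submissions.",
)
print(submission_data.head(3))
```

This gives one row per submission, with the application number and sponsor name repeated on each row.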

Adding a proposed solution

Given your data structure, what I would do is the following:

  1. normalize the raw_data as you do in the notebook.
  2. for each line of the resulting dataframe:
     a. normalize the JSON in the 'submissions' field
     b. change the column names of that resulting dataframe to 'submissions.<column_name>'
     c. add a column with value equal to the application number of the line you are evaluating
     d. add that resulting dataframe to a list, collecting all such dataframes
  3. concatenate those dataframes
  4. merge the original dataframe with the concatenated one using 'application_number' as the key, and drop the submissions column.

Repeat the process for the 'products'; however, unless you know the relationship between submissions and products, there is no clear way of merging the dataframes you get:

  • If they have no relationship except for being under the same application number, you basically get separate datasets for each.
  • If there is a one-to-one relationship, you can just merge them by index (concatenate each line)

In code:

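import pandas as pd

# raw_data: the list of application records, normalized as in the question's notebook
df = pd.json_normalize(raw_data)

submissions = []
products = []

for i, line in df.iterrows():
    # flatten this record's submissions, prefix the columns, and tag with the application number
    temp_df_sub = pd.json_normalize(line['submissions']).add_prefix('submissions.')
    temp_df_sub['application_number'] = line['application_number']
    submissions.append(temp_df_sub)

    # same for this record's products
    temp_df_prod = pd.json_normalize(line['products']).add_prefix('products.')
    temp_df_prod['application_number'] = line['application_number']
    products.append(temp_df_prod)

# ignore_index gives each combined frame a clean 0..n-1 index
submissions_df = pd.concat(submissions, ignore_index=True)
products_df = pd.concat(products, ignore_index=True)

# the list columns are no longer needed once they have been flattened
df = df.drop(columns=['submissions', 'products'])


# if one-to-one relationship between submissions and products
# (rows are paired by position, so both frames must have the same length and order)
sub_prod_df = pd.concat([submissions_df, products_df.drop(columns='application_number')], axis=1)
final_df = df.merge(sub_prod_df, on='application_number')


# if no relationship
final_sub_df = submissions_df.merge(df, on='application_number')
final_prod_df = products_df.merge(df, on='application_number')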