How to update a pandas dataframe, from multiple API calls


I need to write a Python script that does the following:

  1. Read a csv file with the columns (person_id, name, flag). The file has 3000 rows.
  2. For each person_id in the csv file, call a URL with a GET request, passing the person_id: http://api.myendpoint.intranet/get-data/1234 The URL returns information about that person_id, as in the response example below. I need to extract all the rents objects and save them to my csv.

My attempt so far:

import pandas as pd
import requests

path = '.'  # placeholder: directory that contains data.csv
ids = pd.read_csv(f"{path}/data.csv", delimiter=';')

rows = []
for person_id in ids['person_id']:
    response = requests.get(f'http://api.myendpoint.intranet/get-data/{person_id}')
    data = response.json()
    # collect one row per rent object
    for rent in data['rents']:
        rows.append([person_id, rent['carId'], rent['price'], rent['rentStatus']])

person_rents = pd.DataFrame(rows, columns=['person_id', 'carId', 'price', 'rentStatus'])

My output needs to look like this:

person_id;name;flag;cardId;price;rentStatus
1000;Joseph;1;6638;1000;active
1000;Joseph;1;5566;2000;active

Response example

{
    "active": false,
    "ctodx": false,
    "rents": [{
            "carId": 6638,
            "price": 1000,
            "rentStatus": "active"
        }, {
            "carId": 5566,
            "price": 2000,
            "rentStatus": "active"
        }
    ],
    "responseCode": "OK",
    "status": [{
            "request": 345,
            "requestStatus": "F"
        }, {
            "requestId": 678,
            "requestStatus": "P"
        }
    ],
    "transaction": false
}
  3. After saving the additional data from the response to the csv, I need to get data from another endpoint, using the carId in the URL. The mileage result must be saved in the same csv. http://api.myendpoint.intranet/get-mileage/6638 http://api.myendpoint.intranet/get-mileage/5566

Each call returns something like this:

{"mileage":1000.0000}
{"mileage":550.0000}

The final output must be

person_id;name;flag;cardId;price;rentStatus;mileage
1000;Joseph;1;6638;1000;active;1000.0000
1000;Joseph;1;5566;2000;active;550.0000

Can someone help me with this script? It could use pandas or any other Python 3 library.


There are 3 answers

Trenton McKinney On BEST ANSWER

Code Explanation

  • Create a dataframe, df, with pd.read_csv.
    • All of the values in 'person_id' are expected to be unique.
  • Use .apply on 'person_id' to call prepare_data.
    • prepare_data expects 'person_id' to be a str or int, as indicated by the type annotation, Union[int, str].
  • Inside prepare_data, call the API, which returns a dict.
  • Convert the 'rents' key of the dict into a dataframe, data, with pd.json_normalize.
  • Use .apply on 'carId' to call the mileage API and extract the 'mileage', which is added to data as a column.
  • Add 'person_id' to data, so that it can be used to merge df with s.
  • Convert s, a pd.Series of dataframes, into a single dataframe with pd.concat, and then merge df and s on person_id.
  • Save to a csv with df.to_csv in the desired form.
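
The normalize-and-merge steps above can be tried offline before touching the API. This is a minimal sketch, assuming the sample response from the question as a hard-coded dict; sample_response and the one-row df are stand-ins and not part of the original answer.

```python
import pandas as pd

# Stand-in for the API response shown in the question (only 'rents' matters here).
sample_response = {
    "rents": [
        {"carId": 6638, "price": 1000, "rentStatus": "active"},
        {"carId": 5566, "price": 2000, "rentStatus": "active"},
    ]
}

# Stand-in for the dataframe read from the csv file.
df = pd.DataFrame({"person_id": [1000], "name": ["Joseph"], "flag": [1]})

# One row per rent object, one column per key.
data = pd.json_normalize(sample_response["rents"])
# Tag every rent row with the person it belongs to, so it can be merged.
data["person_id"] = 1000

# The join on person_id repeats the person's columns for each rent.
merged = df.merge(data, on="person_id")
print(merged.to_csv(sep=";", index=False))
```

The mileage column is left out here because it needs the second endpoint; it would be added to data before the merge in the same way as person_id.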

Potential Issues

  • If there's an issue, it's most likely to occur in the call_api function.
  • As long as call_api returns a dict, like the response shown in the question, the remainder of the code will work correctly to produce the desired output.
import pandas as pd
import requests
from typing import Union

def call_api(url: str) -> dict:
    r = requests.get(url)
    return r.json()

def prepare_data(uid: Union[int, str]) -> pd.DataFrame:
    
    d_url = f'http://api.myendpoint.intranet/get-data/{uid}'
    m_url = 'http://api.myendpoint.intranet/get-mileage/'
    
    # get the rent data from the api call
    rents = call_api(d_url)['rents']
    # normalize rents into a dataframe
    data = pd.json_normalize(rents)
    
    # get the mileage data from the api call and add it to data as a column
    data['mileage'] = data.carId.apply(lambda cid: call_api(f'{m_url}{cid}')['mileage'])
    # add person_id as a column to data, which will be used to merge data to df
    data['person_id'] = uid
    
    return data
    

# read data from file
df = pd.read_csv('file.csv', sep=';')

# call prepare_data
s = df.person_id.apply(prepare_data)

# s is a Series of DataFrames, which can be combined with pd.concat
s = pd.concat([v for v in s])

# join df with s, on person_id
df = df.merge(s, on='person_id')

# save to csv
df.to_csv('output.csv', sep=';', index=False)
  • If there are any errors when running this code:
    1. Leave a comment to let me know.
    2. Edit your question and paste the entire traceback, as text, into a code block.
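
Since call_api is the most likely place for a failure, a more defensive variant may help with 3000 rows. The following is only a sketch, not part of the answer's code: the timeout value, the choice to return an empty dict on failure, and the get parameter (a hypothetical hook for trying the function without a network) are all assumptions.

```python
import requests


def call_api(url: str, timeout: float = 10.0, get=requests.get) -> dict:
    """Return the decoded JSON body, or an empty dict if the call fails.

    `get` is a hypothetical hook so the function can be tried offline.
    """
    try:
        r = get(url, timeout=timeout)
        r.raise_for_status()  # turn HTTP 4xx/5xx responses into an exception
        return r.json()
    except (requests.RequestException, ValueError) as exc:
        # ValueError covers a body that is not valid JSON
        print(f"request to {url} failed: {exc}")
        return {}


# Callers then need a default for missing keys, for example:
# rents = call_api(d_url).get('rents', [])
```

With an empty list, pd.json_normalize returns an empty dataframe that has no carId column, so prepare_data would also need to return early in that case.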

Example

# given the following start dataframe
   person_id    name  flag
0       1000  Joseph     1
1        400     Sam     1

# resulting dataframe using the same data for both id 1000 and 400
   person_id    name  flag  carId  price rentStatus  mileage
0       1000  Joseph     1   6638   1000     active   1000.0
1       1000  Joseph     1   5566   2000     active   1000.0
2        400     Sam     1   6638   1000     active   1000.0
3        400     Sam     1   5566   2000     active   1000.0
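
Note that merge defaults to an inner join, so a person whose 'rents' list is empty would not appear in the result at all. If those rows should be kept, how='left' is one option. A sketch with made-up data, in which person 400 is assumed to have no rents (unlike in the example above):

```python
import pandas as pd

df = pd.DataFrame({"person_id": [1000, 400], "name": ["Joseph", "Sam"], "flag": [1, 1]})

# Rent rows for person 1000 only; person 400 is assumed to have none.
s = pd.DataFrame({
    "person_id": [1000, 1000],
    "carId": [6638, 5566],
    "price": [1000, 2000],
    "rentStatus": ["active", "active"],
    "mileage": [1000.0, 550.0],
})

inner = df.merge(s, on="person_id")              # person 400 is dropped
left = df.merge(s, on="person_id", how="left")   # person 400 is kept, rent columns are NaN

print(len(inner), len(left))
```

With the left join, the integer columns carId and price become floats, because NaN requires a float dtype.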
Stephan Schlecht On

There are many different ways to implement this. One of them would be, like you started in your comment:

  • read the CSV file with pandas
  • for each row, take the person_id and build the API call
  • take the rents from the JSON response
  • for each individual rent, extract the carId and call the mileage endpoint
  • collect the results in rows_list
  • convert rows_list into a dataframe and write it out as csv with pandas

A very simple solution without any error handling could look something like this:

from types import SimpleNamespace

import pandas as pd
import requests
import json

path = '/some/path/'
df = pd.read_csv(f'{path}/data.csv', delimiter=';')

rows_list = []
for _, row in df.iterrows():
    rentCall = f'http://api.myendpoint.intranet/get-data/{row.person_id}'
    print(rentCall)
    response = requests.get(rentCall)
    r = json.loads(response.text, object_hook=lambda d: SimpleNamespace(**d))
    for rent in r.rents:
        mileageCall = f'http://api.myendpoint.intranet/get-mileage/{rent.carId}'
        print(mileageCall)
        response2 = requests.get(mileageCall)
        m = json.loads(response2.text, object_hook=lambda d: SimpleNamespace(**d))
        # use the status of each individual rent, as in the desired output
        rows_list.append((row['person_id'], row['name'], row['flag'], rent.carId, rent.price, rent.rentStatus, m.mileage))
df = pd.DataFrame(rows_list, columns=('person_id', 'name', 'flag', 'carId', 'price', 'rentStatus', 'mileage'))
print(df.to_csv(index=False, sep=';'))
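
The object_hook argument is what makes attribute access such as r.rents and rent.carId work: json.loads calls it for every JSON object it decodes, so nested dicts become SimpleNamespace instances as well. A small offline sketch, using a shortened copy of the response from the question:

```python
import json
from types import SimpleNamespace

# Shortened copy of the response example from the question.
text = '''
{
    "active": false,
    "rents": [
        {"carId": 6638, "price": 1000, "rentStatus": "active"},
        {"carId": 5566, "price": 2000, "rentStatus": "active"}
    ]
}
'''

# Every decoded JSON object (dict) is passed through the hook.
r = json.loads(text, object_hook=lambda d: SimpleNamespace(**d))

for rent in r.rents:
    print(rent.carId, rent.price, rent.rentStatus)

# The top-level flag and the per-rent status are separate fields.
print(r.active)
```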
Ivo Merchiers On

Speeding up with multiprocessing

You mention that you have 3000 rows, which means that you'll have to make a lot of API calls. Depending on the connection, every one of these calls might take a while. As a result, performing this in a sequential way might be too slow. The majority of the time, your program will just be waiting on a response from the server without doing anything else. We can improve this performance by using multiprocessing.

I use all the code from Trenton's answer, but replace the following sequential call:

# call prepare_data
s = df.person_id.apply(prepare_data)

With a parallel alternative:

from multiprocessing import Pool

n_processes = 20  # experiment with this to see what works well
with Pool(n_processes) as p:
    # p.map returns a list of dataframes, which pd.concat accepts directly
    s = p.map(prepare_data, df.person_id)

Alternatively, a thread pool might be faster, since the work consists mostly of waiting on network responses, but you'll have to test that. To try it, replace the import with from multiprocessing.pool import ThreadPool as Pool.
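
A thread-based variant can also be written with concurrent.futures from the standard library, which is a swapped-in alternative to the ThreadPool import and not something the answer itself uses. In this sketch, fake_prepare_data is a stand-in that sleeps instead of calling the API, and the ids are made up, so the example runs offline. In the real script the function would be prepare_data and the results would be dataframes passed to pd.concat.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_prepare_data(person_id: int) -> dict:
    """Stand-in for prepare_data: sleeps to imitate waiting on the server."""
    time.sleep(0.05)
    return {"person_id": person_id}


person_ids = [1000, 400, 250, 77]  # made-up ids for illustration

# Threads suit this workload because each call spends its time waiting on I/O.
with ThreadPoolExecutor(max_workers=4) as pool:
    # map yields results in the same order as the inputs
    results = list(pool.map(fake_prepare_data, person_ids))

print(results)
```

With multiprocessing.Pool on platforms that start workers with spawn (Windows, and macOS by default), the pool creation needs to sit under an if __name__ == '__main__': guard. Threads do not have that requirement.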