I am trying to build a pandas DataFrame as follows. For each observation (patent) i:
- use an API to retrieve information on the transfers of patent i;
- convert the API response object to a pandas DataFrame (each one typically containing 2 to 10 entries);
- loop over the k entries of that DataFrame and store each one in a list. If one of the three fields I want is not available, replace it with np.nan (or '' in the case of a string) and fill the others where possible.
Below I present an MWE with 100 observations. The code works, but it takes about 40 seconds for 100 observations, which scales to roughly 10 days for my full dataset of 2,295,531 observations, so the whole process takes far too long.
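To make the missing-field fallback in the last step concrete, here is a minimal sketch on toy data. The function name, the toy rows, and the exact missing-value handling are mine for illustration; the real API response has more columns than this:

```python
import numpy as np
import pandas as pd

def extract_rows(patent_number, assignments_df):
    # Flatten each row into [patent, date, id, assignee],
    # substituting np.nan / '' when a field is missing
    results = []
    for _, row in assignments_df.iterrows():
        date = row.get("transaction_date")
        if pd.isna(date):  # column absent or value NaN -> fall back
            date = row.get("recorded_date", np.nan)
        trans_id = row.get("id", np.nan)
        assignees = row.get("assignees")
        assignee = (assignees[0].get("name", "")
                    if isinstance(assignees, list) and assignees else "")
        results.append([patent_number, date, trans_id, assignee])
    return results

# toy frame: first row complete, second row missing most fields
toy = pd.DataFrame([
    {"transaction_date": "2010-01-01", "id": "A1",
     "assignees": [{"name": "Acme Corp"}]},
    {"recorded_date": "2011-02-02"},
])
print(extract_rows(12345, toy))
```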
from patent_client import Inpadoc, Assignment, USApplication
import concurrent.futures
import time

import numpy as np
import pandas as pd

now = time.time()

def fetch_data(patent_number):
    # Query the assignment API for this patent
    assignments = Assignment.objects.filter(patent_number=patent_number)
    # Convert the result set to a DataFrame
    assignments_df = assignments.to_pandas()
    results = []
    for idx, row in assignments_df.iterrows():
        try:
            trans_date = row["transaction_date"]
        except KeyError:
            try:
                trans_date = row["recorded_date"]
            except KeyError:
                trans_date = np.nan
        try:
            trans_id = row["id"]
        except KeyError:
            trans_id = np.nan
        try:
            assignee = row["assignees"][0]['name']
        except (KeyError, IndexError, TypeError):
            assignee = ''
        results.append([patent_number, trans_date, trans_id, assignee])
    return results
# List to hold results
final_results = []
patent_numbers = [
    7893226, 7766814, 8956291, 7752988, 8105333, 8376607, 7676618, 7783807,
    8819739, 8667410, 9029637, 7918965, 8760379, 8228379, 8288135, 7833115,
    7861321, 7836521, 7841023, 7805770, 7814570, 7854026, 8122522, 8122526,
    7879447, 8250682, 7810179, 8893321, 8225434, 7788746, 7757315, 7694807,
    8239990, 8065771, 7904987, 7823245, 7707677, 8069530, 8240003, 7866003,
    7934621, 7827655, 8037574, 7788771, 7836565, 8316516, 7958607, 7748924,
    8001666, 8006362, 7761976, 8161633, 7726012, 8960526, 8037597, 7800916,
    7877869, 8458903, 8053081, 8020876, 8453335, 9003676, 7941938, 8006410,
    7673399, 8061062, 7836609, 7726051, 7797865, 9564070, 7950173, 7774964,
    7716864, 7987631, 8049048, 7846323, 8100988, 8276315, 8110287, 7919723,
    8038960, 7815700, 8177868, 8999019, 7699144, 8082694, 8128134, 8201708,
    7752815, 7825333, 7854248, 7908817, 8001743, 7905066, 7703249, 7669374,
    7958688, 8978335, 7716820, 8161707,
]
# Use a ThreadPoolExecutor to parallelize the API calls
with concurrent.futures.ThreadPoolExecutor() as executor:
    for result in executor.map(fetch_data, patent_numbers):
        final_results.extend(result)

# Convert the final results to a DataFrame
df = pd.DataFrame(final_results, columns=['patent', 'trans_date', 'trans_id', 'assignee'])

after = time.time()
print(after - now)
Is there a way to speed this up? I suspect the main bottleneck is the slow API response, but the rest of the code may also be poorly written. Can someone help me out?