I am trying to build a pandas DataFrame as follows. For each observation (patent) i:
- use an API to retrieve information on the transfers of patent i;
- convert the API response object to a pandas DataFrame (each one typically containing 2 to 10 entries);
- loop over the k entries of that DataFrame and store each one in a list. If one of the three fields I want is not available, replace it with np.nan (or '' in the case of a string) and fill the others where possible.
Below I present an MWE with 100 observations. The code works, but it takes about 40 seconds for 100 observations, which scales to roughly 10 days for my full dataset of 2,295,531 observations, so the whole process takes far too long.
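To make the missing-field fallback in the last step concrete, here is a minimal sketch on toy data. The function name, the toy rows, and the exact missing-value handling are mine for illustration; the real API response has more columns than this:

```python
import numpy as np
import pandas as pd

def extract_rows(patent_number, assignments_df):
    # Flatten each row into [patent, date, id, assignee],
    # substituting np.nan / '' when a field is missing
    results = []
    for _, row in assignments_df.iterrows():
        date = row.get("transaction_date")
        if pd.isna(date):  # column absent or value NaN -> fall back
            date = row.get("recorded_date", np.nan)
        trans_id = row.get("id", np.nan)
        assignees = row.get("assignees")
        assignee = (assignees[0].get("name", "")
                    if isinstance(assignees, list) and assignees else "")
        results.append([patent_number, date, trans_id, assignee])
    return results

# toy frame: first row complete, second row missing most fields
toy = pd.DataFrame([
    {"transaction_date": "2010-01-01", "id": "A1",
     "assignees": [{"name": "Acme Corp"}]},
    {"recorded_date": "2011-02-02"},
])
print(extract_rows(12345, toy))
```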
from patent_client import Inpadoc, Assignment, USApplication
import concurrent.futures
import time

import numpy as np
import pandas as pd

now = time.time()

def fetch_data(patent_number):
    # Query the assignment API for this patent
    assignments = Assignment.objects.filter(patent_number=patent_number)
    # Convert the result set to a DataFrame
    assignments_df = assignments.to_pandas()
    results = []
    for idx, row in assignments_df.iterrows():
        try:
            trans_date = row["transaction_date"]
        except KeyError:
            try:
                trans_date = row["recorded_date"]
            except KeyError:
                trans_date = np.nan
        try:
            trans_id = row["id"]
        except KeyError:
            trans_id = np.nan
        try:
            assignee = row["assignees"][0]['name']
        except (KeyError, IndexError, TypeError):
            assignee = ''
        results.append([patent_number, trans_date, trans_id, assignee])
    return results
# List to hold results
final_results = []
patent_numbers = [
    7893226, 7766814, 8956291, 7752988, 8105333, 8376607, 7676618, 7783807,
    8819739, 8667410, 9029637, 7918965, 8760379, 8228379, 8288135, 7833115,
    7861321, 7836521, 7841023, 7805770, 7814570, 7854026, 8122522, 8122526,
    7879447, 8250682, 7810179, 8893321, 8225434, 7788746, 7757315, 7694807,
    8239990, 8065771, 7904987, 7823245, 7707677, 8069530, 8240003, 7866003,
    7934621, 7827655, 8037574, 7788771, 7836565, 8316516, 7958607, 7748924,
    8001666, 8006362, 7761976, 8161633, 7726012, 8960526, 8037597, 7800916,
    7877869, 8458903, 8053081, 8020876, 8453335, 9003676, 7941938, 8006410,
    7673399, 8061062, 7836609, 7726051, 7797865, 9564070, 7950173, 7774964,
    7716864, 7987631, 8049048, 7846323, 8100988, 8276315, 8110287, 7919723,
    8038960, 7815700, 8177868, 8999019, 7699144, 8082694, 8128134, 8201708,
    7752815, 7825333, 7854248, 7908817, 8001743, 7905066, 7703249, 7669374,
    7958688, 8978335, 7716820, 8161707,
]
# Use a ThreadPoolExecutor to parallelize the API calls
with concurrent.futures.ThreadPoolExecutor() as executor:
    for result in executor.map(fetch_data, patent_numbers):
        final_results.extend(result)

# Convert the final results to a DataFrame
df = pd.DataFrame(final_results, columns=['patent', 'trans_date', 'trans_id', 'assignee'])

after = time.time()
print(after - now)
Is there a way to speed this up? I suspect the main bottleneck is the slow API response, but the rest of the code may also be poorly written. Can someone help me out?