I am trying to parallelize a simple Python application. It consists of loading a large amount of data from an on-disk format, processing it, and saving information extracted from it to CSV with the help of pandas. Here is symbolic code for it:
def read(time):
    # reading it from an I/O lib format
    ...

def treat(time):
    # performing scipy operations
    ...

def write(time):
    # write it thanks to pandas
    df.to_csv(f'partial_{time}_data.csv')

def thread(time):
    read(time)
    treat(time)
    write(time)

if __name__ == "__main__":
    schedule = [ list of times to load, treat and write ]

    # single-thread version
    for time in schedule:
        thread(time)

    # pooling version
    from tqdm.contrib.concurrent import process_map
    process_map(thread, schedule, max_workers=16)
IIUC process_map from tqdm uses concurrent.futures.ProcessPoolExecutor under the hood.
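For reference, my understanding is that the process_map call above is roughly equivalent to the following sketch (here thread is a trivial stand-in, and I've left out the tqdm progress bar that process_map wraps around the result iterator):

```python
from concurrent.futures import ProcessPoolExecutor

def thread(time):
    # stand-in for read/treat/write; just squares the input here
    return time * time

if __name__ == "__main__":
    schedule = [0, 1, 2, 3]
    # process_map(thread, schedule, max_workers=N) is roughly this,
    # plus a tqdm progress bar over the results as they come back
    with ProcessPoolExecutor(max_workers=4) as executor:
        # executor.map dispatches one task per schedule entry and
        # yields results in input order
        results = list(executor.map(thread, schedule))
    print(results)  # [0, 1, 4, 9]
```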
I was surprised to get only about a 2x speedup in observed execution time when running it on an 8-core Intel processor. Is there a more clever way to leverage multiprocessing resources here?
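To narrow down where the time actually goes (CPU-bound scipy work parallelizes well across processes, but disk I/O in read/write may not), I could time each stage separately before tuning the pool. A sketch, where read, treat and write are placeholders for the real functions:

```python
import time as clock

def read(t):
    ...  # placeholder for the real I/O-lib read

def treat(t):
    ...  # placeholder for the real scipy processing

def write(t):
    ...  # placeholder for the real pandas to_csv

def timed_thread(t):
    # run each stage and record its wall-clock duration
    timings = {}
    for name, stage in (("read", read), ("treat", treat), ("write", write)):
        start = clock.perf_counter()
        stage(t)
        timings[name] = clock.perf_counter() - start
    return timings
```

If read plus write dominate, the 2x ceiling would point at disk bandwidth rather than core count.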
Thanks for the help.