I need to speed up processing for tasks that involve rather large data sets, loaded from pickle files of CSV data that can reach 1.5 GB. I started with Python's multiprocessing, but because my class objects are not picklable I had to switch to pathos. I got the parallel code running and it reproduces the results of the usual serial runs, so far so good. However, processing speed is far from useful: the central main process runs at full throttle while the actual sub-processes (sometimes hundreds in total) execute one after another rather than in parallel, separated by huge gaps of time during which the main process is again fully busy, doing what? It worked 'best' with ProcessPool, and apipe and amap behaved alike, with no noticeable difference.
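As a side note, the reason pathos helps here is that it serializes with dill instead of the stdlib pickle; a minimal illustration (the lambda is just a stand-in for my unpicklable class objects, not my actual code):

import dill
# import pickle

f = lambda x: x + 1        # stand-in for an object the stdlib pickle cannot handle
# pickle.dumps(f)          # fails with the stdlib pickle
payload = dill.dumps(f)    # dill serializes it without complaint
g = dill.loads(payload)
print(g(1))                # -> 2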
Below is my code excerpt: first the parallel attempt, then the serial version. Both give the same result, but the parallel approach is much slower. Notably, each parallel sub-process takes roughly the same amount of time per item as the serial loop does. All variables are preloaded much earlier in a long processing pipeline.
#### PARALLEL multiprocessing
if me.MP:
    import pathos
    cpuN = pathos.multiprocessing.cpu_count() - 1
    pool = pathos.pools.ProcessPool(cpuN)   # ThreadPool ParallelPool
    argsIn1 = []   # a mid-large complex dictionary
    argsIn2 = []   # a very large complex dictionary (CSV pickle of 400MB or more)
    argsIn3 = []   # a list of strings
    argsIn4 = []   # a brief string
    for Uscr in UID_lst:
        argsIn1.append(me)
        argsIn2.append(Db)
        argsIn3.append(UID_lst)
        argsIn4.append(Uscr)
    result_pool = pool.amap(functionX, argsIn1, argsIn2, argsIn3, argsIn4)
    results = result_pool.get()
    for result in results:
        meTU, U = result
        me.v[T][U] = meTU[U]   # insert result!

#### SERIAL processing
else:
    for Uscr in UID_lst:
        meTU, U = functionX(me, Db, UID_lst, Uscr)
        me.v[T][U] = meTU[U]   # insert result!
I tested this code on two Linux machines: an i3 CPU (32 GB RAM, Slackware 14.2, Python 3.7) and a 2*Xeon box (64 GB RAM, Slackware current, Python 3.8). pathos 0.2.6 was installed with pip3. As said, both machines show the same speed problem with the code shown here.
What am I missing here?
ADDENDUM: it appears that only the very first PID does the entire job, working through all items in UID_lst, while the other 10 sub-processes sit idle, waiting for nothing, as seen with top and with os.getpid(). cpuN is 11 in this example.
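Roughly how I checked the PIDs, a minimal diagnostic sketch with a print added at the top of functionX:

import os

def functionX(me, Db, UID_lst, Uscr):
    print(f"worker PID {os.getpid()} handling {Uscr}", flush=True)   # diagnostic only
    # ... the actual work ...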
ADDENDUM 2: sorry about this new revision, but running this code under a heavier load (many more jobs to solve) finally got more than one sub-process busy, although only after a long time! Here is a top output:
top - 14:09:28 up 19 days, 4:04, 3 users, load average: 6.75, 6.20, 5.08
Tasks: 243 total, 6 running, 236 sleeping, 0 stopped, 1 zombie
%Cpu(s): 48.8 us, 1.2 sy, 0.0 ni, 49.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 64061.6 total, 2873.6 free, 33490.9 used, 27697.1 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 29752.0 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5441 userx 20 0 6597672 4.9g 63372 S 100.3 7.9 40:29.08 python3 do_Db_job
5475 userx 20 0 6252176 4.7g 8828 R 100.0 7.5 9:24.46 python3 do_Db_job
5473 userx 20 0 6260616 4.7g 8828 R 100.0 7.6 17:02.44 python3 do_Db_job
5476 userx 20 0 6252432 4.7g 8828 R 100.0 7.5 5:37.52 python3 do_Db_job
5477 userx 20 0 6252432 4.7g 8812 R 100.0 7.5 1:48.18 python3 do_Db_job
5474 userx 20 0 6253008 4.7g 8828 R 99.7 7.5 13:13.38 python3 do_Db_job
1353 userx 20 0 9412 4128 3376 S 0.0 0.0 0:59.63 sshd: userx@pts/0
1354 userx 20 0 7960 4692 3360 S 0.0 0.0 0:00.20 -bash
1369 userx 20 0 9780 4212 3040 S 0.0 0.0 31:16.80 sshd: userx@pts/1
1370 userx 20 0 7940 4632 3324 S 0.0 0.0 0:00.16 -bash
4545 userx 20 0 5016 3364 2296 R 0.0 0.0 3:01.76 top
5437 userx 20 0 19920 13280 6544 S 0.0 0.0 0:00.07 python3
5467 userx 20 0 0 0 0 Z 0.0 0.0 0:00.00 [git] <defunct>
5468 userx 20 0 3911460 2.5g 9148 S 0.0 4.0 17:48.90 python3 do_Db_job
5469 userx 20 0 3904568 2.5g 9148 S 0.0 4.0 16:13.19 python3 do_Db_job
5470 userx 20 0 3905408 2.5g 9148 S 0.0 4.0 16:34.32 python3 do_Db_job
5471 userx 20 0 3865764 2.4g 9148 S 0.0 3.9 18:35.39 python3 do_Db_job
5472 userx 20 0 3872140 2.5g 9148 S 0.0 3.9 20:43.44 python3 do_Db_job
5478 userx 20 0 3844492 2.4g 4252 S 0.0 3.9 0:00.00 python3 do_Db_job
27052 userx 20 0 9412 3784 3052 S 0.0 0.0 0:00.02 sshd: userx@pts/2
27054 userx 20 0 7932 4520 3224 S 0.0 0.0 0:00.01 -bash
It appears to me that at most 6 sub-processes run at any given time, which may correspond to psutil.cpu_count(logical=False) = 6 rather than pathos.multiprocessing.cpu_count() = 12...?
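For what it's worth, the two counts can be compared directly; psutil reports physical cores while pathos counts logical ones (the values in the comments are from this Xeon box):

import psutil
import pathos

print(psutil.cpu_count(logical=False))       # 6  physical cores
print(pathos.multiprocessing.cpu_count())    # 12 logical cores (hyper-threading)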
Actually, the problem got solved, and it turns out it never really was one at this stage of development of my code. The problem was elsewhere: the variables the worker processes are supplied with are very large, many gigabytes at times. This keeps the main/central process busy forever with dilling/undilling the variables (dill/undill being pathos' counterpart of pickle/unpickle), even on the new dual-Xeon machine, not to speak of the old i3 box. With the former I saw up to 6 or 7 workers active (out of 11), while the latter never got beyond 1 active worker, and even on the former machine it took a huge amount of time, dozens of minutes, before a few workers accumulated in top.
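A rough way to gauge that overhead, just a sketch that times how long the main process needs to dill the arguments of a single task (size and time obviously depend on the data):

import time
import dill

t0 = time.time()
payload = dill.dumps((me, Db, UID_lst, Uscr))   # what gets shipped to a worker per task
print(f"{len(payload) / 1e6:.0f} MB serialized in {time.time() - t0:.1f} s")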
So I will need to adjust the code so that each worker re-reads the huge variables from disk/network itself. That also takes time, but it makes sense to free the central process from this silly repeated task and instead give it a chance to do the job it was designed for: scheduling and organizing the workers' show.
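A minimal sketch of that adjustment, assuming the big dictionary is available as a pickle file on disk; db_path and functionX_by_path are hypothetical names, not from my current code:

import pickle

def functionX_by_path(me, db_path, UID_lst, Uscr):
    # each worker re-reads the huge variable itself instead of receiving it
    # dilled from the main process
    with open(db_path, 'rb') as fh:
        Db = pickle.load(fh)
    return functionX(me, Db, UID_lst, Uscr)

result_pool = pool.amap(functionX_by_path,
                        [me] * len(UID_lst),
                        [db_path] * len(UID_lst),
                        [UID_lst] * len(UID_lst),
                        UID_lst)

Loading once per worker instead of once per task would cut the repeated reads further, e.g. by caching Db in a module-level variable inside the worker, but the sketch keeps it simple.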
I am also happy to report that the resulting output, a CSV file (wc: 36722 1133356 90870757), is identical for the parallel run and the traditional serial version.
Having said this, I am truly surprised how convenient it is to use Python/pathos, and that no changes to the relevant worker code were needed between the serial and parallel runs!