I'm trying to optimize the runtime of an embarrassingly parallel code. I'm hoping there is an existing tool that can do this and I just didn't find it in my searches.
Info...
- The code is embarrassingly parallel
- There are 100 runs overall, and I would like to be able to run 20 instances concurrently
- Runtime and RAM requirements scale ~linearly from job 1 (5 hours, 20 GB) to job 100 (10 hours, 30 GB)
- Individual instances of the code use threaded GEMM calls (currently Intel MKL, 2 threads per job)
- The computer has 2 NUMA nodes (dual-socket system)
- I'm restricted on RAM (I can't run all of the higher-memory jobs at the same time)
- Currently some jobs cross over between NUMA nodes and slow everything to a crawl; the sketch after this list shows the kind of pinned launch I have in mind to stop that
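
For reference, here's the kind of pinned launch I'm picturing for a single run, assuming `numactl` is available (`./my_job 1` is a placeholder for one of my runs):

```python
import os
import subprocess

# 2 GEMM threads per job, matching my current MKL setting.
env = dict(os.environ, MKL_NUM_THREADS="2")

# --cpunodebind/--membind keep both the threads and the memory
# allocations on node 0, so GEMM never has to reach across the
# socket interconnect for its data.
subprocess.run(
    ["numactl", "--cpunodebind=0", "--membind=0", "./my_job", "1"],
    env=env, check=True,
)
```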
How can I optimally schedule these jobs in terms of time, RAM, and NUMA placement? Is there a script or scheduling system that already handles this, or would I need to roll my own?
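
In case the answer is "roll your own": below is a rough sketch of the loop I'm imagining, in Python. Everything in it is an assumption about my setup: `./my_job <id>` stands in for one run, `RAM_PER_NODE_GB` is a placeholder per-socket budget, and the linear RAM model comes from the scaling above. It interleaves cheap and expensive runs so the total RAM draw stays roughly flat, and only launches a job when one NUMA node has budget for it:

```python
import os
import subprocess
import time

TOTAL_RUNS = 100
MAX_CONCURRENT = 20
RAM_PER_NODE_GB = 160.0  # placeholder: free RAM per socket on my machine

def ram_needed(run_id):
    # Linear ramp from my measurements: run 1 -> 20 GB, run 100 -> 30 GB.
    return 20.0 + 10.0 * (run_id - 1) / (TOTAL_RUNS - 1)

# Interleave cheap and expensive runs (1, 100, 2, 99, ...) so the
# instantaneous RAM usage stays roughly flat across the whole batch.
order, lo, hi = [], 1, TOTAL_RUNS
while lo <= hi:
    order.append(lo)
    if lo != hi:
        order.append(hi)
    lo, hi = lo + 1, hi - 1

env = dict(os.environ, MKL_NUM_THREADS="2")   # 2 GEMM threads per job
free_gb = [RAM_PER_NODE_GB, RAM_PER_NODE_GB]  # per-NUMA-node RAM budgets
running = []  # (Popen handle, node index, GB reserved)
pending = order[:]

while pending or running:
    # Reap finished jobs and return their RAM to the node they ran on.
    alive = []
    for proc, node, gb in running:
        if proc.poll() is None:
            alive.append((proc, node, gb))
        else:
            free_gb[node] += gb
    running = alive

    # Launch jobs while both the RAM budget and concurrency cap allow.
    while pending and len(running) < MAX_CONCURRENT:
        gb = ram_needed(pending[0])
        node = 0 if free_gb[0] >= free_gb[1] else 1  # emptier node first
        if free_gb[node] < gb:
            break  # head-of-line job doesn't fit yet; wait for a reap
        run_id = pending.pop(0)
        free_gb[node] -= gb
        cmd = ["numactl", f"--cpunodebind={node}", f"--membind={node}",
               "./my_job", str(run_id)]
        running.append((subprocess.Popen(cmd, env=env), node, gb))

    time.sleep(5)
```

One weakness I can already see in this sketch: the head-of-line check means one big run can stall the queue even when smaller runs would still fit, which is part of why I'm hoping an existing scheduler handles this more gracefully.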
I have more questions about the implementation details, but listing them all here would just make this more confusing.