First, happy new year to everybody and happy coding in 2017.
I have 1M "tasks" to run using Python. Each task takes around 2 minutes and processes some local images. I would like to run as many as possible in parallel in an automatic way. My server has 40 cores, so I started to look into multiprocessing, but I see the following issues:
- Keeping a log of each task is not easy (I am working on it, but so far I haven't succeeded even though I found many examples on Stack Overflow).
- How do I know how many CPUs I should use and how many should be left to the server for basic server tasks?
- When there are multiple users on the server, how can we see how many CPUs are already in use?
In my previous life as a physicist at CERN, we were using a job submission system to submit tasks to many clusters. Tasks were put in a queue and processed when a slot was available. Is there such a tool for a Linux server as well? I don't know the correct English name for such a tool (job dispatcher?).
The best would be a tool that we can configure to use our N CPUs as "vehicles" to process tasks in parallel (while reserving enough CPUs so that the server can still run basic tasks), put the jobs of all users in a queue with priorities, and process them as "vehicles" become available. A bonus would be a way to monitor task processing.
I hope I am using the correct words to describe what I want.
Thanks Fabien
What you are talking about is generally referred to as a "Pool of Workers". It can be implemented using threads or processes; the implementation choice depends on your workflow.
A pool of workers allows you to choose the number of workers to use. Furthermore, the pool usually has a queue in front of the workers to decouple them from your main logic.
If you want to run tasks within a single server, you can use either multiprocessing.Pool or a concurrent.futures executor; a minimal sketch is shown below.
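For instance, something along these lines with concurrent.futures (process_image and the image list are placeholders for your own code and data, and leaving two cores free for the rest of the server is just an assumption to adjust):

```python
import logging
import os
from concurrent.futures import ProcessPoolExecutor, as_completed

# Log results per task from the main process to avoid multi-process logging issues.
logging.basicConfig(filename="tasks.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

def process_image(path):
    # ... your ~2-minute image processing goes here (placeholder) ...
    return path

def main():
    # Leave a couple of cores free for basic server tasks (assumption, adjust to taste).
    workers = max(1, os.cpu_count() - 2)
    image_paths = ["img_%06d.png" % i for i in range(1000)]  # placeholder list

    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process_image, p): p for p in image_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                result = future.result()
                logging.info("done: %s", result)
            except Exception:
                logging.exception("failed: %s", path)

if __name__ == "__main__":
    main()
```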
If you want to distribute tasks over a cluster, there are several solutions. Celery and Luigi are good examples.
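As an illustration, a bare-bones Celery setup could look something like this (assuming a Redis broker at the URL shown; the module name tasks.py and process_image are placeholder names):

```python
# tasks.py -- minimal Celery sketch, assuming a Redis broker is running locally.
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def process_image(path):
    # ... your ~2-minute image processing goes here (placeholder) ...
    return path
```

You would then start worker processes with something like `celery -A tasks worker --concurrency=38`, enqueue jobs with `process_image.delay(path)`, and monitor them with a tool such as Flower, which also covers the "monitoring" bonus you mentioned.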
EDIT:
This is not your concern as a user. Modern operating systems do a pretty good job of sharing resources between multiple users. If overcommitting resources becomes a concern, the sysadmin should make sure this does not happen by assigning per-user quotas. This can be done in plenty of ways; one example tool sysadmins should be familiar with is ulimit.
To put it another way: your software should not do what the operating system is for: abstracting the underlying machine to offer your software a "limitless" set of resources. Whoever manages the server should be the person telling you: "use at most X CPUs".
What you were using at CERN was probably a system like Mesos. These solutions aggregate large clusters into a single pool of resources against which you can schedule tasks. This only works if all users access the cluster through it, though.
If you are sharing a server with other people, either you agree together on quotas or you all adopt a common scheduling framework such as Celery.