I'm developing a microservices-based system where one of the microservices is a scheduler responsible for handling a large number of dynamic jobs. These jobs are created, modified, and deleted during runtime via the microservice's API.

My plan is to use Kubernetes CronJobs to manage these jobs. However, I'm concerned about the scalability and performance implications of potentially dealing with thousands of dynamically created CronJobs.

Is using Kubernetes CronJobs a recommended approach for efficiently managing a large number of dynamic jobs within a microservices architecture? If not, what alternative strategies or best practices should I consider for this use case? Any insights or recommendations from the community would be greatly appreciated.

Cosmin Ioniță On BEST ANSWER

I have used CronJobs for many years, but only at low volumes (at most 10 to 20 scheduled jobs per day), and I never had any problems.

If the number of jobs goes up to thousands, I assume that:

  • The CronJob controller should be able to handle this workload, even though it relies on API calls to get cluster-state information. It will eventually become a bottleneck, but I would expect that only at larger job volumes. This has to be tested, though.
  • However, if for some reason a job doesn't get scheduled at the proper time (the controller is down, or the job is slow to start), and this happens more than 100 times, the controller will not schedule that job again, so it effectively becomes a no-op.
  • This is problematic, because your application's functionality then depends heavily on the cluster state and its operational stability.
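To get a feel for the 100-missed-runs threshold mentioned above, here is a rough back-of-the-envelope sketch. It assumes a fixed-interval schedule and an outage that begins right after a successful run; the function names are made up for illustration, and the limit is simply the figure quoted in this answer, so check it against your cluster version.

```python
MISSED_LIMIT = 100  # threshold quoted in the answer above (assumption to verify)


def missed_runs(interval_seconds: int, downtime_seconds: int) -> int:
    """Scheduled start times that fall inside an outage, for a fixed interval."""
    return downtime_seconds // interval_seconds


def downtime_to_exceed_limit(interval_seconds: int, limit: int = MISSED_LIMIT) -> int:
    """Shortest outage (in seconds) after which more than `limit` runs are missed."""
    return (limit + 1) * interval_seconds


if __name__ == "__main__":
    # A job that runs every minute crosses the limit after 101 minutes of outage.
    print(downtime_to_exceed_limit(60) / 60, "minutes")
    # A job that runs hourly needs more than four days of outage.
    print(downtime_to_exceed_limit(3600) / 86400, "days")
```

In other words, the more frequent the schedule, the shorter the outage that can push a job past the threshold.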

Apart from the performance aspects, when you schedule a job you may also need to manage the images (and maybe the env vars for each job), which can quickly turn into a nightmare unless you build a better way of managing the job configuration from your app.
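One way to keep that configuration manageable is to store a small per-job record in your app and generate the CronJob manifest from it. The sketch below is only an illustration: `JobConfig` and `build_cronjob_manifest` are hypothetical names, the image and env values are placeholders, and it only builds the manifest as a plain dict (submitting it to the cluster is left out).

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field


@dataclass
class JobConfig:
    """Hypothetical per-job record kept by the scheduler microservice."""
    name: str
    schedule: str          # standard cron expression, e.g. "*/5 * * * *"
    image: str
    env: dict[str, str] = field(default_factory=dict)


def build_cronjob_manifest(job: JobConfig, namespace: str = "default") -> dict:
    """Render a batch/v1 CronJob manifest from the app-side job record."""
    container = {
        "name": job.name,
        "image": job.image,
        # Sort env vars so the generated manifest is stable between runs.
        "env": [{"name": k, "value": v} for k, v in sorted(job.env.items())],
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": job.name, "namespace": namespace},
        "spec": {
            "schedule": job.schedule,
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [container],
                        }
                    }
                }
            },
        },
    }


if __name__ == "__main__":
    job = JobConfig("report", "*/5 * * * *", "example/report:1.0", {"REGION": "eu"})
    print(json.dumps(build_cronjob_manifest(job), indent=2))
```

With this shape, changing an image or an env var is an update to one record, and the manifest is always derived rather than edited by hand.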

So what I suggest is to create a PoC that validates/invalidates that approach. Schedule a number of jobs that progressively grows from 10 to 5000, with a dummy workload, and you'll see what happens, how big the K8s cluster should be, what issues may come up, etc.
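For the ramp-up itself, something like the following could plan the steps. It is a sketch under my own assumptions: the doubling factor, the `dummy-NNNN` naming, and the function names are arbitrary, and actually creating the CronJobs and measuring the cluster at each step is left to your tooling.

```python
from typing import Iterator, List, Tuple


def ramp_sizes(start: int = 10, end: int = 5000, factor: int = 2) -> List[int]:
    """Total number of CronJobs to have in the cluster at each PoC step."""
    sizes = []
    n = start
    while n < end:
        sizes.append(n)
        n *= factor
    sizes.append(end)  # always finish exactly at the target volume
    return sizes


def jobs_to_add(sizes: List[int]) -> Iterator[Tuple[int, List[str]]]:
    """For each step, yield the target total and the new dummy job names to create."""
    created = 0
    for total in sizes:
        yield total, [f"dummy-{i:04d}" for i in range(created, total)]
        created = total


if __name__ == "__main__":
    for total, names in jobs_to_add(ramp_sizes()):
        print(f"step to {total} jobs: create {len(names)} ({names[0]} .. {names[-1]})")
```

At each step you would record things like API server latency, controller lag, and how late the jobs actually start before moving on to the next size.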

The alternative is to embed the scheduling logic into the app itself, using a scheduling library (lots of libs offer this, depending on your tech stack). The advantage is that you get better observability and better control.
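To illustrate what in-app scheduling with runtime create/modify/delete can look like, here is a minimal sketch using only the standard library. It is not any particular library's API: `InAppScheduler` and its methods are hypothetical, it only supports fixed intervals rather than cron expressions, and it takes the current time as an argument so that it is easy to test.

```python
import heapq
import itertools
from typing import Callable, Dict, List, Tuple


class InAppScheduler:
    """Tiny fixed-interval scheduler with runtime add/update/remove."""

    def __init__(self) -> None:
        # job_id -> (interval, func, version)
        self._jobs: Dict[str, Tuple[float, Callable[[], None], int]] = {}
        # heap entries: (due_time, tie_breaker, job_id, version)
        self._heap: List[Tuple[float, int, str, int]] = []
        self._seq = itertools.count()
        self._version = itertools.count()

    def upsert(self, job_id: str, interval: float, func: Callable[[], None], now: float) -> None:
        """Create a job, or replace its interval/function if it already exists."""
        if interval <= 0:
            raise ValueError("interval must be positive")
        version = next(self._version)
        self._jobs[job_id] = (interval, func, version)
        heapq.heappush(self._heap, (now + interval, next(self._seq), job_id, version))

    def remove(self, job_id: str) -> None:
        """Delete a job; its stale heap entries are skipped lazily."""
        self._jobs.pop(job_id, None)

    def run_due(self, now: float) -> int:
        """Run every job whose due time has passed; return how many ran."""
        ran = 0
        while self._heap and self._heap[0][0] <= now:
            _, _, job_id, version = heapq.heappop(self._heap)
            current = self._jobs.get(job_id)
            if current is None or current[2] != version:
                continue  # job was removed or replaced since this entry was queued
            interval, func, _ = current
            func()
            ran += 1
            # Reschedule relative to `now`, so missed runs are skipped, not replayed.
            heapq.heappush(self._heap, (now + interval, next(self._seq), job_id, version))
        return ran
```

In a real service you would call `run_due(time.monotonic())` from a background loop, persist the job definitions, and handle job failures and multiple replicas; an existing scheduling library for your stack will cover most of that.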