Recently our team encountered 'out of memory' errors in the cadence-frontend pod:
{"error":"Error 1135: Can't create a new thread (errno 11); if you are not out of available memory, you can consult the manual for a possible OS-dependent bug", "level":"fatal", "logging-call-at":"server.go:281", "msg":"Fail to start frontend service ", "service":"cadence-frontend", "stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Fatal
and 'too many connections' errors:
{"error":"Error 1040: Too many connections", "level":"fatal", "logging-call-at":"server.go:281", "msg":"Fail to start frontend service ", "service":"cadence-frontend", "stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Fatal
Currently, we create a new cadence-client worker every time we start a workflow, so I suspect we are creating too many workers to listen on the task queue. However, I am not sure how workers work, so I cannot confirm that this is the root cause. Here are some questions:
- Will the worker we create for each workflow be killed automatically as soon as the workflow ends (completed or canceled or failed)?
- If not, what is the best way to determine the number of workers we need?
- Is there a way to see how many workers are connected to the cadence cluster?
- Will a worker keep working on one workflow until that workflow ends? I.e., if I have only one worker and a workflow takes a long time to finish because it needs human interaction (like waiting for approvals), will it block other incoming workflows?
Any help would be appreciated!
I am going to change the code to use only one worker to see whether our workflows can still work correctly.
 
                        
In the majority of cases you should create one client/worker per application process and not tie it to the lifecycle of a specific workflow. A single worker instance can process many workflows in parallel. The number of worker processes you need is determined by how many actions (activities, signals, timers, etc.) the system executes per second.
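
To make the "one worker per process" pattern concrete, here is a minimal sketch in plain Go. It does not use the Cadence client API: `Worker`, `Task`, `NewWorker`, `Dispatch`, `Stop` and `Handled` are hypothetical names invented for this illustration, and the channel stands in for a task list. It only shows the shape of the idea: the worker is created once at startup, and tasks from many workflows are drained by the same small pool of pollers, so starting a workflow never creates a new worker.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a hypothetical unit of work (think: one decision or activity task)
// belonging to some workflow. It is not a Cadence type.
type Task struct {
	WorkflowID string
	Step       string
}

// Worker is a toy stand-in for a long-lived worker: it is created once per
// process and drains a shared task list with a fixed number of goroutines.
type Worker struct {
	taskList chan Task
	wg       sync.WaitGroup

	mu      sync.Mutex
	handled map[string]int // tasks processed per workflow ID
}

// NewWorker builds a worker and starts the given number of pollers.
func NewWorker(pollers int) *Worker {
	w := &Worker{
		taskList: make(chan Task),
		handled:  make(map[string]int),
	}
	for i := 0; i < pollers; i++ {
		w.wg.Add(1)
		go w.poll()
	}
	return w
}

// poll handles tasks from any workflow until the task list is closed.
func (w *Worker) poll() {
	defer w.wg.Done()
	for t := range w.taskList {
		w.mu.Lock()
		w.handled[t.WorkflowID]++
		w.mu.Unlock()
	}
}

// Dispatch queues one task; it never creates a new worker.
func (w *Worker) Dispatch(t Task) { w.taskList <- t }

// Stop closes the task list and waits for all pollers to exit.
func (w *Worker) Stop() {
	close(w.taskList)
	w.wg.Wait()
}

// Handled returns a copy of the per-workflow task counts.
func (w *Worker) Handled() map[string]int {
	w.mu.Lock()
	defer w.mu.Unlock()
	out := make(map[string]int, len(w.handled))
	for k, v := range w.handled {
		out[k] = v
	}
	return out
}

func main() {
	w := NewWorker(4) // one worker for the whole process

	// 100 workflows share that single worker.
	for i := 0; i < 100; i++ {
		id := fmt.Sprintf("wf-%d", i)
		w.Dispatch(Task{WorkflowID: id, Step: "start"})
	}
	w.Stop()

	fmt.Println("workflows handled:", len(w.Handled())) // prints "workflows handled: 100"
}
```

In this toy model the number of pollers is fixed up front and independent of how many workflows exist, which mirrors the advice above: size the worker pool by task throughput, not by workflow count.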