I'm running flask restplus api on google container engine with TCP Load Balancer. The flask restplus api makes calls to google cloud datastore or cloud sql but this does not seem to be the problem.
A few times a day or even more, there is a moment of latency spikes. Restarting the pod solves this or it solves itself in a 5 to 10 minute period. Of course this is too much and needs to be resolved.
Anyone knows what could be the problem or has experience with these kind of issues?
Thx
One thing you could try is monitoring your instance CPU load.
Although the latency doesn't correspond with usage spikes, it may be the case that there is a cumulative effect on CPU load and the latency you're experiencing occurs when the CPU reaches a given % and needs to back off temporarily. If this is the case you could make use of cluster autoscaling, or try running a higher spec machine to see if that makes any difference. Or, if you have limited CPU use on pods/containers, try increasing this limit.
If you're confident CPU isn’t the cause of the issue, you could try to SSH into the affected instance when the issue is occurring, send a request through the load balancer and use tcpdump to analyse the traffic coming in and out. You may be able to spot if the latency stems from the load balancer (by monitoring the latency of HTTP traffic to the instance), or to Cloud Datastore or Cloud SQL (from the instance).
Alternatively, try using strace to monitor the relevant processes both before and during the latency, or dtrace to monitor the system as a whole.