I've been playing with SCDF for a couple of days, checking different scenarios, as we may want to use it in one of our projects.
My setup:
- helm chart https://hub.helm.sh/charts/stable/spring-cloud-data-flow/2.7.1
- k8s version 1.16 (AWS EKS eks.1)
- external RabbitMQ and PostgreSQL; the rest of the config is the default from the vanilla SCDF starter articles.
One of the scenarios I'm checking is how SCDF handles app failures.
One version of the app deliberately fails its health check, constantly returning 503 and status DOWN. The app is a Spring Boot app. When the stream is updated to use the broken version of the app, the previous version is killed right after the update. The UI also shows the new version as "deploying". After some time passes and a number of probes fail, the pod restarts. If you wait for a couple of restarts, you can see that the app's status eventually changes to "failed". However, with each new restart it resets to "deployed" and then moves back to "failed".
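For context, the deliberately broken version behaves roughly like this minimal sketch (plain JDK `HttpServer` instead of the real Spring Boot app; the class name and port are made up for illustration): the health endpoint always answers 503/DOWN while the info endpoint stays OK.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;

import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;

public class BrokenHealthApp {

    // Build a server whose /actuator/health always reports DOWN (HTTP 503)
    // while /actuator/info stays OK, mimicking the broken app version.
    public static HttpServer createServer(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/actuator/health",
                exchange -> respond(exchange, 503, "{\"status\":\"DOWN\"}"));
        server.createContext("/actuator/info",
                exchange -> respond(exchange, 200, "{}"));
        return server;
    }

    private static void respond(HttpExchange exchange, int code, String json)
            throws IOException {
        byte[] body = json.getBytes();
        exchange.getResponseHeaders().add("Content-Type", "application/json");
        exchange.sendResponseHeaders(code, body.length);
        try (OutputStream os = exchange.getResponseBody()) {
            os.write(body);
        }
    }

    public static void main(String[] args) throws Exception {
        createServer(8080).start(); // port 8080 is an arbitrary choice
    }
}
```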
The observed behavior doesn't align with what the documentation says about Skipper's "red/black" deployment strategy:
> Skipper has a simple 'red/black' upgrade strategy. It deploys the new version of the applications, using as many instances as the currently running version, and checks the /health endpoint of the application. If the health of the new application is good, then the previous application is undeployed. If the health of the new application is bad, then all new applications are undeployed and the upgrade is considered to be not successful.
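For illustration, the documented strategy boils down to something like this sketch (hypothetical Java written from the quoted description; it is not Skipper's actual implementation and none of these names come from its source):

```java
public class RedBlackSketch {

    public interface App {
        boolean isHealthy();
        void undeploy();
    }

    // Sketch of the documented red/black upgrade: the new version runs
    // alongside the old one; only a healthy new version causes the old
    // one to be undeployed, otherwise the new one is rolled back.
    public static boolean upgrade(App oldVersion, App newVersion) {
        if (newVersion.isHealthy()) {
            oldVersion.undeploy(); // success: retire the previous version
            return true;
        }
        newVersion.undeploy();     // failure: keep the previous version
        return false;
    }
}
```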
What's wrong
My expectation is that Skipper should preserve the existing version of the app until it's sure the new version is healthy. Right now it looks like it kills the healthy version and deploys the broken one, leaving it spinning in a crash loop.
Steps to reproduce the issue
- a stream called my-pipeline is deployed (v48)
- one of its apps, called my-app, is deployed with v48
- the stream is updated, changing the version of my-app to the broken v49
- my-app-v48 is killed
- my-app-v49 is started but runs in a crash loop, constantly failing health checks and restarting
I checked Skipper's logs and can see the following lines:
2020-07-23 12:32:27.785 INFO 1 --- [eTaskExecutor-4] o.s.c.s.server.deployer.ReleaseAnalyzer : Existing Package and Upgrade package both have no top level templates
2020-07-23 12:32:27.786 INFO 1 --- [eTaskExecutor-4] o.s.c.s.server.deployer.ReleaseAnalyzer : Differences detected between existing and replacing application manifests.Upgrading applications = [my-app]
2020-07-23 12:32:27.944 INFO 1 --- [eTaskExecutor-4] o.s.c.s.s.d.s.HandleHealthCheckStep : Release my-pipeline-v49 has been DEPLOYED
2020-07-23 12:32:27.944 INFO 1 --- [eTaskExecutor-4] o.s.c.s.s.d.s.HandleHealthCheckStep : Apps in release my-pipeline-v49 are healthy.
2020-07-23 12:32:27.954 INFO 1 --- [eTaskExecutor-4] o.s.c.s.s.d.s.HandleHealthCheckStep : Deleting changed applications from existing release my-pipeline-v48
It looks like HandleHealthCheckStep#handleHealthCheck is called with healthy set to true. I guess that happens because "deploying" or "deployed" (the first statuses of the app on restart) are treated as healthy.
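To make that guess concrete, the decision would behave like the sketch below (purely my hypothetical reconstruction from the log output, not Skipper's actual source; only HandleHealthCheckStep is a real class name):

```java
public class SuspectedHealthCheck {

    public enum AppStatus { DEPLOYING, DEPLOYED, FAILED }

    // Hypothesis: any status that is not (yet) FAILED counts as healthy,
    // so the transient "deploying"/"deployed" states right after the
    // update are enough to pass the check and delete the old release.
    public static boolean isHealthy(AppStatus status) {
        return status != AppStatus.FAILED;
    }
}
```

Under this reading, the new pods report "deploying" immediately, the check passes, and my-pipeline-v48 is deleted before the liveness probes have had a chance to fail.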
Let me know if I need to provide more details.
Update: how the statuses look
- The liveness probe is at /actuator/health, which fails; the readiness probe is at /actuator/info, which is OK.
k8s status
the pod's status is "0/1 Running"; after the readiness probe's delay it goes to "1/1 Running". After the liveness probes fail, the pod is restarted and goes back to "0/1 Running" with an increased restart count.
scdf status
before the first 3 restarts: the app's status is "deploying", and the same for the stream. After the readiness delay it goes to "deployed", and the same for the stream.
after the first 3 restarts: the app's status is "failed"; for the stream it's "partial". After the readiness delay it goes back to "deployed", and the same for the stream.
- I also just tried pointing both the liveness and readiness probes at the failing /actuator/health.
k8s status
the pod's status is "0/1 Running", with an increasing restart count
scdf status
before the first 3 restarts: the app's status is "deploying", and the same for the stream.
after the first 3 restarts: the app's status is "failed"; for the stream it's "partial".
However, in both scenarios the existing version of the app is killed right after the stream upgrade.
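For reference, assuming the probe paths were set via SCDF's Kubernetes deployer properties (property names as I understand the deployer docs; the app name is from this report), the first scenario's deployment properties would look something like:

```properties
# Assumption: probe paths configured through the Kubernetes deployer
# properties when deploying/updating the stream
deployer.my-app.kubernetes.livenessProbePath=/actuator/health
deployer.my-app.kubernetes.readinessProbePath=/actuator/info
```

For the second scenario, both properties would point at the failing /actuator/health.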