Spring Cloud Data Flow treats an app with an intentionally failing health check as healthy; Skipper's red/black deployment is broken


I've been playing with SCDF for a couple of days, checking different scenarios, as we may want to use it in one of our projects.

I use SCDF 2.5.1.RELEASE on Kubernetes.

One of the scenarios I'm checking is how SCDF handles app failures.

One of the app's versions intentionally fails its health check, constantly returning 503 and status DOWN. The app is a Spring Boot app. When the stream is updated and the broken version of the app is used, the previous version is killed right after the update. The UI also shows the new version as "deploying". After some time passes and a number of probes fail, the pod restarts. If you wait for a couple of restarts you can see that the app's status eventually changes to "failed". However, with each new restart it resets to "deployed" and then moves back to "failed".
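
For reference, a health check that constantly fails can be produced with a custom HealthIndicator along these lines (a minimal sketch of the idea, not the exact code from my app; the class name is made up):

    import org.springframework.boot.actuate.health.Health;
    import org.springframework.boot.actuate.health.HealthIndicator;
    import org.springframework.stereotype.Component;

    // Always reports DOWN, so GET /actuator/health returns 503 and any
    // Kubernetes probe pointed at that endpoint keeps failing.
    @Component
    public class AlwaysDownHealthIndicator implements HealthIndicator {

        @Override
        public Health health() {
            return Health.down()
                    .withDetail("reason", "intentionally broken for the red/black test")
                    .build();
        }
    }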

The observed behavior doesn't align with what the docs say about Skipper's "red/black" upgrade strategy:

https://docs.spring.io/spring-cloud-dataflow/docs/2.5.1.RELEASE/reference/htmlsingle/#_skipper_s_upgrade_strategy

Skipper has a simple 'red/black' upgrade strategy. It deploys the new version of the applications, using as many instances as the currently running version, and checks the /health endpoint of the application. If the health of the new application is good, then the previous application is undeployed. If the health of the new application is bad, then all new applications are undeployed and the upgrade is considered to be not successful.

What's wrong

My expectation is that Skipper should preserve the existing version of the app until it's sure the new version is healthy. Instead, it looks like it kills the healthy version and deploys the broken one, leaving it spinning in a crash loop.
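
To make that expectation concrete, this is roughly the behavior I was hoping for (illustration only, not a claim about how Skipper is implemented; all names below are made up):

    import java.time.Duration;
    import java.time.Instant;
    import java.util.function.BooleanSupplier;

    // Illustration only: the upgrade behavior I expected from Skipper.
    // Keep the old version running until the new one has stayed healthy
    // for a full observation window; otherwise undeploy the new one.
    public class ExpectedUpgradeBehavior {

        public void upgrade(BooleanSupplier newVersionHealthy,
                            Runnable undeployOldVersion,
                            Runnable undeployNewVersion) throws InterruptedException {
            Instant deadline = Instant.now().plus(Duration.ofMinutes(5));
            while (Instant.now().isBefore(deadline)) {
                if (!newVersionHealthy.getAsBoolean()) {
                    // New version went unhealthy inside the window:
                    // roll it back and keep the old version running.
                    undeployNewVersion.run();
                    return;
                }
                Thread.sleep(Duration.ofSeconds(10).toMillis());
            }
            // New version stayed healthy for the whole window: cut over.
            undeployOldVersion.run();
        }
    }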

Steps to reproduce the issue

  • a stream called my-pipeline is deployed (v48)
  • one of its apps, called my-app, is deployed at v48
  • the stream is updated, changing my-app's version to the broken v49
  • my-app-v48 is killed
  • my-app-v49 is started but runs in a crash loop, constantly failing health checks and restarting

I checked Skipper's logs and can see the following lines:

    2020-07-23 12:32:27.785  INFO 1 --- [eTaskExecutor-4] o.s.c.s.server.deployer.ReleaseAnalyzer  : 
    Existing Package and Upgrade package both have no top level templates
    2020-07-23 12:32:27.786  INFO 1 --- [eTaskExecutor-4] o.s.c.s.server.deployer.ReleaseAnalyzer  : 
    Differences detected between existing and replacing application manifests.Upgrading applications 
    = [my-app]
    2020-07-23 12:32:27.944  INFO 1 --- [eTaskExecutor-4] o.s.c.s.s.d.s.HandleHealthCheckStep      : 
    Release my-pipeline-v49 has been DEPLOYED
    2020-07-23 12:32:27.944  INFO 1 --- [eTaskExecutor-4] o.s.c.s.s.d.s.HandleHealthCheckStep      : 
    Apps in release my-pipeline-v49 are healthy.
    2020-07-23 12:32:27.954  INFO 1 --- [eTaskExecutor-4] o.s.c.s.s.d.s.HandleHealthCheckStep      : 
    Deleting changed applications from existing release my-pipeline-v48 

It looks like HandleHealthCheckStep#handleHealthCheck is called with healthy set to true. I guess this happens because "deploying" or "deployed" (the first statuses of the app after a restart) are treated as healthy.
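
To illustrate my guess (this is my own sketch of the suspected race, not Skipper's actual source; the class name is made up), a one-shot check like the following would report healthy as soon as the pod briefly passes its readiness probe:

    import org.springframework.cloud.deployer.spi.app.DeploymentState;

    // My own illustration of the suspected race, not Skipper's real code:
    // treating a single "deployed" snapshot as success means a pod that
    // briefly passes readiness (before liveness kills it) looks healthy,
    // and the old release gets deleted prematurely.
    public class NaiveHealthCheck {

        public boolean isHealthy(DeploymentState state) {
            // One-shot check: "deployed" counts as success even though the
            // pod may enter a crash loop moments later and flip to "failed".
            return state == DeploymentState.deployed;
        }
    }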

Let me know if I need to provide more details.

Update: how the statuses look

  1. Liveness is at /actuator/health, which fails; readiness is at /actuator/info, which is OK.

k8s status

The pod's status is "0/1 Running"; after the readiness probe's delay it goes to "1/1 Running". After failed liveness probes the pod is restarted and goes back to "0/1 Running" with an increased restart count.

scdf status

Before the first 3 restarts, the app's status is "deploying", and the same for the stream. After the readiness delay it goes to "deployed", and the same for the stream.

After the first 3 restarts, the app's status is "failed" and the stream's is "partial". After the readiness delay it goes back to "deployed", and the same for the stream.

  2. I also just tried setting both liveness and readiness to the failing /actuator/health.

k8s status

The pod's status is "0/1 Running" with an increasing number of restarts.

scdf status

Before the first 3 restarts, the app's status is "deploying", and the same for the stream.

After the first 3 restarts, the app's status is "failed" and the stream's is "partial".

However, in both scenarios the existing version of the app is killed right after the stream's upgrade.
