How to limit Couchbase client from trying to connect to Couchbase server when it's down?

1.1k views Asked by At

I'm trying to handle Couchbase bootstrap failure gracefully and not fail the application startup. The idea is to use "Couchbase as a service", so that if I can't connect to it, I should still be able to return a degraded response. I've been able to somewhat achieve this by using the Couchbase async API; RxJava FTW.

Problem is, when the server is down, the Couchbase Java client goes crazy and keeps trying to connect to the server; from what I see, the class that does this is ConfigEndpoint and there's no limit to how many times it tries before giving up. This is flooding the logs with java.net.ConnectException: Connection refused errors. What I'd like, is for it to try a few times, and then stop.

Got any ideas that can help?

Edit:

Here's a sample app.

Steps to reproduce the problem:

  1. svn export https://github.com/asarkar/spring/trunk/beer-demo.
  2. From the beer-demo directory, run ./gradlew bootRun. Wait for the application to start up.
  3. From another console, run curl -H "Accept: application/json" "http://localhost:8080/beers". The client request is going to timeout due to the failure to connect to Couchbase, but Couchbase client is going to flood the console continuously.
3

There are 3 answers

0
Abhijit Sarkar On BEST ANSWER

Solved this issue myself. The client I designed handles the following use cases:

  1. The client startup must be resilient of CB failure/availability.
  2. The client must not fail the request, but return a degraded response instead, if CB is not available.
  3. The client must reconnect should a CB failover happens.

I've created a blog post here. I understand it's preferable to copy-paste rather than linking to an external URL, but the content is too big for an SO answer.

6
Matt Ingenthron On

The reason we choose to have the client continue connecting is that Couchbase is typically deployed in high-availability clustered situations. Most people who run our SDK want it to keep trying to work. We do it pretty intelligently, I think, in that we do an exponential backoff and have tuneables so it's reasonable out of the box and can be adjusted to your environment.

As to what you're trying to do, one of the tuneables is related to retry. With adjustment of the timeout value and the retry, you can have the client referenceable by the application and simply fast fail if it can't service the request.

The other option is that we do have a way to let your application know what node would handle the request (or null if the bootstrap hasn't been done) and you can use this to implement circuit breaker like functionality. For a future release, we're looking to add circuit breakers directly to the SDK.

All of that said, these are not the normal path as the intent is that your Couchbase Cluster is up, running and accessible most of the time. Failures trigger failovers through auto-failover, which brings things back to availability. By design, Couchbase trades off some availability for consistency of data being accessed, with replica reads from exception handlers and other intentionally stale reads for you to buy into if you need them.

Hope that helps and glad to get any feedback on what you think we should do differently.

1
Vivek Rai On

Start a separate thread and keep calling ping on it every 10 or 20 seconds, one CB is down ping will start failing, have a check like "if ping fails 5-6 times continuous then close all the CB connections/resources"