One of my colleagues asked me what the difference between Circuit Breaker and Retry is, but I was not able to answer him correctly. All I know is that a circuit breaker is useful when there is a heavy request payload, but that can also be achieved with retry. So when should I use Circuit Breaker and when should I use Retry?
Also, is it possible to use both on the same API?
Several years ago I wrote a resilience catalog to describe different mechanisms. I originally created this document for co-workers and later shared it publicly. Please allow me to quote the relevant parts here.
Retry
Categories: reactive, after the fact
There are situations where your requested operation relies on a resource that might not be reachable at a certain point in time. In other words, there can be a temporal issue which will be gone sooner or later. Issues of this sort can cause transient failures. With retries you can overcome these problems by attempting to redo the same operation at a specific moment in the future. To be able to use this mechanism, the following criteria should be met:
Let’s review them one by one:
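Setting the criteria aside, a minimal sketch of the mechanism could look like this in Python (the names retry, max_attempts, and base_delay are mine, chosen for illustration; the sleep schedule is a plain exponential backoff):

```python
import time

class RetryExhaustedError(Exception):
    """Raised when every attempt has failed."""

def retry(operation, max_attempts=3, base_delay=0.5):
    """Redo `operation` on failure, sleeping before each new attempt.

    This assumes the failure is transient (it may be gone later) and
    that the operation can be redone without irreversible side effects.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise RetryExhaustedError(f"gave up after {attempt} attempts")
            # Exponential backoff: 0.5s, 1s, 2s, ... between attempts.
            time.sleep(base_delay * 2 ** (attempt - 1))
```

You would call it as, say, retry(lambda: fetch_profile(42)), where fetch_profile is any operation that may fail transiently (the name is made up).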
Circuit Breaker
Categories: proactive, before the fact
This is one of the most complex patterns, mainly because it uses different states to define different behaviours. Before we jump into the details, let's see why this tool exists at all:
Circuit breaker detects failures and prevents the application from trying to perform the action that is doomed to fail (until it is safe to retry) - Wikipedia
So, this tool works as a mini data and control plane. The requests go through this proxy, which examines the responses (if any) and counts successive failures. If a predefined threshold is reached, then the transfer is suspended temporarily and requests fail immediately.
It prevents cascading failures. In other words, the transient failure of a downstream system should not be propagated to the upstream systems. By concealing the failure we are actually preventing a chain reaction (domino effect) as well.
It must somehow determine when it would be safe to operate as a proxy again. For example, it can use the same detection mechanism that was used during the original failure detection. So it works like this: after a given period of time it allows a single request to go through and examines the response. If it succeeds, the downstream system is treated as healthy again. Otherwise nothing changes (no request is transferred through this proxy); only the timer is reset.
The circuit breaker can be in any of the following states: Closed, Open, HalfOpen.
Closed: It allows any request. It counts successive failed requests; if that count exceeds a threshold, it transitions into Open.
Open: It rejects any request immediately. It waits a predefined amount of time, then transitions into HalfOpen.
HalfOpen: It allows only one request. It examines the response of that request: on success it transitions into Closed, on failure it transitions back into Open.
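As an illustration only (not any particular library's API), that state machine could be sketched in Python like this, ignoring thread safety; the class and parameter names are mine:

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""

class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "Closed", "Open", "HalfOpen"

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = self.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == self.OPEN:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Fail fast: the downstream is assumed to be unhealthy.
                raise CircuitOpenError("circuit is open")
            # The timer elapsed: allow a single probe request through.
            self.state = self.HALF_OPEN
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.state == self.HALF_OPEN or self.failures >= self.failure_threshold:
                # Suspend traffic and restart the timer.
                self.state = self.OPEN
                self.opened_at = time.monotonic()
            raise
        else:
            # The probe (or a normal call) succeeded: treat downstream as healthy.
            self.state = self.CLOSED
            self.failures = 0
            return result
```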
Resiliency strategy
The above two mechanisms/policies are not mutually exclusive; on the contrary, they can be combined via an escalation mechanism: if the inner policy can't handle the problem, it propagates it one level up to an outer policy.
When you try to perform a request while the Circuit Breaker is Open, it will throw an exception. Your retry policy could trigger on that and adjust its sleep duration (to avoid unnecessary attempts).
The downstream system can also inform the upstream that it is receiving too many requests with a 429 status code. The Circuit Breaker could also trigger on this and use the Retry-After header's value for its sleep duration.
So, the whole point of this section is that client and server can define a protocol between them for overcoming transient failures together.
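Putting the two sketches together, the escalation might look like the following (reusing RetryExhaustedError, CircuitOpenError, and CircuitBreaker from above; TooManyRequestsError is a hypothetical stand-in for an HTTP 429 response, not a real library type):

```python
import time

class TooManyRequestsError(Exception):
    """Hypothetical stand-in for a 429 response carrying a Retry-After value."""
    def __init__(self, retry_after):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # seconds, taken from the Retry-After header

def resilient_call(operation, breaker, max_attempts=4, base_delay=0.5):
    """Outer retry policy wrapped around an inner circuit breaker."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return breaker.call(operation)
        except CircuitOpenError:
            # The inner policy escalated: the breaker is Open, so sleep at
            # least as long as its reset window instead of hammering it.
            delay = max(delay, breaker.reset_timeout)
        except TooManyRequestsError as exc:
            # The downstream asked us to slow down: honor Retry-After.
            delay = exc.retry_after
        except Exception:
            delay *= 2  # plain transient failure: exponential backoff
        if attempt == max_attempts:
            raise RetryExhaustedError(f"gave up after {attempt} attempts")
        time.sleep(delay)
```

The point is the same as in the quoted text: the outer retry policy reacts differently depending on what the inner layers (the breaker or the server itself) tell it.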