Fault Tolerance in Distributed Systems
Posted: August 8, 2015
Perhaps we are nearly at the point where saying “distributed systems” is as redundant as “software program” always has been. For the moment, though, I want to consider an issue that is heightened by the nature of modern, asynchronous systems: “fault tolerance” generally, and “cascading failures” specifically.
Such issues arise more and more, and I was pleased to read a particularly lucid explanation of a popular and important design pattern used in many solutions: the Circuit Breaker pattern. It is on Martin Fowler’s blog, haha. I was kind of surprised by that, but only because I don’t google interesting problems in architecture and design nearly as often as I’d like.
I can’t add any value to what he’s written there, so instead I will just quote briefly:
The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all. Usually you’ll also want some kind of monitor alert if the circuit breaker trips.
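To make the quoted idea concrete, here is a minimal sketch in Python. It is my own illustration, not Fowler's code: the names CircuitBreaker, CircuitOpenError, failure_threshold and on_trip are all hypothetical, and I am assuming that "failures" means consecutive exceptions, with a success resetting the count.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker is open and the protected call is skipped."""


class CircuitBreaker:
    """Wraps a function and stops calling it after repeated failures.

    Assumption: the threshold counts consecutive failures, and any
    successful call resets the count to zero.
    """

    def __init__(self, func, failure_threshold=3, on_trip=None):
        self.func = func
        self.failure_threshold = failure_threshold
        self.on_trip = on_trip  # optional hook, e.g. to raise a monitor alert
        self.failure_count = 0

    @property
    def is_open(self):
        return self.failure_count >= self.failure_threshold

    def call(self, *args, **kwargs):
        if self.is_open:
            # Tripped: fail fast without making the protected call at all.
            raise CircuitOpenError("circuit is open; call not attempted")
        try:
            result = self.func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count == self.failure_threshold and self.on_trip:
                self.on_trip()
            raise
        self.failure_count = 0
        return result
```

Callers would invoke breaker.call(...) wherever they previously called the wrapped function directly, and treat CircuitOpenError as "the dependency is known to be down, don't wait on it". Note that this bare version never closes again once tripped.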
The post goes on to add a capability for attempting an automatic reset (at some specified interval) and discusses other real-world refinements (e.g. different thresholds for different sorts of errors). A hallmark of this sort of writing is that, at least for most of its intended audience, a simple example worked through in detail, plus pointers to additional flourishes and add-ons, is really all that is needed.
Check it out! And if you googled this topic, you have doubtless read or seen something about Netflix’s Hystrix, which says on its GitHub landing page:
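The automatic reset is usually described with a third "half-open" state: after the interval passes, one trial call is let through, and its outcome decides whether the breaker closes or re-opens. A rough sketch of that, again with hypothetical names (ResettingCircuitBreaker, reset_timeout, and an injectable clock parameter that I added only to make it testable), might look like this:

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the protected call is skipped."""


class ResettingCircuitBreaker:
    """Circuit breaker that allows a trial call after reset_timeout seconds."""

    def __init__(self, func, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.func = func
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # assumption: injectable so tests can fake time
        self.failure_count = 0
        self.opened_at = None  # time of the most recent trip, None if closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, *args, **kwargs):
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit is open; call not attempted")
        try:
            result = self.func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            # A failed trial call re-opens immediately and restarts the timer.
            if state == "half-open" or self.failure_count >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self.failure_count = 0
        self.opened_at = None
        return result
```

This sketch is not thread-safe and lets every caller through while half-open; a production version would probably want a lock and a single trial call at a time.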
Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.
It is a Java implementation. There are other articles linked there, and links to alternative circuit breaker implementations in Ruby, Java, a Grails plugin, C#, AspectJ, and Scala are listed at the bottom of the Fowler blog post.