Fault Tolerance in Distributed Systems

Perhaps we are nearly at the point where saying “distributed systems” is as redundant as “software program” always has been, but for the moment I want to consider how a specific issue is heightened by the nature of modern, asynchronous systems, and that issue is “fault tolerance” generally as well as “cascading failures” specifically.

More and more such issues arise — and I was please to read a particularly lucid explanation of a popular and important design pattern used in many solutions: the Circuit Breaker pattern.  On Martin Fowler’s blog — haha. I was kind of surprised by that — but only because I don’t google interesting problems in architecture and design nearly as often as I’d like.

I can’t add any value to what he’s written here, so instead i will just quote briefly:

The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all. Usually you’ll also want some kind of monitor alert if the circuit breaker trips.

There are added bits about adding a capability to attempt automatic reset (at some specified interval) and discussions of other real-world refinements (e.g. different thresholds for different sorts of errors), but a hallmark of this sort of writing is that, at least for most of its intended audience, a simple example provided in detail, and pointers to additional kinds of flourishes and add-ons, is really all that is needed.

Courtesy of Martin Flower

Check it out!  And if you googled this topic, doubtless you have read or seen something about NetFlix’ Hystrix, which says on that getHub landing page:

Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.

It is a java implementation; there are other articles linked here and links to alternative Circuit-breaker patterns in RubyJavaGrails PluginC#AspectJ, and Scala listed at the bottom of the Fowler blog post.



Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s