Thoughts on Software Development - Transient Failures Don't Exist

Transient Failures Don't Exist

In the simple example we have been discussing, the consequences of a failure appear immediately to the user. In a more complicated architecture there are many more tiers and many more dependencies. With more dependencies, more problems can possibly result from poor decisions on how to handle failure. Those dependencies include other applications in your own shop, third party libraries that you don't control, the internet, etc. For example, if your order queues fail, you cannot do orders. If your customer service app fails, you cannot retrieve member information. Unhandled failures propagate (like cracks) throughout your application.

Failures Cascade - a unhandled failure in one part of your system becomes a failure of your application.

In deciding on how to respond to failure we have to distinguish between two types of failure: transient failures and resource failures. Transient failures are not due to component failure, but a resource temporarily under load that cannot respond as fast as you had assumed. With resource failures you have to have an alternative strategy because a component is not available.

Transient failures occur for short periods of time. The typical response is to retry the operation after a short period of time. But questions still remain. How often do you retry? What is a short period of time? What do you with the data during the retry? On the other hand, remember that just as failures cascade, so do delays. While you are waiting or retrying, scarce resources are being used (threads, memory, TCPIP ports, database connections) that cannot be used for other requests.

Go back to our WCF example and look at the try/catch block. If we had to do a retry how would we change the logic? We will have to adopt a whole different strategy to handle failure, a retry loop inside the catch handler would not be enough because other failures could reoccur, and not every error would allow a retry. You have to design the entire routine around expecting failure, because as we discussed in our last post, failures happen.

Since slow responses usually come from resource bottlenecks, you have to treat them as failures if you are going to have reasonable availability. Which means that transient failure can soon look like resource failures. So what do you do? Retry for a limited amount time and then give up. In addition, never block on an I/O, timeout and assume failure.

From the point of view of architecture and design, there is really no such thing as a transient failure. If you have a transient failure, fail fast and treat it as a resource failure.