Thoughts on Software Development - Avoiding Single Points of Failure

Thursday, May 24, 2012

To avoid a single point of failure you needed to have redundancy.

There are several power lines between a power station and a customer. When an overload is detected by a circuit breaker, power can be rerouted along another line. Each individual line is normally run at reduced capacity so it can handle the increased load.

Redundancy, however, introduces problems. Business applications have state. One electron is as good as another and can be easily rerouted; application state is tied to particular users, sessions, or transactions. I need to access parts of my current order. If the network connection I am on gets broken, substituting someone else's order does not work. Parallel processing of mathematical algorithms (such as MapReduce, or multi-threaded versions of Linq) cannot solve this problem.

You need to remember information about your users, or what products are on order. Some of this data changes rarely if at all; some of this data changes dynamically. Some of this data is very valuable, some of it is not. In a non-cloud application you might use data base mirroring. But you pay a performance penalty for a database mirror because the transaction must commit on both the original and the mirror. Imagine the cost (latency and throughput) of trying to keep redundant databases in different geographic areas in sync. Hence, you must deal with state explicitly.

You have to reduce the parts of your application that handle state to a minimum. Loss of a component means the loss of application state that it holds.

This is classic advice for scaling systems, but is more critical in the cloud. In the past we created stateless middle tier components for easier scalability, and relied on the data tier to handle scale. You then rely on clusters or technology built into the database (transaction logs, etc.). While stateless components are still an excellent idea, we cannot always rely on the database to scale, and the possibility of database failure is a large risk.

The business layer and the domain layer should be stateless, and to the greatest extent possible the clients should hold whatever state they can. Stateless services can be added or removed easily to handle changes in demand or the failure of a service instance, since you have decoupled functionality. Try to build services that are as atomic as possible because that makes it easier to scale or recover from failure by using redundancy. Here, atomic does not mean small.

Suppose you put customer information in the same service as your catalog service. If the customer information service goes down, so does the catalog. You should be able to check in people to your flight even if you cannot give the prices for tomorrow's flights.

There are other techniques you can use. For example, if the bill paying service is not available, you can usually use a transactional queue, and just retry the payment until the service is available. That is one of the reasons why banks, for example, say that they need 24 hours for an electronic payment that could be processed immediately. The odds that the payment service will be down for 24 hours are miniscule.

Also don't forget infrastructure pieces as to where logs are stored. You should probably back them up frequently.

So what about where state has to be stored, say in the data tier? This is a complicated problem which I have discussed this before in both a blog post and a presentation . It basically boils down to the question: what is the acceptable level of data loss? As it turns out, in many applications you do not have to be absolutely consistent in all places, and you can relax consistency constraints to get scalability and reliability. Now you may think this is ridiculous, data can never be lost. But think about how business is actually done.

Airline reservation systems separate out the flight query database, from the transactional database where flights are booked. As a result, occasionally a flight or a price you thought was available is not. But if they did not do that, the performance of making reservations would be very poor. More business would be lost under strict consistency than under relaxed consistency constraints. Why does Amazon use an email system for notifying you of your book order? To scale the user interface and to avoid performance problems due to abandoned shopping carts. You have to ask the question: What is the cost of an apology for the data loss?

What kind of availability are your customers actually willing to pay for, as opposed to what they say they want? Given a choice, do they really want absolute consistency in all cases? If you used replication over a short period of time to another data center, and a data center was destroyed by a hurricane, and a certain amount of data was lost, how terrible would that be?

Consider grouping your components in units of failure. If you have components of type α, β, and γ, you could put all α on a single host, all β on a separate host, or all γ on a separate host. Or you could put groups of α, β, γ on separate hosts. The latter would cause a complete failure on a loss assuming affinity to a given machine. The former would only result in a failure of α functionality.

What do you do if your components fail? Do you have to reroute traffic to another data store, or data center? Do you need to add more instances of a component? You have to monitor your components and understand why they fail. Under certain circumstances, you can degrade performance. Caching might help in keeping your application running under failure.

Degrade gracefully and predictably. Know what you can live without.

5/24/2012 7:44:21 AM (Eastern Standard Time, UTC-05:00) | Comments [0] | Cloud Computing | SOA | Software Development

Comments are closed.