Designing for Failure has been around a lot longer than cloud computing. As we have discussed in several other blog posts, cloud computing, as opposed to hosting in the cloud, is about the ability to acquire or release computing resources as necessary. Acquiring more resources allows you to keep up with demand, or to compensate for a failed instance of a resource.
You must examine every source of dependency in your application: third party libraries, hardware, software interfaces between parts of your own application, TCPIP ports, DNS servers, message queues, database drivers, database size, latencies, to name just a few. These include third party services such as credit card processors, fraud detection services, and geocoding services.
You also have to examine your queries because small queries can become large overnight as you scale. This is why search providers limit the result set that they return. See what kind of joins your ORM is producing when it handles inheritance. Look at the number of objects coming back from a DCOM or RMI call.
Any one of these could fail, or cause latency. As we discussed in a previous post, any potential long latency has to be treated as a potential failure. You need to avoid single points of failure because they are potential bottlenecks or failure points.
Acquiring more resources costs more money. So every strategy is a tradeoff between keeping the application responding (available and scalable) and how much it costs. This is driven of course, by what customers are willingness to pay. Every strategy has to undergo a cost benefit analysis.
The more you approach 100% availability, however, the more the law of diminishing return sets in. I cannot here tell you which problems you should solve, and which you can safely ignore. It depends on your application and your customers. I can tell you that every component is a potential source of failure.
Avoid single points of failure. Accept the fact that you have to build a distributed system.