Windows Azure provides two storage mechanisms: SQL Azure and Azure Storage tables. Which one should you use?
Can Relational Databases Scale?
SQL Azure is basically SQL Server in the cloud. To get meaningful results from a query, you need a consistent set of data.
Transactions allow for data to be inserted according to the ACID principle: all related information is changed together.
The longer the database lock manager keeps locks, the higher the likelihood two transactions will modify the same data. As transactions wait for locks to clear, transactions will either be slower to complete, or transactions will time out and must be abandoned or retried. Data availability decreases.
Content distribution networks enable read-only data to be delivered quickly to overcome the speed of light boundary. They are useless for modifiable data. The laws of physics drive a set of diminishing economic returns on bandwidth. You can only move so much data so fast.
Jim Gray pointed out years ago that computational power gets cheaper faster than network bandwidth. It makes more economic sense to compute where the data is rather than moving it to a computing center. Data is often naturally distributed.
Is connectivity to that data always possible? Some people believe that connectivity will be always available. Cell phone connectivity problems, data center outages, equipment upgrades, and last mile problems indicate that is never going to happen.
Computing in multiple places leads to increased latency. Latency means longer lock retention. Increased locked retention means decreased availability.
Most people think of scaling in terms of large number of users: Amazon, Facebook, or Google. Latency also leads to scalability based on geographic distribution of users, transmission of a large quantity of data, or any bottleneck that lengthens the time of a database transaction.
The economics of distributed computing argue in favor of many small machines, rather than one large machine. Google does not handle its search system with one large machine, but many commodity processors. If you have one large database, scaling up to a new machine can cost hours or days.
The CAP Theorem
Eric Brewer’s CAP Theorem summarizes the discussion. Given the constraints of consistency, availability, and partitioning, you
can only have two of the three. We are comfortable with the world of single database/database cluster with minimal latency where we have consistency and availability.
If we are forced to partition our data should we give up on availability or consistency? Let us first look at the best way to partition, and then ask whether we want consistency or availability.
What is the best way to partition?
If economics, the laws of physics, and current technology limits argue in favor of partitioning, what is the best way to partition? Distributed objects, whether by DCOM, CORBA, or RMI failed for many reasons . The RPC model increases latencies that inhibit scalability. You cannot ignore the existence of the network. Distributed transactions fail as well because once you get beyond a local network the latencies with two-phase commit impede scalability.
Two better alternatives exist: a key value/type store such as Azure Storage Services, or partitioning data across relational databases without distributed transactions.
Storage Services allow multiple partitions of tables with entries. Only CRUD operations exist: no foreign key relations, no joins, no constraints, and no schemas. Consistency must be handled programmatically. This model works well with tens of hundreds of commoity processors, and can achieve massive scalability.
One can partition SQL Azure horizontally or vertically. With horizontal partitioning we divide table rows across the database. With vertical partitioning we divide table columns across databases. Within the databases you have transactional consistency, but there are no transactions across databases.
Horizontal partitioning works especially well when the data divides naturally: company subsidiaries that are geographically separate, historical analysis, or of different functional areas such as user feedback and active orders. Vertical partitioning works well when updates and queries use different pieces of data.
In all these cases we have to deal with data that might be stale or inconsistent.
Consistency or Availability?
Ask a simple question: What is the cost of an apology?
The number of available books in Amazon is a cached value, not guaranteed to be correct. If Amazon ran a distributed transaction over all your shopping cart orders, the book inventory system, and the shipping system, they could never build a massively scalable front end user interface. Transactions would be dependent on user interactions that could range from 5 seconds to hours, assuming the shopping cart is not abandoned. It is impractical to keep database locks that long. Since most of the time you get your book, availability is a better choice that consistency.
Airline reservation systems are similar. A database used for read-only flight shopping is updated periodically. Another database is for reservations. Occasionally, you cannot get the price or flight you wanted. Using one database to achieve consistency would make searching for fares. or making reservations take forever.
Both cases have an ultimate source of truth: the inventory database, or the reservations database.
Businesses have to be prepared to apologize anyway. Checks bounce, the last book in the inventory turns out to be defective,
or the vendor drops the last crystal vase. We often have to make records and reality consistent.
Software State is not the State of the World
We have fostered a myth that the state of the software has to be always identical to the state of the world. This often makes
software applications difficult to use, or impossible to write. Deciding what the cost of getting it absolutely right is a
business decision. As Amazon and the airlines illustrate, the cost of lost business and convenience sometimes offsets the occasional problems of inconsistent data. You must then design for eventual consistency.
Scalability is based on the constraints of your application, the volume of data transmitted, or the number and geographic distribution of your users.
Need absolute consistency? Use the relational model. Need high availability? Use Azure tables, or the partitioned relational
model. Availability is a subjective measure. You might partition and still get consistency.
If the nature of your world changes, however, it is not easy to shift from the relational model to a partitioned model.