This is Part 3 of the series on NoSQL solutions. If you haven’t read Part 1 and Part 2, please read them first.
In this section I would like to analyze some of the theorems and concepts driving data architecture, how they have evolved, and why understanding them is essential if you are looking to design or architect a database solution. I would also like to clear up some misunderstandings that are common when talking about these concepts. So without further ado, let’s get going.
The three concepts/theorems that we will cover here are (drumroll please :) ): ACID, BASE and CAP.
ACID properties are something that anyone coming from the transactional SQL world knows well. But for the sake of this article, let us quickly summarize them.
I also need to define them first in order to clarify some misconceptions later in this article.
ACID properties apply to every SINGLE transaction:
Atomicity – Simply put, it means “All or nothing”. If any part of the transaction fails, the entire transaction is rolled back and nothing is saved.
Consistency – This means that the transaction preserves all the rules defined in the database, such as unique key, foreign key and not null constraints.
Isolation – This means each transaction is executed independently within a specific boundary without being impacted by any other concurrent operation.
Providing isolation is the main goal of concurrency control: at the strictest isolation level, concurrent transactions appear to have executed one after another (serialized).
Durability – Simply put, this means “committed data will never be lost”. That is, the data of any successful transaction survives subsequent failures such as crashes and errors.
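To make atomicity and consistency a little more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `accounts` table, the `transfer` helper and the balance rule are assumptions made up for this illustration, not part of any particular system. The idea is that a transfer which would break a database rule is rolled back as a whole.

```python
import sqlite3

# In-memory database; the table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This second step violates the CHECK rule when src is short of funds.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

Calling `transfer(conn, "alice", "bob", 500)` returns `False` and leaves both balances untouched: the credit to bob that had already been applied is undone along with the failed debit.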
The BASE model evolved out of the need to shift the emphasis from consistency to availability.
Before explaining it, here is something more interesting. If you have studied chemistry, then you will have come across ACID vs BASE as it relates to an aqueous solution.
A pH of 7.0 indicates a neutral solution such as pure water, while less than 7 means it’s an acidic solution and greater than 7 indicates a basic (or alkaline) solution. So basically they are the two ends of the pH scale.
Now change the area of discussion to data architecture, and here again ACID and BASE represent the opposite ends of the consistency-availability spectrum. Isn’t that interesting? :)
Sorry, now back to the topic. BASE means:
Basically Available – The system shall be available and respond to any request, but there is no guarantee that the data returned is the latest or is consistent.
Soft state – This means that the state of the system may change over time even when no input is provided (due to eventual consistency); thus the system is in a soft state.
Eventually consistent – This means that the system will eventually become consistent across the network once updates stop arriving. No immediate consistency of data across partitions is provided.
Don’t beat your head too much trying to understand what each term in BASE means. The most important aspect is to understand that the BASE paradigm opens up a design approach which prefers high availability at the cost of consistency and integrity across partitions. There are many use cases/scenarios where the BASE model is perfectly suitable. In fact, many distributed solutions, including cloud services, use a mixture of the BASE and ACID models as applicable.
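If “eventually consistent” still feels abstract, a toy simulation may help. This is only a sketch under simple assumptions: the `Primary` and `Replica` classes are hypothetical, everything runs in one process, and "the network" is just a queue of updates that have not been shipped yet.

```python
from collections import deque


class Replica:
    """A copy of the data; it may lag behind the primary."""

    def __init__(self):
        self.data = {}


class Primary:
    """Accepts writes immediately and ships them to replicas later."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.pending = deque()  # updates not yet applied to the replicas

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))  # replication is deferred

    def replicate(self):
        """Apply all queued updates to every replica, in write order."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.data[key] = value
```

Between a `write` and the next `replicate`, a read from a replica returns old (or no) data even though the system answered the request. That gap is the "soft state"; once `replicate` has run and no new writes arrive, every copy agrees again.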
The CAP Theorem was put forward by Eric Brewer in 2000 and applies to any distributed data system. Recently it has been much talked about with the advent of NoSQL systems. CAP stands for:
Consistency – All the nodes in the system see the same data at the same time, i.e., data is consistent across partitions.
Availability – The system shall be available and provide a response to any request.
Partition Tolerance – This means that the system continues to operate even when a network failure cuts some nodes off from the others, i.e., the system is tolerant to network partition failures.
The Consistency Confusion:
Now then, CAP is a totally different concept from ACID. ACID applies to transactions. CAP applies to wide area systems that are partitioned. Think about Facebook, Google etc.
Also, there is a lot of confusion about the “C”, the consistency in the ACID and CAP models. Please note that these are two totally different types of consistency even though they have the same name.
Consistency in ACID refers to consistency achieved through DB constraints, whereas Consistency in CAP refers to consistency across partitions or copies of the DB.
The moment you require partitions, CAP comes into play; but if you don’t have partitions then don’t worry about CAP, just focus on building solid transactional systems!
It is important that we look a little deeper, beyond the definition of CAP, in this article. Why do you need CAP? Well, let me answer this by asking another question: if you are designing wide area data systems, what would your criteria be? In a high latency network world, performance is an important factor, right? If your data audience is distributed and performance is one of your criteria, then you need to partition your data, and in comes the CAP theorem.
But here is the catch: you can only ever have “two of the three” in C-A-P as part of your design solution, i.e., either CA or CP or AP. Now we need to realize one more important point. In a partitioned wide area data solution, partition failure (and therefore partition tolerance) is not something we get to choose. There are going to be network failures, so “P” becomes a default choice. Now we are left with C and A. In other words, we are left with two choices: CP or AP, i.e., which one do you need, Consistency or Availability?
Here you might have a valid question: can’t I have all three? Well, realistically that is impossible while a partition is in progress. That’s why you have to select two. Let me give an example:
A user tries to perform an operation and it times out (network failure). How can you handle this situation?
Should you fetch the data from another partition and risk possible inconsistency in order to achieve Availability?
Or should you cancel the operation in order to preserve consistency and lose availability?
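The two options can be put side by side in a few lines of code. This is a minimal sketch, assuming a hypothetical `Store` with one primary copy and one possibly stale replica (both plain dicts), and a flag that stands in for the network failure.

```python
class PartitionError(Exception):
    """Raised when the primary copy cannot be reached."""


class Store:
    def __init__(self, primary, replica, primary_reachable=True):
        self.primary = primary  # authoritative copy
        self.replica = replica  # local copy, may be out of date
        self.primary_reachable = primary_reachable

    def _read_primary(self, key):
        if not self.primary_reachable:
            raise PartitionError("primary unreachable")
        return self.primary[key]

    def read_cp(self, key):
        """Prefer consistency: fail rather than return possibly stale data."""
        return self._read_primary(key)

    def read_ap(self, key):
        """Prefer availability: fall back to the replica, which may be stale."""
        try:
            return self._read_primary(key)
        except PartitionError:
            return self.replica.get(key)
```

While the primary is reachable both methods return the same value. During a partition, `read_cp` raises an error (consistent but unavailable) and `read_ap` returns whatever the replica has (available but possibly out of date).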
You can do only one of these, not both. This is the classic case of why we need to select either C or A. There is no one solution here.
For example, Yahoo’s massively parallel and distributed database called PNUTS maintains the master copy locally, based on the geographic location of the user. This reduces latency and increases availability but risks inconsistency in case this partition fails (remote copies maintained asynchronously might not have the latest data).
On the other hand, Facebook’s strategy is just the opposite: the master copy is in one central location and local copies are maintained closer to the users, which could result in potentially old local data. However, when users update their pages, the update goes to the master copy directly, as do all of that user’s reads for a short time, despite the higher latency. After 20 seconds, the user’s traffic reverts to the closer copy, which by that time should reflect the update.
As a NoSQL architect, it is important to understand the design choices available to you before you architect a data solution, and hence an understanding of the various theorems becomes very important.
One more important point: can we have systems that comply with both CAP and ACID? Yes, but instead of looking at them as “CAP” versus “ACID”, it is important to look at each of the 7 properties separately, see what is important and what is not, and work out how you can design systems that meet your important factors.
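Going back to the Facebook-style strategy described above, the "read from the master for a short while after your own write" rule can be sketched roughly as follows. The `Router` class, its dict-based copies and the injectable clock are assumptions for illustration only; this is not how Facebook actually implements it.

```python
import time

STICKY_SECONDS = 20  # the window mentioned in the text


class Router:
    """Routes a user's reads to the master copy shortly after that user writes."""

    def __init__(self, master, local, clock=time.monotonic):
        self.master = master  # central, authoritative copy
        self.local = local    # nearby copy, updated asynchronously elsewhere
        self.clock = clock    # injectable so the behaviour can be tested
        self.last_write = {}  # user -> time of that user's most recent write

    def write(self, user, key, value):
        self.master[key] = value
        self.last_write[user] = self.clock()

    def read(self, user, key):
        wrote_at = self.last_write.get(user)
        recent = wrote_at is not None and self.clock() - wrote_at < STICKY_SECONDS
        source = self.master if recent else self.local
        return source.get(key)
```

The user who just made a change sees it immediately, at the price of a slower read from the master, while everyone else keeps reading the fast local copy and may briefly see old data. Replication from master to local is deliberately left out of the sketch.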
CAPping it off for the day! Talk to you soon with more on NoSQL.