April 27, 2017
Google recently announced that its flagship wide-area database, Spanner, is now available on Google Cloud. Google Spanner is the next-generation globally distributed database built inside Google and announced to the world through the paper published at OSDI 2012 [1]. This article explores the implications of Google Spanner, in particular for the NoSQL world.
The three important properties of a distributed system are:
– Consistency: Defined with respect to an atomic read-write register – each operation appears to take effect as if it had exclusive access to the data item, and operations are sequenced one after the other. Any read that begins after a write completes must return the value of that write or of a later write. (A minimal sketch of this read rule appears after this list.)
– Availability: Every request received by a non-failing node in the system must result in a response, i.e., every request must eventually be answered. This is a strong requirement: even in the face of network partitions, every request must eventually terminate with a response.
– Partition tolerance: When the network is partitioned, nodes are split into groups and any message sent from one group to another is lost.
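To make the read rule in the consistency definition concrete, here is a toy sketch in Python. It is illustrative only – the Op record and violates_read_rule names are hypothetical – and it tests just one necessary condition, not full linearizability.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Op:
    """A completed operation on a single register, with real-time bounds."""
    kind: str      # "write" or "read"
    value: int
    start: float   # invocation time
    end: float     # response time

def violates_read_rule(history: List[Op]) -> bool:
    """Flags a read that began after a write completed yet returned an older value.
    Values are assumed to be written in increasing order, so 'older' means 'smaller'.
    This checks only one necessary condition of consistency, not full linearizability."""
    for r in (op for op in history if op.kind == "read"):
        completed = [w.value for w in history if w.kind == "write" and w.end < r.start]
        if completed and r.value < max(completed):
            return True
    return False

# The last read starts after write(2) has completed but still returns 1: a violation.
history = [Op("write", 1, 0.0, 1.0), Op("write", 2, 2.0, 3.0),
           Op("read", 2, 3.5, 4.0), Op("read", 1, 4.5, 5.0)]
print(violates_read_rule(history))  # True
```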
Brewer, in his PODC 2000 talk, conjectured that of the three properties – Consistency, Availability and Partition tolerance – only two are achievable in any distributed system. Seth Gilbert and Nancy Lynch formalized this notion and proved a theorem to the effect that, in an asynchronous distributed system, it is impossible to build an atomic read-write object that guarantees both availability and atomic consistency in the face of arbitrary failures, a.k.a. partitions [2].
The last few years have seen significant advances by NoSQL databases such as Cassandra, MongoDB and HBase, among others, and their adoption across enterprises. The key features of NoSQL databases include schema flexibility, linear scalability and key-based sharding of data across a distributed set of nodes. They were primarily intended for workloads with high read-write ratios and are constrained by the CAP theorem – implying that under a network partition, a database can achieve only one of availability or consistency. NoSQL databases such as MongoDB, Redis, Google BigTable and HBase (and HDFS) sacrifice availability and ensure consistency in the face of partitions (CP systems). Cassandra, Riak and Voldemort were based on Amazon Dynamo and keep the system available, sacrificing strict consistency (AP systems). They provide eventual consistency, a relaxed form of consistency compared to the stricter guarantees provided by relational databases.
Google Spanner, on the other hand, provides ACID transactions across a wide-area distributed system. It provides a strict form of consistency known as linearizability [3], and it is the first system to do so at global scale [1]. Spanner assigns global timestamps to transactions across a distributed set of nodes; the timestamps reflect serialization order. The key to Spanner’s global timestamps is the TrueTime API and its implementation. The TrueTime API abstracts and exposes clock uncertainty and allows applications to reason about it, while the TrueTime implementation in Google’s datacenters bounds the uncertainty to less than 10 milliseconds. This uncertainty is small compared to, say, NTP, where the deltas between clocks across a distributed system can be as high as 250 milliseconds. Google’s TrueTime implementation achieves this by using two kinds of physical time references – atomic clocks and GPS receivers – deployed in its datacenters.
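To illustrate the idea of exposing uncertainty rather than hiding it, here is a minimal sketch of what a TrueTime-style interface could look like. This is not Google's actual API or implementation; the class, the method names and the epsilon bound are illustrative assumptions.

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    """An interval assumed to contain the true absolute time."""
    earliest: float  # seconds since the epoch
    latest: float

class TrueTime:
    """Hypothetical TrueTime-style clock. epsilon is the instantaneous
    uncertainty bound, which Google keeps to a few milliseconds; here it
    is just a constant for illustration."""

    def __init__(self, epsilon_ms: float = 7.0):
        self.epsilon = epsilon_ms / 1000.0

    def now(self) -> TTInterval:
        t = time.time()  # the local, possibly drifting, clock
        return TTInterval(earliest=t - self.epsilon, latest=t + self.epsilon)

    def after(self, t: float) -> bool:
        """True only if t has definitely passed on every clock in the system."""
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        """True only if t has definitely not yet been reached anywhere."""
        return self.now().latest < t
```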
The unit of data exposed is known as a tablet, in line with BigTable terminology. A tablet stores timestamped key-value pairs and is persisted in B-tree-like files and a write-ahead log. A spanserver, which is typically responsible for 100 to 1000 tablet instances, implements a Paxos state machine [4] on top of each tablet. Writes must initiate the Paxos protocol at the leader, while any replica that is sufficiently up to date can serve a read request. Spanservers also implement a transaction manager; transaction managers coordinate a two-phase commit for transactions that span Paxos groups. A transaction that involves only a single Paxos group (the majority of transactions) does not need to go through the transaction manager. F1, an advertising backend built by Google, was the first “client” of Google Spanner and used it in production. In essence, Spanner provides the following features: semi-relational tables, a SQL-based query language, and ACID transactions.
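A rough sketch of how a transaction manager could drive two-phase commit across Paxos groups is shown below. The PaxosGroup interface and method names are assumptions for illustration, not Spanner's actual code; in Spanner each participant's prepare and commit records would themselves be replicated through Paxos within its group.

```python
from typing import Iterable, List, Protocol

class PaxosGroup(Protocol):
    """Hypothetical participant interface: each group is a Paxos-replicated shard."""
    def prepare(self, txn_id: str) -> bool: ...
    def commit(self, txn_id: str) -> None: ...
    def abort(self, txn_id: str) -> None: ...

def two_phase_commit(txn_id: str, groups: Iterable[PaxosGroup]) -> bool:
    """Classic 2PC driven by a coordinator (the transaction manager).

    Phase 1 asks every participant to prepare and vote; phase 2 commits only
    if all participants voted yes, otherwise every participant is told to abort.
    """
    participants: List[PaxosGroup] = list(groups)
    if all(g.prepare(txn_id) for g in participants):  # phase 1: prepare / vote
        for g in participants:
            g.commit(txn_id)                          # phase 2: commit everywhere
        return True
    for g in participants:
        g.abort(txn_id)                               # phase 2: abort everywhere
    return False
```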
A question may arise in the mind of the reader: does Google Spanner overcome the CAP theorem, given that it provides full ACID consistency? This needs to be addressed. The CAP theorem is inviolable – it must hold absolutely – and Google Spanner does not overcome it. It is a CA system from a user’s perspective, but under strict scrutiny it is technically a CP system: it sacrifices availability in the face of a network partition but is always consistent. Why call it a CA system from the user’s perspective? Because Google has built redundancy into its network connections and power supplies and taken enough care to make a full network partition a rare event. Under regular operation the system comes very close to a CA system, as it is strongly consistent and nearly always available (close to five nines of availability) [5]. This implies that users of Google Spanner may never perceive its unavailability, making it an effectively CA system in practice. In theory, however, under a network partition Spanner sacrifices availability but remains strongly consistent, in line with the CAP theorem.
Several other distributed systems, such as Kafka, also rely on consensus indirectly by using ZooKeeper. ZooKeeper implements an atomic broadcast protocol known as ZooKeeper Atomic Broadcast (ZAB) [6]; it has been shown that consensus and atomic broadcast are equivalent problems in distributed systems. Many datastores, including HDFS and Hive, likewise use consensus indirectly, as they store coordination metadata in a ZooKeeper cluster.
Recent work on NoSQL databases has focused on providing stricter consistency using algorithms such as Paxos (Cassandra) [7] or ZAB (HBase through ZooKeeper). HBase is also moving towards a more modern consensus algorithm known as Raft [8]. It must be remembered that Paxos-family algorithms guarantee only safety, not liveness. Moreover, Paxos may not be usable across a wide-area network due to performance limitations [9]. This is why most NoSQL databases do not provide consistency across a wide-area distributed system – including the CP systems that sacrifice availability in the face of partitions and remain strongly consistent.
Spanner limits clock uncertainty to less than 10 milliseconds by using atomic clocks and GPS. This, coupled with multiple redundancies in Google’s infrastructure, limits the unavailability of Spanner to a level imperceptible to users, which is why Spanner is effectively a CA system. This is evident from the fact that users of Spanner do not need to write code to handle such outages [5]. More importantly, the TrueTime API lets one reason about global time across sets of Paxos state machines. Whereas the Paxos and related algorithms implemented in NoSQL databases do not support transactions across Paxos groups, Spanner implements two-phase commit (2PC) [10] across Paxos groups.
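Continuing the hypothetical TrueTime sketch above, the following illustrates the "commit wait" idea described in the Spanner paper: assign a timestamp at the upper end of the uncertainty interval and delay acknowledging the commit until that timestamp is guaranteed to have passed on every clock, so that timestamp order matches real-time commit order. This is a simplified sketch, not Spanner's implementation; apply_writes is a hypothetical stand-in for replication.

```python
import time

def commit_with_wait(tt, apply_writes):
    """Sketch of a Spanner-style 'commit wait' rule (illustrative only).

    `tt` is a TrueTime-like clock as sketched earlier; `apply_writes(ts)` is a
    stand-in for replicating the transaction's writes at timestamp ts.
    """
    s = tt.now().latest          # commit timestamp: no earlier than the true 'now'
    apply_writes(s)              # replicate/apply the writes at timestamp s
    while not tt.after(s):       # commit wait: roughly 2 * epsilon in the worst case
        time.sleep(0.001)        # a real system would overlap this with replication
    return s                     # only now is the commit acknowledged to clients;
                                 # any transaction that starts later gets a larger timestamp
```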
CockroachDB is an open-source database inspired by Google Spanner. It overcomes a limitation of Spanner – that it runs only on Google infrastructure and may result in vendor lock-in – and lets one install it on non-Google private or public clouds. The main difference is that, since CockroachDB does not have Google’s infrastructure to implement a TrueTime API that tightly synchronizes clocks across the distributed system, the consistency guarantee it provides is serializability rather than the linearizability that Spanner provides. More information about CockroachDB’s clock-synchronization approach can be found in [11].
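For intuition, [11] discusses combining loosely synchronized physical clocks with logical counters in the style of hybrid logical clocks (HLC). The following is a minimal, generic HLC sketch with illustrative names; it is not CockroachDB's actual code.

```python
import time

class HybridLogicalClock:
    """Minimal hybrid logical clock (HLC) sketch, in the spirit of [11].

    A timestamp is a (wall_time, logical) pair: wall time tracks loosely
    synchronized physical clocks (e.g. via NTP), while the logical counter
    breaks ties so causally related events are always ordered, without
    requiring TrueTime-style tightly bounded uncertainty."""

    def __init__(self):
        self.wall = 0.0
        self.logical = 0

    def now(self):
        """Timestamp for a local or send event."""
        pt = time.time()
        if pt > self.wall:
            self.wall, self.logical = pt, 0
        else:
            self.logical += 1
        return (self.wall, self.logical)

    def update(self, msg_wall, msg_logical):
        """Merge a timestamp carried on a message received from another node."""
        pt = time.time()
        if pt > self.wall and pt > msg_wall:
            self.wall, self.logical = pt, 0
        elif msg_wall > self.wall:
            self.wall, self.logical = msg_wall, msg_logical + 1
        elif self.wall > msg_wall:
            self.logical += 1
        else:  # equal wall components: advance past both logical counters
            self.logical = max(self.logical, msg_logical) + 1
        return (self.wall, self.logical)
```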
NuoDB is one of the early entrants into what are popularly called NewSQL databases [12]. NewSQL databases provide ACID transactions like traditional databases, allow linear scalability like NoSQL datastores, and keep the relational model of traditional databases. OLAP data warehouses like Vertica or Greenplum are not NewSQL, as they deal with read-only queries on OLAP workloads. One definition of NewSQL databases says that they support a lock-free concurrency control scheme and a transparent sharding middleware that abstracts a shared-nothing distributed architecture; the concurrency control scheme is either multi-version concurrency control (MVCC) or a combination of MVCC and two-phase locking (2PL). NuoDB is a classic elastic SQL database in that it provides ACID transactions and is horizontally scalable, which is why many companies use NuoDB in production. NuoDB and Google Spanner are similar in many respects, including elastic SQL support and ACID transactions. Another important similarity is NuoDB’s table partitioning scheme, which is similar to the stored-indices feature of Google Spanner that groups data likely to be used together. However, NuoDB does not have the concept of TrueTime. Though NuoDB can be configured to provide ACID transactions across a WAN, performance is likely to be significantly affected [12].
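To illustrate the MVCC scheme mentioned above, here is a toy multi-versioned key-value store in which reads see a consistent snapshot at their chosen timestamp and never block writers. The names are illustrative assumptions and do not correspond to any particular product.

```python
class MVCCStore:
    """Toy multi-version store: each write adds a new (commit_ts, value) version;
    a read at snapshot_ts sees the newest version committed at or before it."""

    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value) versions

    def write(self, key, value, commit_ts):
        self.versions.setdefault(key, []).append((commit_ts, value))

    def read(self, key, snapshot_ts):
        visible = [(t, v) for (t, v) in self.versions.get(key, []) if t <= snapshot_ts]
        return max(visible)[1] if visible else None

store = MVCCStore()
store.write("balance", 100, commit_ts=10)
store.write("balance", 80, commit_ts=20)
print(store.read("balance", snapshot_ts=15))  # 100: the snapshot predates the second write
```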
The NewSQL community has acknowledged that Google Spanner has the most novel concurrency control/replica consistency scheme among the NewSQL databases. Clustrix, NuoDB, MemSQL and VoltDB are other prominent NewSQL databases.
We have deliberately avoided getting into precise definitions of NoSQL datastores – it can be debated whether Google Spanner itself is just another NoSQL datastore. We take the Not-only-SQL movement, as it is widely known, to capture the set of datastores that offer schema flexibility and easy replication and are eventually consistent (BASE – Basically Available, Soft state, Eventually consistent). One could argue over whether Google Spanner is itself a NoSQL datastore, but we do not intend to go down this path. The key point is that Google Spanner sounds the death knell for the traditional NoSQL datastores – and nothing beyond.
Google Spanner provides ACID transactions across a wide-area distributed system through its unique TrueTime API and Google’s reference implementation of that API. Most NoSQL and even NewSQL systems cannot provide ACID transactions over a wide-area distributed system. While Spanner is technically a CP system that sacrifices availability in the face of partitions in order to remain strongly consistent, Google’s implementation ensures it is effectively a CA system in practice. Moreover, Spanner implements 2PC across Paxos groups, allowing users to specify arbitrary transaction boundaries and obtain ACID properties across them. However, the performance of Spanner at scale must still be verified – this is a topic for further work. We shall be taking up an implementation of the TPC-C benchmark with Spanner and evaluating it vis-à-vis other similar systems. One also needs to examine the SQL support that Google Spanner currently provides and its SQL roadmap. But Google’s announcement of making Spanner available in its cloud environment must send shivers down the spine of people in the NoSQL database ecosystem.
It is interesting to note that certain distributed algorithms built on top of synchronized clocks can derive significant performance benefits from Google Spanner. Such algorithms could use synchronized clocks to reduce communication (using timestamps to avoid message exchanges) or to reduce the state that must be maintained (using timestamps as a garbage-collection window). See Barbara Liskov’s excellent paper on practical uses of synchronized clocks in distributed systems [13].
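As a small example of the first idea – using synchronized clocks to avoid message exchanges – consider a leader lease: if clocks are known to drift by at most some bound, a replica can serve reads locally until its lease, shrunk by that bound, expires, with no round trip to the leader. The names and the drift bound below are illustrative assumptions, not any particular system's implementation.

```python
import time

MAX_CLOCK_DRIFT = 0.010  # assumed bound on drift between any two clocks (seconds)

class LeaderLease:
    """Toy lease: while the lease (shrunk by the drift bound) is unexpired,
    this replica may answer reads locally instead of contacting the leader."""

    def __init__(self, granted_at: float, duration: float):
        self.expires_at = granted_at + duration

    def can_serve_locally(self) -> bool:
        # Subtract the drift bound so a fast local clock never trusts the lease too long.
        return time.time() < self.expires_at - MAX_CLOCK_DRIFT

# Example: a 2-second lease granted "now" allows local reads for roughly 1.99 seconds.
lease = LeaderLease(granted_at=time.time(), duration=2.0)
print(lease.can_serve_locally())  # True immediately after the grant
```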
Another interesting aspect to keep in mind while designing large distributed systems is that most existing algorithms, including Paxos and those used in Google Spanner, do not solve the Byzantine consensus problem [14]. Byzantine consensus is a formulation of the consensus problem that allows nodes to behave arbitrarily, making it possible to reason about difficult real-world conditions such as software bugs. One may have to explore blockchain [15] or related technologies to solve Byzantine consensus.
[1] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google’s Globally-Distributed Database. In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation (OSDI’12). USENIX Association, Berkeley, CA, USA, 251-264.
[2] Seth Gilbert and Nancy Lynch. 2002. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 33, 2 (June 2002), 51-59. DOI: https://doi.org/10.1145/564585.564601.
[3] Maurice P. Herlihy and Jeannette M. Wing. 1990. Linearizability: A Correctness Condition for Concurrent Objects. ACM Transactions on Programming Languages and Systems 12, 3 (July 1990), 463-492.
[4] Leslie Lamport. 1998. The Part-time Parliament. ACM Transactions on Computer Systems 16, 2 (May 1998), 133-169. DOI=http://dx.doi.org/10.1145/279227.279229
[5] Eric Brewer, “Spanner, TrueTime and the CAP Theorem”, available at https://research.google.com/pubs/pub45855.html
[6] Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini. 2011. Zab: High-performance broadcast for primary-backup systems. In Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems&Networks (DSN ’11). IEEE Computer Society, Washington, DC, USA, 245-256.
[7] Butler Lampson. 2001. The ABCD’s of Paxos. In Proceedings of the twentieth annual ACM symposium on Principles of distributed computing (PODC ’01). ACM, New York, NY, USA, 13-. DOI=http://dx.doi.org/10.1145/383962.383969.
[8] Diego Ongaro and John Ousterhout. 2014. In Search of an Understandable Consensus Algorithm. In Proceedings of the 2014 USENIX conference on USENIX Annual Technical Conference (USENIX ATC’14), Garth Gibson and Nickolai Zeldovich (Eds.). USENIX Association, Berkeley, CA, USA, 305-320.
[9] A. Ailijiang, A. Charapko and M. Demirbas, “Consensus in the Cloud: Paxos Systems Demystified,” 2016 25th International Conference on Computer Communication and Networks (ICCCN), Waikoloa, HI, 2016, pp. 1-10. doi: 10.1109/ICCCN.2016.7568499.
[10] Butler W. Lampson. 1981. Atomic Transactions. In Distributed Systems – Architecture and Implementation, An Advanced Course, Butler W. Lampson, M. Paul, and H. J. Siegert (Eds.). Springer-Verlag, London, UK, UK, 246-265.
[11] Spencer Kimball, “Living Without Atomic Clocks”, CockroachDB Blog, available from: https://www.cockroachlabs.com/blog/living-without-atomic-clocks/.
[12] Andrew Pavlo and Matthew Aslett. 2016. What’s Really New with NewSQL?. SIGMOD Record, 45, 2 (September 2016), 45-55. DOI: http://dx.doi.org/10.1145/3003665.3003674
[13] Barbara Liskov. 1991. Practical uses of synchronized clocks in distributed systems. In Proceedings of the tenth annual ACM symposium on Principles of distributed computing (PODC ’91). ACM, New York, NY, USA, 1-9. DOI=http://dx.doi.org/10.1145/112600.112601
[14] Leslie Lamport, Robert Shostak, and Marshall Pease. 1982. The Byzantine Generals Problem. ACM Transactions on Programming Languages and Systems 4, 3 (July 1982), 382-401. DOI=http://dx.doi.org/10.1145/357172.357176.
[15] Crosby, M., Pattanayak, P., Verma, S., & Kalyanaraman, V. (2016). Blockchain technology: Beyond bitcoin. Applied Innovation, 2, 6-10.
[16] Seth Proctor, “Google Cloud Spanner & NuoDB: A Comparison of Distributed, ACID-Compliant Database”, available from https://www.nuodb.com/techblog/google-cloud-spanner-nuodb-comparison-distributed-acid-compliant-databases.
Dr. Vijay Srinivas Agneeswaran has a Bachelor’s degree in Computer Science & Engineering from SVCE, Madras University (1998), an MS (By Research) from IIT Madras (2001), a PhD from IIT Madras (2008) and a post-doctoral research fellowship in the LSIR Labs, Swiss Federal Institute of Technology, Lausanne (EPFL). He has joined the data sciences team of SapientNitro as Director of Technology. He has spent the last ten years creating intellectual property and building big-data products at Oracle, Cognizant and Impetus. He has built PMML support into Spark/Storm and implemented several machine learning algorithms, such as LDA and Random Forests, over Spark. He led a team that designed and implemented a big-data governance product for role-based fine-grained access control inside Hadoop YARN. He and his team have also built the first distributed deep learning framework on Spark. He has been a professional member of the ACM and a Senior Member of the IEEE for the last 10+ years. He has four full US patents and has published in leading journals and conferences, including IEEE transactions. His research interests include distributed systems, data sciences, Big Data and other emerging technologies. He has been an invited speaker at several national and international conferences, such as O’Reilly’s Strata big-data conference series, and will also be speaking at the Strata conference in London in May 2017. He lives in Bangalore with his wife, son and daughter and enjoys researching the history and philosophy of Egypt, Babylonia, Greece and India.
Copyright © 2017, Vijay Srinivas Agneeswaran. All rights reserved.
Interesting Post, thanks for sharing.
I’m hoping the TrueTime API will eventually be open-sourced 🙂
Thanks for your comments, Jithendra. While it remains to be seen whether the TrueTime API will be open-sourced, there is already something available in open source: CockroachDB. CockroachDB is interesting, as it provides ACID properties at scale similar to Google Spanner (it provides serializability rather than the linearizability provided by Spanner).