Michael Stonebraker

The Case for Polystores

A federated DBMS is a middleware offering that runs on top of (perhaps several) local DBMSs and presents a seamless interface to disparate systems with (perhaps) independently constructed schemas. Systems in this category include R*, Ingres*, Garlic, IBM’s Information Integrator, and several others. These offerings should be contrasted with parallel DBMSs, which are single DBMSs with partitioned and/or replicated tables and a single schema. Parallel DBMSs are a multi-billion-dollar market, while federated DBMSs are a zero-billion-dollar market.

In this blog post, we consider the reasons for the previous failure of data federations, and then argue that a related construct, polystores, looks poised to have its “day in the sun”.

In the 1990s, enterprises started assembling customer-facing data from disparate data sources into a common data warehouse, and used Extract, Transform, and Load (ETL) tools to facilitate data conversion and loading. Sources typically required extensive data curation prior to loading. This included:

Extract (parse the source data structure)
Transform (for example, euros to dollars)
Clean (-99 often means null)
Integrate (for example, your wages is my salary)
Deduplicate (for example, I am M. Stonebraker in one data source and Michael R. Stonebraker in another)
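
The following is a minimal sketch of what a few of these curation steps might look like in code; the field names, the EUR-to-USD rate, and the -99 sentinel are invented purely for illustration.

```python
# Hypothetical curation of a single source record. Field names, the
# exchange rate, and the null sentinel are illustrative only.
EUR_TO_USD = 1.10      # assumed conversion rate for the example
NULL_SENTINEL = -99    # many legacy sources encode "unknown" this way

def curate(record):
    # Clean: map the sentinel value to a real null
    if record.get("wages") == NULL_SENTINEL:
        record["wages"] = None
    # Transform: convert euro amounts to dollars
    if record.get("currency") == "EUR" and record.get("wages") is not None:
        record["wages"] = round(record["wages"] * EUR_TO_USD, 2)
        record["currency"] = "USD"
    # Integrate: one source's "wages" is another source's "salary"
    if "wages" in record:
        record["salary"] = record.pop("wages")
    return record

print(curate({"name": "M. Stonebraker", "wages": 1000.0, "currency": "EUR"}))
```

Deduplication (matching "M. Stonebraker" with "Michael R. Stonebraker") is far harder than these mechanical steps and typically requires fuzzy matching and human review.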

Data curation is clearly an expensive proposition, and the fact that almost all enterprise data is quite dirty just makes this process more costly. Once a skilled programmer had finished this task, he typically loaded the resulting data into a data warehouse. It rarely made sense to return the curated data to the original source so that a federation system could perform cross-system queries at a later time. After all, the curated data has been transformed and cleaned, so it is no longer compatible with the original source data; it would only muck up whatever production system is interacting with that data.

As a result, the market for data warehouse products is huge and the market for federation products is near zero. In my opinion, this is about to change. Two factors are at work here.

1. Increased interest in disparate data. In the 1990s, enterprises were singularly focused on structured data. Today, there is interest in integrating structured data with text, web pages, semi-structured data, time series, etc. A case in point is the MIMIC II data set [1], which encompasses 26,000 patient days in the intensive care unit (ICU) of Beth Israel Hospital in Boston. This data set includes:
– real time data (time series from bedside monitoring devices)
– a historical archive of waveform data (from previous patients)
– patient metadata (typical structured data)
– doctors’ and nurses’ notes (text)
– prescription information (semi-structured data)

A small minority of this data is typical relational data; the rest is arrays, text, JSON, … Although it is possible to curate this data and then load it into a data warehouse, this tactic is not likely to work well because of the second factor at work.

2. One size does not fit all. There is no single DBMS that offers high performance on all of the kinds of MIMIC data. In round numbers, an RDBMS does fine on the structured data, but it is an order of magnitude or more away from the performance leader on the rest of the MIMIC data. Similar statements can be made for other DBMS architectures.

Hence, it makes sense to load the curated data into multiple DBMSs: for example, the structured data into an RDBMS, the real time data into a stream processing engine, the historical archive into an array engine, the text into Lucene, and the semi-structured data into a JSON system.
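
A minimal sketch of that routing decision, with stub engine handles standing in for the real systems (the class and method names are invented for illustration):

```python
# Hypothetical routing of curated data to specialized engines. The Engine
# class and its load() method stand in for whatever bulk-loading interface
# each real system (RDBMS, stream engine, array store, ...) provides.
class Engine:
    def __init__(self, name):
        self.name = name

    def load(self, dataset):
        print(f"loading {len(dataset)} records into {self.name}")

engines = {
    "structured":       Engine("RDBMS"),          # patient metadata
    "real_time":        Engine("stream engine"),  # bedside monitor feeds
    "waveform_archive": Engine("array engine"),   # historical waveforms
    "text":             Engine("Lucene"),         # doctors' and nurses' notes
    "semi_structured":  Engine("JSON store"),     # prescription records
}

def route(kind, dataset):
    engines[kind].load(dataset)

route("structured", [{"patient_id": 1, "age": 67}])
```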

Obviously, one does not want users to learn all of these disparate query languages, so a data federation seems like a good idea to lower the number of APIs one has to learn. We conjecture that as enterprises construct such broadly scoped applications, they will realize the need for multiple kinds of DBMSs, and a data federation is an obvious tactic to lower application complexity. We now continue with two tenets that such next-generation federations should obey.

Tenet 1: There is no query language Esperanto

Users performing complex analytics on waveform data are often wedded to array query notations, such as R or ArrayQL. Text search users invariably want a keyword-oriented fuzzy matching system, and SQL is the preferred interface for structured data.

Hence, a next-generation data federation must support multiple query notations. This characteristic distinguishes it from previous federations, which were invariably based only on SQL. Such a federation must support a user-facing abstraction, which we call an island of information, consisting of a query language, a data model, and shims to translate “island utterances” into the local dialect supported by each storage engine. An island must produce the same answer to any given query, even though the data may reside in any of several local storage engines. Such location transparency must be preserved by the middleware supporting any particular island.

In summary, multiple location-transparent islands are required, each supporting a query language. A local storage engine might belong to multiple islands.
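
A minimal sketch of the island abstraction under these definitions; the class names, shim functions, and catalog below are invented for illustration and are not the BigDAWG API.

```python
# An island owns a query notation, one shim per member engine, and a catalog
# mapping object names to the engine holding them. Everything here is an
# illustrative stub, not a real polystore interface.
def to_sql(island_query):             # shim: island utterance -> relational dialect
    return f"/* SQL dialect */ {island_query}"

def to_array_dialect(island_query):   # shim: island utterance -> array-engine dialect
    return f"/* array dialect */ {island_query}"

class Island:
    def __init__(self, name, shims, catalog):
        self.name = name
        self.shims = shims            # engine name -> translation function
        self.catalog = catalog        # object name -> engine that stores it

    def run(self, island_query, target_object):
        # Location transparency: the island, not the user, decides which
        # engine holds the object and rewrites the query in its dialect.
        engine = self.catalog[target_object]
        local_query = self.shims[engine](island_query)
        return engine, local_query    # a real island would dispatch and merge results

relational_island = Island(
    "relational",
    shims={"postgres": to_sql, "array_store": to_array_dialect},
    catalog={"patient_metadata": "postgres", "waveforms": "array_store"},
)
print(relational_island.run("select avg(hr) from waveforms", "waveforms"))
```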

Tenet 2: Complete functionality of underlying DBMSs is required

It is appealing to construct each island with “intersection semantics”; in other words, to construct an island query language from the set of utterances that can be successfully mapped onto all of the underlying storage systems in a location-transparent manner. Unfortunately, this results in islands with less functionality than their underlying constituent systems. A second tenet of next-generation federations is that one cannot give up local DBMS functionality. Put differently, existing applications must continue to run when a storage engine joins a federation. This argues strongly for the inclusion of degenerate islands, each consisting of a single local engine and its query language.
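
The contrast can be made concrete with a small sketch; the per-engine operation sets below are hypothetical, chosen only to show how much functionality intersection semantics gives up.

```python
# Hypothetical operation sets for three member engines of an island.
engine_ops = {
    "rdbms":        {"select", "join", "group_by", "window_functions"},
    "array_engine": {"select", "slice", "regrid", "group_by"},
    "text_engine":  {"select", "keyword_match", "fuzzy_match"},
}

# Intersection semantics: only utterances every member engine can execute.
intersection_ops = set.intersection(*engine_ops.values())
print(intersection_ops)        # {'select'} -- far less than any single engine offers

# Degenerate island: a single engine and its full query language, so existing
# applications written against that engine continue to run unchanged.
degenerate_rdbms_ops = engine_ops["rdbms"]
print(degenerate_rdbms_ops)
```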

As a result of these two tenets, we expect future federations to have a multiplicity of location-transparent islands of information, with local storage engines in multiple islands. We define polystores to be systems obeying tenets 1 and 2, to distinguish them from previous data federations.

In the rest of this blog post, we briefly discuss research issues with polystores. Our approach to some of these topics in the Intel-sponsored BigDAWG project [2] is discussed in [3].

Research issues with polystores

Query optimization. Previous federation work has assumed a cost-based model for a middleware optimizer. However, this requires the optimizer to know the characteristics of each operation in each local DBMS. This amount of required knowledge seems unreasonable, and requires the optimizer to change when a new engine joins an island. A “black box” approach makes a lot more sense when coping with disparate underlying engines.
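
One plausible reading of the “black box” approach is to time a candidate subquery over a small data sample on each engine and extrapolate, rather than maintaining a per-engine cost model. The sketch below invents all of its names and its sampling scheme for illustration.

```python
import time

class EngineHandle:
    """Stub connection to one local engine; simulated_seconds stands in for
    how long the engine would take to run the full subquery."""
    def __init__(self, name, simulated_seconds):
        self.name = name
        self._sim = simulated_seconds

    def run(self, subquery, sample_fraction):
        time.sleep(self._sim * sample_fraction)   # placeholder for real execution

def estimate_cost(engine, subquery, sample_fraction=0.01):
    start = time.perf_counter()
    engine.run(subquery, sample_fraction)
    # Crude linear extrapolation from the sampled run to the full data set.
    return (time.perf_counter() - start) / sample_fraction

def choose_engine(engines, subquery):
    # The optimizer treats every engine identically, so a new engine can
    # join an island without any change to the optimizer itself.
    return min(engines, key=lambda e: estimate_cost(e, subquery))

candidates = [EngineHandle("rdbms", 2.0), EngineHandle("array_engine", 0.5)]
print(choose_engine(candidates, "avg(hr) over waveforms").name)   # array_engine
```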

Automatic shim construction. A local engine needs a shim for each island in which it participates. Work is required to make shim construction much less manual than it is today.

Automatic copy. Obviously, full replication of all data in every underlying engine is infeasible. Hence, every engine in an island needs to be able to copy data to and from other engines in the island. Moreover, every island needs to be able to exchange data with other islands. Constructing high-speed copy systems semi-automatically is clearly desirable.
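
A minimal sketch of one copy path, shipping rows through a neutral interchange format (CSV); a real polystore would favor each engine’s native high-speed bulk export and import. The stub classes are illustrative.

```python
import csv, io

class StubEngine:
    """Stands in for a local engine that can export and bulk-load rows."""
    def __init__(self):
        self.rows = []

    def export_csv(self):
        buf = io.StringIO()
        csv.writer(buf).writerows(self.rows)
        return buf.getvalue()

    def import_csv(self, text):
        self.rows.extend(csv.reader(io.StringIO(text)))

def copy(src, dst):
    # Materialize on the source, ship the bytes, bulk-load on the destination.
    dst.import_csv(src.export_csv())

rdbms, array_engine = StubEngine(), StubEngine()
rdbms.rows = [["patient_1", 67], ["patient_2", 54]]
copy(rdbms, array_engine)
print(array_engine.rows)     # [['patient_1', '67'], ['patient_2', '54']]
```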

Distributed transactions. Support is required for distributed transactions across local engines that may not share the same transaction model. It is not clear how to do this.

Automatic load balancing and provisioning. It must be possible to balance the load across the local engines in a polystore by judiciously replicating or moving data objects. Obviously a monitoring system is required to support such a feature.

Novel user interfaces. The interactions with a polystore are not likely to conform to typical historical patterns. For example, a frequent query to a polystore will be “tell me something interesting”. Put differently, if I knew what query to ask, I would ask it, but I don’t. Polystores would be well served by new data mining, visualization, and browsing user interfaces.

In summary, we see a resurgence of interest in data federations, as applications get deployed on multiple local engines. Based on the two tenets above, we expect polystores to be the successful model for federation middleware. We also see a collection of challenging research issues with supporting polystores which obey tenets 1 and 2.

References

[1] Saeed, M., Villarroel, M., Reisner, A. T., Clifford, G., Lehman, L.-W., Moody, G., et al. (2011). Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): A public-access intensive care unit database. Critical Care Medicine, 39(5), 952–960.

[2] Elmore, A. J., Duggan, J., Stonebraker, M., Balazinska, M., Cetintemel, U., Gadepally, V., et al. (2015). A Demonstration of the BigDAWG Polystore System. Proceedings of the VLDB Endowment, 8(12).

[3] Duggan, J., Elmore, A. J., Stonebraker, M., Balazinska, M., Howe, B., Kepner, J., et al. (2015). The BigDAWG Polystore System. ACM SIGMOD Record, 44(3).

Blogger Profile:

Michael Stonebraker is an adjunct professor at MIT CSAIL. He was the main architect of the INGRES relational DBMS; the object-relational DBMS, Postgres; and the federated data system, Mariposa; and has started nine start-up companies to commercialize these database technologies and, more recently, Big Data technologies (Vertica, VoltDB, Paradigm4, Tamr). He is the author of scores of papers in this area. He was recently elected to the National Academy of Engineering and the American Academy of Arts and Sciences. He has won the Association for Computing Machinery’s (ACM) A.M. Turing Award, often referred to as “the Nobel Prize of computing.”

Copyright @ 2015, Michael Stonebraker, All rights reserved.
