July 13, 2015
In this blog post, we consider the reasons for the previous failure of data federations, and then argue that a related construct, polystores, looks poised to have its “day in the sun”.
– Extract (parse the source data structure)
– Transform (for example, euros to dollars)
– Clean (-99 often means null)
– Integrate (for example, your wages is my salary)
– Deduplicate (for example, I am M. Stonebraker in one data source and Michael R. Stonebraker in another)
As a result, the market for data warehouse products is huge and the market for federation products is near zero. In my opinion, this is about to change. Two factors are at work here.
A small minority of this data is typical relational data; the rest is arrays, text, JSON, … Although it is possible to curate this data and then load it into a data warehouse, this tactic is not likely to work well because of the second factor at work.
2. One size does not fit all. There is no DBMS that offers high performance on all of the kinds of Mimic data. In round numbers, an RDBMS does fine on structured data, but is an order of magnitude or more away from the performance leader on the rest of the Mimic data. Similar statements can be made for other DBMS architectures.
Obviously, one does not want a user to learn these disparate query languages, so a data federation seems like a good idea to lower the number of disparate APIs one has to learn. We conjecture that as enterprises construct such broadly-scoped applications, they will realize the need for multiple kinds of DBMSs. A data federation is an obvious tactic to lower application complexity. We now continue with two tenets that such next generation federations should obey.
Hence, a next-generation data federation must support multiple query notations. This characteristic distinguishes them from previous federations, which were invariably based only on SQL. Such a federation must support a user-facing abstraction which we call an island of information, consisting of a query language, a data model, and shims to translate “island utterances” into the local dialect supported by each storage system. An island must produce the same answer to any given query, even though the data may reside in perhaps multiple local storage engines. Such location transparency must be preserved by the middleware supporting any particular island.
In summary, multiple location-transparent islands are required, each supporting a query language. A local storage engine might belong to multiple islands.
As a result of these two tenets, we expect future federations to have a multiplicity of location-transparent islands of information, with local storage engines in multiple islands. We define polystores to be systems obeying tenets 1 and 2, to distinguish them from previous data federations.
In the rest of this blog post, we briefly discuss research issues with polystores. Our approach to some of these topics in the Intel-sponsored BigDAWG project [2] is discussed in [3].
Query optimization. Previous federation work has assumed a cost-based model for a middleware optimizer. However, this requires the optimizer to know the characteristics of each operation in each local DBMS. This amount of required knowledge seems unreasonable, and requires the optimizer to change when a new engine joins an island. A “black box” approach makes a lot more sense when coping with disparate underlying engines.
Automatic shim construction. A local engine needs a shim for each island in which it participates. Work is required to make shim construction much less manual than it is today.
Automatic copy. Obviously full replication of all data in every underlying engine is infeasible. Hence, every engine in an island needs to be able to copy data to and from other engines in the island. Moreover, every island needs to be able to exchange data with other islands. Constructing high-speed copy systems semi-automatically is clearly desirable.
Distributed transactions. Constructing distributed transactions across local engines that may not support the same local transaction model is required. It is not clear how to do this.
Automatic load balancing and provisioning. It must be possible to balance the load across the local engines in a polystore by judiciously replicating or moving data objects. Obviously a monitoring system is required to support such a feature.
Novel user interfaces. The interactions with a polystore are not likely to conform to typical historical patterns. For example, a frequent query to a polystore will be “tell me something interesting”. Put differently, if I knew what query to ask, I would ask it, but I don’t. Polystores would be well served by new data mining, visualization, and browsing user interfaces.
In summary, we see a resurgence of interest in data federations, as applications get deployed on multiple local engines. Based on the two tenets above, we expect polystores to be the successful model for federation middleware. We also see a collection of challenging research issues with supporting polystores which obey tenets 1 and 2.
References
[1] Saeed, M., Villarroel, M., Reisner, A. T., Clifford, G., Lehman, L.-W., Moody, G., et al. (2011). Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): A public-access intensive care unit database. Critical Care Medicine, 39(5), 952–960.
[2] Elmore, A. J., Duggan, J., Stonebraker, M., Balazinska, M., Cetintemel, U., Gadepally, V., et al. (2015). A Demonstration of the BigDAWG Polystore System. Proceedings of the VLDB Endowment, 8(12).
[3] Duggan, J., Elmore, A. J., Stonebraker, M., Balazinska, M., Howe, B., Kepner, J., et al. (2015). The BigDAWG Polystore System. ACM Sigmod Record, 44(3).
Michael Stonebraker is an adjunct professor at MIT CSAIL. He was the main architect of the INGRES relational DBMS; the object-relational DBMS, Postgres; and the federated data system, Mariposa; and has started nine start-up companies to commercialize these database technologies and, more recently, Big Data technologies (Vertica, VoltDB, Paradigm4, Tamr). He is the author of scores of papers in this area. He was recently elected to the National Academy of Engineering and the American Academy of Arts and Sciences. He has won the Association for Computing Machinery’s (ACM) A.M. Turing Award, often referred to as “the Nobel Prize of computing.”
Copyright @ 2015, Michael Stonebraker, All rights reserved.
Comments are closed
I think there are already some open source projects such as Apache Drill https://drill.apache.org/ – see here for more http://blog.matthewrathbone.com/2014/06/08/sql-engines-for-hadoop.html