Big Data Is Opening the Door To Revolutions: Databases Should Be Next

Big Data

Most professional fields, whether in business or academia, rely on data and have done so for centuries. In the digital age and with the emergence of Big Data, this dependency is growing dramatically – perhaps out of proportion to its current value given the concepts, tools, and techniques presently available. For example, how do you tell whether the results of a data-intensive analysis are correct and reliable rather than weak or even spurious? Most data-intensive disciplines have statistical measures that attempt to quantify meaning or truth. Efficacy quantifies the strength of a relationship within a system, such as biology or business. For example, when researchers investigate a new drug, they compare its effectiveness to a placebo, using statistics to determine whether the drug worked. This approach, in which data selection and processing are predicated on complex models rather than simple comparison, is a far cry from select-project-join queries.


Efficacy is the capacity to produce a desired result or effect. In medicine, it is the ability of an intervention or drug to produce an outcome. P-values have been a conventional empirical metric of efficacy for 100 years.
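To make this concrete, here is a minimal sketch of the kind of efficacy calculation described above: a two-sample t-test comparing entirely synthetic “drug” and “placebo” outcomes. The numbers, group sizes, and the choice of Python and SciPy are ours, for illustration only.

```python
# Minimal, hypothetical illustration of an efficacy measure: a two-sample
# t-test on synthetic outcomes for a made-up drug group and a placebo group.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
placebo = rng.normal(loc=100.0, scale=15.0, size=50)  # synthetic outcome scores
drug = rng.normal(loc=92.0, scale=15.0, size=50)      # synthetic outcome scores

# Welch's t-test: does the drug group differ from placebo more than chance would suggest?
t_stat, p_value = stats.ttest_ind(drug, placebo, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value means a difference this large would be unlikely if the drug
# had no effect -- the conventional empirical gauge of efficacy mentioned above.
```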


Moreover, the underlying data in these fields is complex, uncertain, and multimodal. Despite a large body of research on data management for science applications, there has been little adoption of relational techniques in the science disciplines. In this post, we examine two challenges. First, modeling data around domain-specific efficacy rather than set theory. Second, support for ensembles of data models to enable many perspectives on a single data set.

The big picture is compelling. Since the late 1980s, one of us has estimated in papers and keynotes that databases contain less than 10% of the world’s data, a share that is dropping fast as non-database data growth explodes. A correspondingly small fraction of the world’s applications – their data and computation – is amenable to traditional databases. Modelling the other 90% opens the door for the database community to the requirements of the rest of the world’s data and to a new, vastly larger generation of database research and technology. This calls for a shift in our community commensurate with the profound changes introduced by Big Data.

Efficacy first, then efficiency

Since meaning and truth are relative to a system, efficacy measures quantify accuracy, correctness, precision, and significance with respect to a context. That we can compute an answer efficiently – at lightning speed over massive data sets – is entirely irrelevant or even harmful if we cannot demonstrate that the answer is meaningful, or at least approximately right, in a given context. As fields develop and complexity increases, efficacy measures become increasingly sophisticated, refined, and debated. For example, p-values – the gold standard of empirical efficacy – have been questioned for decades, especially under the pressure of increasing irreproducibility in science. The same is true for precision and recall in information retrieval. In fact, since most fields that depend on data involve uncertainty, measures of efficacy are being questioned everywhere, with the notable exception of data management.

Big Data, broadly construed, is inherently multidisciplinary but often lacks the efficacy measures of its constituent disciplines – statistics, machine learning, empiricism, data mining, information retrieval, among others – let alone those of application domains such as finance, biology, clinical studies, high-energy physics, drug discovery, and astrophysics. One reason for this is that efficacy measures developed in the small data world, based on statistics and other fields, do not necessarily hold true over massive data sets [3][5]. Efficacy in this context is an important, open, and rich research challenge. The value and success of data-intensive discovery (Big Data) depends on achieving adequate means of evaluating the efficacy of its results. A notable exception is the Baylor-Watson result [7], which focused first on efficacy, i.e., modeling, and only then turned to efficiency. But efficacy is one aspect of a larger challenge – modeling.


Relational data is the servant of the data model and the query. It was right to constrain data when we had a well-defined model. And we could always get the model right – right?


As data management evolved, it distinguished itself from information retrieval by not requiring efficacy measures, since databases were bounded, discrete, and complied with well-defined models, e.g., schemas. In contrast, information retrieval (and later machine learning) searched data sets for complex correlations rather than rigidly defined predicates. Finding relationships like “select all pairs where their covariance is greater than x” is inherently iterative and compute-intensive. In contrast, the contents of a database either match a query or they do not – black or white. No need for estimating accuracy, confidence, or probabilities. Relational data is the servant of the data model and the query. This permitted massive performance improvements that led to the widespread adoption of databases in applications for which schemas made sense. If the data did not comply with the schema, and with the queries defined over it, then the data was erroneous by definition and should be rejected or corrected. It was right to constrain data when we had a well-defined model. And we could always get the model right – right? Many courageous researchers over the past 50 years have studied this problem, with approaches including probabilistic databases and fuzzy logic (and more [1]), but none has seen widespread adoption. Why?
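As a toy illustration of that contrast, the sketch below runs an exact relational-style predicate and then an iterative, all-pairs covariance search over the same table. The data is synthetic, and the column names, threshold, and use of pandas are our own invention.

```python
# Toy contrast on synthetic data: an exact predicate vs. an all-pairs
# covariance search like the one quoted above. Column names are invented.
import itertools
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.default_rng(0).normal(size=(1000, 5)),
                  columns=list("ABCDE"))

# Relational-style query: each row either matches the predicate or it does not.
exact_matches = df[df["A"] > 1.5]

# Correlation-style search: iterate over every column pair, estimate a
# statistic, and keep the pairs whose covariance exceeds a threshold x.
x = 0.05
pairs = [(a, b, np.cov(df[a], df[b])[0, 1])
         for a, b in itertools.combinations(df.columns, 2)]
strong = [(a, b, c) for a, b, c in pairs if c > x]

print(len(exact_matches), strong)
```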

While the non-database world – life sciences, high-energy physics, astrophysics, finance – opened the door to Big Data and its possibilities, the data management world is aspiring to take ownership of their infrastructure – the storage, management, manipulation, querying, and searching of massive datasets. Currently much of this work is done in an ad-hoc manner using tools like R and Python. What is required for a more general solution? The non-database world is driven by applications – solving problems with real-world constraints – achieving efficacy within the models and definitions of their domain – often with 400 or 500 years of history.

In contrast, the database landscape is predominantly concerned with efficiency and has not yet dealt head-on with efficacy. Some of these issues have been addressed in the database context in terms of specific models, languages, and design, but seldom have those concerns impacted the core database infrastructure, let alone gained adoption. Perhaps database researchers focused only on application domains that are well-behaved. While efficacy is a critical requirement – possibly the most critical requirement – in domains that make extensive use of data, it is part of a broader modelling requirement that database systems leave unmet.


For more than a decade physics, astrophysics, photonics, biology, indeed most physical sciences as well as statistics and machine learning have made the modest assumption that multiple perspectives may be more valuable than a single model.


Data Models for the 90%

The database community, like many others, perhaps has not fully internalized the paradigm shift from small to big data. Big Data – or data per se – does not create the change, nor is it itself the change. Big Data opens the door to a revolution in thinking. One aspect involves data-driven methods. A more profound shift involves viewing phenomena from multiple perspectives simultaneously.

A significant aspect of this shift is that every Big Data activity (small data activities also, but with less impact) requires measures of efficacy for each perspective or model. This is not simply owing to the reframing of corresponding principles from empirical science, but also to the multiple meanings of data, each of which requires mechanisms for addressing efficacy.

If the data management community is to provide solutions for this nascent challenge, then it will need to deal with efficacy. This essentially has to do with modelling, a chaotic and ad-hoc database topic that has been largely unsuccessful, again measured by adoption. The relational model has dominated databases for over 40 years largely owing to efficiency. The database community knows how to optimize anything expressed relationally. While the relational model has proven to be amazingly general, its adoption has been limited in many domains, especially the sciences.

A related limitation of the database world is the assumption of a single perspective, e.g., a single version of truth: one schema per database, even with multiple views. For more than a decade physics, astrophysics, photonics, biology, indeed most physical sciences as well as statistics and machine learning have made the modest assumption that multiple perspectives may be more valuable than a single model.

In [8] the author argued that science undergoes paradigm shifts only when there are rival theories about the fundamentals of a discipline. It is his position that rival paradigms are incommensurable, using entirely different concepts and evaluation metrics from one another. One such example is the wave and particle theories of light. Each has entirely different models and measures of efficacy. Understanding the big picture necessitates finding consistencies and anomalies in both theories.

Ensemble models are one approach to addressing this challenge. Let’s consider an example in evolutionary biology, where researchers use a collection of models to learn how the human genome has changed over time. In [4] the authors identified examples of positive natural selection in recent human populations. Their discoveries have two parts: the affected gene’s location and its (improved) mutation. By composing many signals of natural selection, the authors increase the resolution of their genomic map by up to 100x. This research computes genetic signals at many levels, from clustering genes that are likely to be inherited together to examining the high-level geographic distribution of different mutations. In present database modeling, the former might be represented as a graph database, whereas the latter is more likely to fall into the purview of geospatial databases. How can we bring them together? Perhaps neither of these models is designed for computing how effective different genetic variations are at producing advantageous traits. This pattern repeats itself in meteorology, physics, and a myriad of other domains that mathematically model large, dynamic systems.
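To suggest what two such perspectives on one dataset might look like side by side, here is a purely hypothetical sketch: a handful of invented variant records modeled once as a linkage graph and once as a geographic aggregation. The fields, values, and the use of pandas and NetworkX are our assumptions, not anything taken from [4].

```python
# Hypothetical sketch: one set of variant records (invented fields and values)
# viewed through two model "perspectives" -- a linkage graph and a simple
# geographic aggregation. Neither view *is* the dataset; both are models of it.
import pandas as pd
import networkx as nx

variants = pd.DataFrame([
    {"variant": "v1", "linked_to": "v2",  "region": "East Asia",   "frequency": 0.31},
    {"variant": "v2", "linked_to": "v3",  "region": "East Asia",   "frequency": 0.28},
    {"variant": "v3", "linked_to": None,  "region": "West Africa", "frequency": 0.07},
])

# Perspective 1: a graph model of variants likely to be inherited together.
linkage = nx.Graph()
linkage.add_edges_from(
    (row.variant, row.linked_to)
    for row in variants.itertuples() if row.linked_to is not None
)
clusters = list(nx.connected_components(linkage))

# Perspective 2: a geographic model of where each mutation is common.
by_region = variants.groupby("region")["frequency"].mean()

print(clusters, by_region.to_dict())
```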


Stepping into the void of uncertainty, unboundedness, ensemble models, and open-ended model exploration is far harder and scarier. We call it Computing Reality.


Ensemble models pose substantial challenges to the data management community. How do you simultaneously store, manage, query, and update such a variety of models, applied to a single dataset with many, possibly conflicting schemas? Database folks may first be concerned about doing this efficiently. Nope – wrong question. The first step is to understand the problem, to ask the right questions, to get the model correct, and only then to make it efficient. How do you support ensemble models and their requirements, including efficacy?

This may be why application domains that use massive data sets have grown their own data management tools, such as Hadoop, ADAM, Wikidata, and Scientific Data Management Systems, let alone a plethora of such tools in most physical science communities that the database community has never heard of. It’s not just that their data does not fit the relational model; databases do not support ensemble models, efficacy, or many of the fundamental concepts used to understand data. Why would any application domain (e.g., physical sciences, clinical studies, drug discovery) or discipline (e.g., information retrieval, machine learning, statistics) want to partner with an infrastructure technology that did not support its basic principles?

The database community has developed amazing technology that has changed the world. Since the early 1990s it has extended its models to non-relational models for networks, text, graphs, arrays, and many more. But efficacy is not just an issue of expressing eScience applications relationally, as UDFs, or in R; it is about modeling and computing hypotheses under the complex contexts defined by domain experts, none of which map easily to set theory or other discrete mathematics. Stepping into the void of uncertainty, unboundedness, ensemble models, and open-ended model exploration is far harder and scarier. We call it Computing Reality [1][6].

Big Data is opening the door to a paradigm shift in many human endeavors. Machine learning was first through the door with real, albeit preliminary, results, and it is already on to the next generation with deep learning [3]. Analytics and other domains are riding the wave of machine learning. The database community is heading for the door now, but it will be challenging. We first have to understand the problem and get the requirements right. To paraphrase Ron Fagin, we need to focus on asking the right questions. The rest may be a breeze – but efficacy before efficiency!

So not only are we leaving the relational world that was dominated by one model, or a class of discrete models; we are also leaving the world of a single model per dataset and embarking on a journey into a world of ensemble models, including probabilistic, fuzzy, and potentially even the richest model of them all, a model-free approach that lets us listen to the data. All at scale. This seems scary to us, but also just what we need.

Are we crazy, naive? Isn’t it our mission to dig into this data goldmine, to contribute to accelerating scientific discovery? What do you think? We are all ears.

References
[1] Brodie, Michael and Jennie Duggan. What versus Why. Towards Computing Reality, April 17, 2014, published April 22, 2014

[2] Duggan, Jennie and Michael L. Brodie, Hephaestus: Virtual Experiments for Data-Intensive Science, In CIDR 2015 (to appear)

[3] Gomes, Lee. Machine-Learning Maestro Michael Jordan on the Delusions of Big Data and Other Huge Engineering Efforts, IEEE Spectrum, 20 Oct 2014

[4] Grossman, Sharon R., et al. “A composite of multiple signals distinguishes causal variants in regions of positive selection.” Science 327.5967 (2010): 883-886.

[5] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013

[6] Piatetsky, Gregory, Interview: Michael Brodie, leading database researcher, industry leader, thinker. SIGKDD Explor. Newsl. 16, 1 (September 2014), 57-63. DOI=10.1145/2674026.2674035 http://doi.acm.org/10.1145/2674026.2674035

[7] Scott Spangler, Angela D. Wilkins, Benjamin J. Bachman, Meena Nagarajan, Tajhal Dayaram, Peter Haas, Sam Regenbogen, Curtis R. Pickering, Austin Comer, Jeffrey N. Myers, Ioana Stanoi, Linda Kato, Ana Lelescu, Jacques J. Labrie, Neha Parikh, Andreas Martin Lisewski, Lawrence Donehower, Ying Chen, and Olivier Lichtarge. 2014. Automated hypothesis generation based on mining scientific literature. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ’14). ACM, New York, NY, USA, 1877-1886. DOI=10.1145/2623330.2623667 http://doi.acm.org/10.1145/2623330.2623667

[8] Kuhn, Thomas S. The structure of scientific revolutions. University of Chicago press, 2012.

Bloggers’ Profiles:

Dr. Brodie has over 40 years experience in research and industrial practice in databases, distributed systems, integration, artificial intelligence, and multi-disciplinary problem solving. He is concerned with the Big Picture aspects of information ecosystems including business, economic, social, application, and technical. Dr. Brodie is a Research Scientist, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology; advises startups; serves on Advisory Boards of national and international research organizations; and is an adjunct professor at the National University of Ireland, Galway. For over 20 years he served as Chief Scientist of IT, Verizon, a Fortune 20 company, responsible for advanced technologies, architectures, and methodologies for Information Technology strategies and for guiding industrial scale deployments of emergent technologies. His current research and applied interests include Big Data, Data Science, and data curation at scale and a related start up Tamr.com. He has served on several National Academy of Science committees. Dr. Brodie holds a PhD in Databases from the University of Toronto 
and a Doctor of Science (honoris causa) from the National University of Ireland.

Jennie Duggan is a postdoctoral associate at MIT CSAIL working with Michael Stonebraker and an adjunct assistant professor at Northwestern University. She received her Ph.D. from Brown University in 2012 under the supervision of Ugur Cetintemel. Her research interests include scientific data management, database workload modeling, and cloud computing. She is especially focused on making data-driven science more accessible and scalable.

Copyright © 2014, Michael Brodie, Jennie Duggan. All rights reserved.


3 Comments

  • Haim Kilov on November 10, 2014

    So non-trivial modelling problems exist not only in complex phenomena of mind, life, and society (F.A. Hayek, “The theory of complex phenomena”), but also in other, perhaps related, domains of human endeavour. Thank you for emphasizing this.

    As to probabilistic models, note Neuman’s [1] statement that “chance is not an explanation but our way of conceptualizing ignorance” applied by him, for example, to “random” mutations and applied (earlier and independently) by Hayek [2] to the use of statistics “when we deliberately ignore or are ignorant of any structure into which [individual elements] are organized”. Such abuse of (quantitative) mathematics has been around not only in biology and medicine but also in economics and elsewhere, often with serious negative consequences. In particular, those who overemphasize data mining may want to carefully consider Neuman’s remark that extracting meaning from a sentence is not a simple task of identifying statistical regularities but a process of interpretation in context.

    1. Y. Neuman. Reviving the Living: Meaning Making in Living Systems. Elsevier, NY, 2008.
    2. F.A. Hayek. The theory of complex phenomena. In: The critical approach to science and philosophy (In honor of Karl R. Popper). (Ed. M. Bunge). The Free Press of Glencoe, 1964, pp. 332-349.

  • Jennie Duggan on November 11, 2014

    Hi Haim,

    You are right that we need to not rely too heavily on data mining. Discoveries are only meaningful within a context, and semantics like that are notoriously hard to put into a model. Thank you for the references.

  • Stephen Osei-Akoto on November 26, 2014

    I like the comments pointing out that over-reliance on data mining should not set the tone for scientific investigation, even though it is a factor we need to consider in defining variables. I’m also tempted to say that the Big Data or data mining concept in industry and academia seems to differ, based on what people are saying.
