Big Data, and its 4 Vs – volume, velocity, variety, and veracity – have been at the forefront of societal, scientific and engineering discourse. Arguably the most important 5th V, value, is not talked about as much. How can we make sure that our data is not just big, but also valuable? WebDB 2015, the premier workshop on Web and Databases, focuses on this important topic this year. To set the stage, we have interviewed several prominent members of the data management community, soliciting their opinions on how we can ensure that data is not just available in quantity, but also in quality.
We interviewed Serge Abiteboul (INRIA Saclay & ENS Cachan), Oren Etzioni (Allen Institute for Artificial Intelligence), Divesh Srivastava (AT&T Labs-Research) with Luna Dong (Google Inc.), and Gerhard Weikum (Max Planck Institute for Informatics). We asked them about their motivation for doing research in the area of data quality, their current work, and their view on the future of the field.
Serge Abiteboul is a Senior Researcher at INRIA Saclay, and an affiliated professor at Ecole Normale Supérieure de Cachan. He obtained his Ph.D. from the University of Southern California, and a State Doctoral Thesis from the University of Paris-Sud. He was a Lecturer at the École Polytechnique and Visiting Professor at Stanford and Oxford University. He was Chair Professor at Collège de France in 2011-12 and Francqui Chair Professor at Namur University in 2012-2013. He co-founded the company Xyleme in 2000. Serge Abiteboul received the ACM SIGMOD Innovation Award in 1998, the EADS Award from the French Academy of Sciences in 2007, the Milner Award from the Royal Society in 2013, and a European Research Council Fellowship (2008-2013). He became a member of the French Academy of Sciences in 2008, and a member of the Academy of Europe in 2011. He is a member of the Conseil national du numérique. His research work focuses mainly on data, information and knowledge management, particularly on the Web.
What is your motivation for doing research on the value of Big Data?
My experience is that it is getting easier and easier to get data but if you are not careful all you get is garbage. So quality is extremely important, never over-valued and certainly relevant. For instance, with some students we crawled the French Web. If you crawl naively, it turns out that very rapidly all the URLs you try to load are wrong, meaning they do not correspond to real pages, or they return pages without real content. You need to use something such as PageRank to focus your resources on relevant pages.
So then what is your current work for finding the equivalent of “relevant pages” in Big Data?
I am working on personal information, where very often the difficulty is to get the proper knowledge and, for instance, to correctly align entities from different sources. My long-term goal, which I am pursuing for instance with Amélie Marian, is the construction of a Personal Knowledge Base that gathers all the knowledge someone can get about his/her life. For each one of us, such knowledge has enormous potential value, but for the moment it lives in different silos and we cannot get this value.
This line of work is not purely technical, but involves societal issues as well. We are living in a world where companies and governments have loads of data on us and we don’t even know what they have and how they are using it. Personal Information Management is an attempt to rebalance the situation, and make personal data more easily accessible to the individuals. I have a paper on Personal Information Management Systems, talking about just that, to appear in CACM (with Benjamin André and Daniel Kaplan).
And what is your view of the killer app of Big Data?
Relational databases were a big technical success in the 1970s-80s. Recommendation of information was a big one in the 1990s-2000s, from PageRank to social recommendation. After data, after information, the next big technical success is going to be “knowledge”, say in the 2010s-20s :). It is not an easy sell because knowledge management has often been disappointing – not delivering on its promises. By knowledge management, I mean systems capable of acquiring knowledge at a large scale, reasoning with this knowledge, and exchanging knowledge in a distributed manner. I mean techniques such as those used at Berkeley with Bud or at INRIA with Webdamlog. To build such systems, beyond scale and distribution, we have to solve quality issues: the knowledge is going to be imprecise, possibly missing, with inconsistencies. I see knowledge management as the next killer app!
Oren Etzioni is Chief Executive Officer of the Allen Institute for Artificial Intelligence. He has been a Professor at the University of Washington’s Computer Science department starting in 1991, receiving several awards including GeekWire’s Geek of the Year (2013), the Robert Engelmore Memorial Award (2007), the IJCAI Distinguished Paper Award (2005), AAAI Fellow (2003), and a National Young Investigator Award (1993). He was also the founder or co-founder of several companies including Farecast (sold to Microsoft in 2008) and Decide (sold to eBay in 2013), and the author of over 100 technical papers that have garnered over 22,000 citations. The goal of Oren’s research is to solve fundamental problems in AI, particularly the automatic learning of knowledge from text. Oren received his Ph.D. from Carnegie Mellon University in 1991, and his B.A. from Harvard in 1986.
Oren, how did you get started in your work on Big Data?
I like to say that I’ve been working on Big Data from the early days when it was only “small data”. Our 2003 KDD paper on predictive pricing started with a data set with 12K data points. By the time Farecast was sold to Microsoft, in 2008, we were approaching a trillion labeled data points. Big price data was the essence of Farecast’s predictive model, and had the elegant property that it was “self labeling”. That is, to label the airfare on a flight from Seattle to Boston with either a “buy now” or a “wait” label, all we have to do is monitor the price movement over time to determine the appropriate label. 20/20 hindsight allows us to produce labels automatically. But for Farecast, and other applications of Big Data, the labeled data points are only part of the story. Background knowledge, reasoning, and more sophisticated semantic models are necessary to take predictive accuracy to the next level.
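To make the “self labeling” idea concrete, here is a minimal Python sketch. It is not Farecast’s actual model: the function name `self_label`, the `horizon` and `min_drop` parameters, and the fare values are all assumptions made up for illustration. The only point is that the label for each past observation can be read off from what the price did afterwards.

```python
from typing import List


def self_label(prices: List[float], horizon: int = 7, min_drop: float = 0.0) -> List[str]:
    """Label each fare observation using hindsight only.

    prices[i] is the fare observed on day i (hypothetical data).
    An observation is labeled "wait" if a fare cheaper by more than
    `min_drop` shows up within the next `horizon` days, and "buy now"
    otherwise. Observations near the end see a shorter lookahead window.
    """
    labels = []
    for i, price in enumerate(prices):
        future = prices[i + 1 : i + 1 + horizon]
        if future and min(future) < price - min_drop:
            labels.append("wait")
        else:
            labels.append("buy now")
    return labels


# Made-up daily fares for a single route.
fares = [320.0, 310.0, 335.0, 300.0, 340.0, 360.0]
print(self_label(fares, horizon=2))
```

In a real system the most recent observations cannot be labeled reliably yet, because their future has not happened; the sketch simply labels them with whatever lookahead is available.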
So what is AI2 working on to bring us to this next level?
On January 1, 2014, we launched the Allen Institute for AI, a research center dedicated to leveraging modern data mining, text mining, and more in order to make progress on fundamental AI questions, and to develop high-impact AI applications.
And thinking ahead, what would be the killer application that you have in mind for Big Data?
Ideas like “background knowledge” and “common-sense reasoning” are investigated in AI, whereas Big Data and data mining have developed into their own vibrant community. Over the next 10 years, I see the potential for these communities to re-engage with the goal of producing methods that are still scalable, but require less manual engineering and “human intelligence” to work. The killer application would be a Big Data application that easily adapts to a new domain, and that doesn’t make egregious errors because it has “more intelligence”.
Divesh Srivastava is the head of the Database Research Department at AT&T Labs-Research. He received his Ph.D. from the University of Wisconsin, Madison, and his B.Tech from the Indian Institute of Technology, Bombay. He is an ACM fellow, on the board of trustees of the VLDB Endowment, the managing editor of the Proceedings of the VLDB Endowment (PVLDB), and an associate editor of the ACM Transactions on Database Systems. His research interests and publications span a variety of topics in data management.
Xin Luna Dong is a senior research scientist at Google. She works on enriching and cleaning knowledge for the Google Knowledge Graph. Her research interests include data integration, data cleaning, and knowledge management. Prior to joining Google, she worked for AT&T Labs – Research and received her Ph.D. in Computer Science and Engineering at the University of Washington. She is the co-chair for WAIM’15 and has served as an area chair for SIGMOD’15, ICDE’13, and CIKM’11. She won the best-demo award at SIGMOD’05.
Divesh and Luna, you have been working on several aspects of Big Data Value. What attracts you to this topic?
Value, the 5th V of big data, is arguably the promise of the big data era. The choices of what data to collect and integrate, what analyses to perform, and what data-driven decisions to make, are driven by their perceived value – to society, to organizations, and to individuals. It is worth noting that while value and quality of big data may be correlated, they are conceptually different. For example, one can have high quality data about the names of all the countries in North America, but this list of names may not have much perceived value. In contrast, even relatively incomplete data about the shopping habits of people can be quite valuable to online advertisers.
It should not be surprising that early efforts to extract value from big data have focused on integrating and extracting knowledge from the low-hanging fruit of “head” data – data about popular entities, in the current world, from large sources. This is true both in industry (often heavily relying on manual curation) and in academia (often as underlying assumptions of the proposed integration techniques). However, focusing exclusively on head data leaves behind a considerable volume of “tail” data, including data about less popular entities, in less popular verticals, about non-current (historical) facts, from smaller sources, in languages other than English, and so on. While each data item in the “long tail” may provide only little value, the total value present in the long tail can be substantial, possibly even exceeding the total value that can be extracted solely from head data. This is akin to shops making a big profit from a large number of specialty items, each sold in a small quantity, in addition to the profit made by selling large quantities of a few popular items.
We issue a call to arms – “leave no valuable data behind” in our quest to extract significant value from big data.
What is your recent work in this quest?
Our work in this area focuses on the acquisition and integration of big data, and on knowledge extraction from it. More recently, we have been considering a variety of ideas, including looking at collaboratively edited databases, news stories, and “local” information, where multiple perspectives and timeliness can be even more important than guaranteeing extremely high accuracy (e.g., the 99% accuracy requirement for Google’s Knowledge Graph).
We started this body of work a few years ago with the Solomon project for data fusion, to make wise decisions about finding the truth when faced with conflicting information from multiple sources. We quickly realized the importance of copy detection between sources of structured data to solve this problem, and developed techniques that iteratively perform copy detection, source trustworthiness evaluation, and truth discovery. The Knowledge Vault (KV) project and the Sonya project naturally extend the Solomon project to address the challenge of web-scale data. They focus on knowledge fusion, assessing the truthfulness of extracted knowledge from web-scale data, and building probabilistic knowledge bases, in the presence of source errors and extraction errors, with the latter dominating. The Sonya project in addition measures knowledge-based trust, determining the trustworthiness of web sources based on the correctness of the facts they provide.
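The interplay between source trustworthiness and truth discovery can be illustrated with a toy fixed-point iteration. This is a minimal sketch under strong assumptions, not the Solomon, Knowledge Vault, or Sonya algorithm: it omits copy detection and extraction errors entirely, and the function name `discover_truth`, the initial trust of 0.5, and the sample claims are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Tuple

Claims = Dict[str, Dict[str, str]]  # source -> {data item -> claimed value}


def discover_truth(claims: Claims, rounds: int = 10) -> Tuple[Dict[str, str], Dict[str, float]]:
    """Alternate between trust-weighted voting and trust re-estimation."""
    trust = {source: 0.5 for source in claims}  # assumed uniform starting trust
    truth: Dict[str, str] = {}
    for _ in range(rounds):
        # Step 1: each source votes for its claimed value with its current trust.
        votes = defaultdict(lambda: defaultdict(float))
        for source, facts in claims.items():
            for item, value in facts.items():
                votes[item][value] += trust[source]
        # Ties are broken alphabetically so the result is deterministic.
        truth = {item: max(sorted(vals), key=vals.get) for item, vals in votes.items()}
        # Step 2: a source's trust is the fraction of its claims that match
        # the currently believed truth.
        for source, facts in claims.items():
            agree = sum(1 for item, value in facts.items() if truth[item] == value)
            trust[source] = agree / len(facts)
    return truth, trust


# Made-up claims: C and D outvote A on Australia but are wrong elsewhere.
claims = {
    "A": {"australia": "Canberra", "canada": "Ottawa", "brazil": "Brasilia", "japan": "Tokyo"},
    "B": {"canada": "Ottawa", "brazil": "Brasilia", "japan": "Tokyo"},
    "C": {"australia": "Sydney", "canada": "Toronto", "brazil": "Rio", "japan": "Kyoto"},
    "D": {"australia": "Sydney", "canada": "Vancouver", "brazil": "Sao Paulo", "japan": "Osaka"},
}
truth, trust = discover_truth(claims)
print(truth["australia"], trust)
```

A plain majority vote would pick “Sydney” for Australia, because two sources say so. Once the iteration learns that C and D are wrong on the other items, their votes count for less and the single claim from the more reliable source A wins. Two sources that copy each other’s mistakes would defeat this sketch, which is one way to see why copy detection matters.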
Big data often has a temporal dimension, reflecting the dynamic nature of the real world, with evolving entities, relationships and stories. Over the years we have worked on many big data integration problems dealing with evolving data. For example, our work on temporal record linkage addressed the challenging problem of entity resolution over time, which has to deal with evolution of entities wherein their attribute values can change over time, as well as the possibility that different entities are more likely to share similar attribute values over time. We have also looked at quality issues in collaboratively edited databases, with some recent work on automatically identifying fine-grained controversies over time in Wikipedia articles (upcoming paper in ICDE 2015).
More recently, we have been working on the novel topic of data source management, which is of increasing interest because of the proliferation of a large number of data sources in almost every domain of interest. Our initial research on this topic involves assessing the evolving quality of data sources, and enabling the discovery of valuable sources to integrate before actually performing the integration.
Finally, we make a shameless plug for our new book “Big Data Integration” that should be published very soon, which we hope will serve as a starting point for interested readers to pursue additional work on this exciting topic.
And where do you think research will head tomorrow?
In keeping with our theme of “no valuable data left behind”, we think that effectively collecting, integrating, and using tail data is a challenging research direction for the big data community. There are many interesting questions that need to be answered. How should one acquire, integrate, and extract knowledge on tail entities, and for tail verticals, when there may not be many data sources providing relevant data? How can one understand the quality and value of tail data sources? How can such sources be used without compromising on value, even if the data are not of extremely high quality? How does one integrate historical data, including entities that evolve over time, and enable the exploration of the history of web data sources? In addition to freshness, what additional metrics are relevant to capturing quality over time? How does one deal with sources that provide data about future events? How can one integrate data across multiple languages and cultures? Answering these challenging questions will keep our community busy for many years to come.
Gerhard Weikum is a scientific director at the Max Planck Institute for Informatics in Saarbruecken, Germany, where he is leading the department on databases and information systems. He co-authored a comprehensive textbook on transactional systems, received the VLDB 10-Year Award for his work on automatic DB tuning, and is one of the creators of the YAGO knowledge base. Gerhard is an ACM Fellow, a member of several scientific academies in Germany and Europe, and a recipient of a Google Focused Research Award, an ACM SIGMOD Contributions Award, and an ERC Synergy Grant.
What is your motivation for doing research in the area of Big Data Value?
Big Data is the New Oil! This often heard metaphor refers to the world’s most precious raw asset — of this century and of the previous century. However, raw oil does not power any engines or contribute to high-tech materials. Oil needs to be cleaned, refined, and put in an application context to gain its true value. The same holds for Big Data. The raw data itself does not hold any value, unless it is processed in analytical tasks from which humans or downstream applications can derive insight. Here is where data quality comes into play, and in a crucial role.
Some applications may do well with huge amounts of inaccurate or partly erroneous data, but truly mission-critical applications would often prefer less data of higher accuracy and correctness. This Veracity dimension of the data is widely underestimated. In many applications, the workflows for Big Data analytics include major efforts on data cleaning, to eliminate or correct spurious data. Often, a substantial amount of manual data curation is unavoidable and incurs a major cost fraction.
OK, I see. Then what is your recent work in the area of oil refinery?
Much of the research in my group at the Max Planck Institute could actually be cast under this alternative – and much cooler – metaphor: Big Text Data is the New Chocolate!
We believe that many applications would enormously gain from tapping unstructured text data, like news, product reviews in social media, discussion forums, customer requests, and more. Chocolate is a lot more sophisticated and tasteful than oil — and so is natural-language text. Text is full of finesse, vagueness and ambiguities, and so could at best be seen as Uncertain Big Data. A major goal of our research is to automatically understand and enrich text data in terms of entities and relationships and this way enable its use in analytic tasks — on par with structured big data.
We have developed versatile and robust methods for discovering mentions of named entities in text documents, like news articles or posts in social media, and disambiguating them onto entities in a knowledge base or entity catalog. The AIDA software is freely available as open source code. These methods allow us to group documents by entities, entity pairs or entity categories, and compute aggregates on these groups. Our STICS demonstrator shows some of the capabilities for semantic search and analytics. We can further combine this with the detection and canonicalization of text phrases that denote relations between entities, and we can capture special kinds of text expressions that bear sentiments (like/dislike/support/oppose/doubt/etc.) or other important information.
Having nailed down the entities, we can obtain additional structured data from entity-indexed data and knowledge bases to further enrich our text documents and document groups. All this enables a wealth of co-occurrence-based analytics for comparisons, trends, outliers, and more. Obviously, for lifting unstructured text data to this value-added level, the Veracity of mapping names and phrases into entities and relations is decisive.
For example, when performing a political opinion analysis about the Ukrainian politician and former boxer Klitschko, one needs to be careful about not confusing him with his brother who is actively boxing. A news text like “former box champion Klitschko is now the mayor of Kiev” needs to be distinguished from “box champion Klitschko visited his brother in Kiev”. Conversely, a recent text like “the mayor of Kiev met with the German chancellor” should also count towards the politician Vitali Klitschko although it does not explicitly mention his name.
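The Klitschko example can be turned into a deliberately naive sketch of context-based disambiguation. This is not how AIDA works: the keyword sets below are hand-written assumptions, and the function name `disambiguate` is made up for illustration. The sketch simply maps an ambiguous mention to the candidate entity whose keywords overlap most with the surrounding text.

```python
import re
from typing import Dict, Set

# Hypothetical, hand-written context keywords for each candidate entity.
CANDIDATES: Dict[str, Set[str]] = {
    "Vitali Klitschko": {"mayor", "politician", "former", "kiev"},
    "Wladimir Klitschko": {"champion", "box", "boxer", "fight"},
}


def disambiguate(text: str, candidates: Dict[str, Set[str]] = CANDIDATES) -> str:
    """Return the candidate whose keyword set overlaps most with the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    # Sorting first makes tie-breaking deterministic (alphabetical order).
    return max(sorted(candidates), key=lambda name: len(tokens & candidates[name]))


print(disambiguate("former box champion Klitschko is now the mayor of Kiev"))
print(disambiguate("box champion Klitschko visited his brother in Kiev"))
```

With these keyword sets, the first sentence maps to Vitali and the second to Wladimir. The third case in the text, “the mayor of Kiev met with the German chancellor”, contains no name at all, so keyword overlap on mentions is not enough there; handling it requires knowledge about who holds which office at what time.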
Why New Chocolate?
Well, in the Aztec Empire, cocoa beans were so valuable that they were used as currency! Moreover, cocoa contains chemicals that trigger the production of the neurotransmitter Serotonin in our brains – a happiness substance! Yes, you may have to eat thousands of chocolate bars before you experience any notable kicks, but for the sake of the principle: chocolate is so much richer and creates so much more happiness than oil.
Thanks for this culinary encouragement! What do you think will be the future of the field?
Data quality is more than the quest for Veracity. Even if we could ensure that the database has only fully correct and accurate data points, there are other quality dimensions that often create major problems: incompleteness, bias and staleness are three aspects of paramount importance.
No data or knowledge base can ever be perfectly complete, but how do we know which parts we do not know? For a simple example, consider an entertainment music database, where we have captured a song and ten different cover versions of it. How can we tell that there really are only ten covers of that song? If there are more, how can we rule out that our choice of having these ten in the database is biased in some way – for example, reflecting only Western culture versions and ignoring Asian covers? Tapping into text sources, in the New Chocolate sense, can help complete the data, but is also prone to “reporting bias”. The possibility that some of the data is stale, to different degrees, makes the situation even more complex.
Finally, add the Variety dimension on top of all this — not a single database but many independent data and text sources with different levels of incompleteness, bias, and staleness. Assessing the overall quality that such heterogeneous and diverse data provides for a given analytic task is a grand challenge. Ideally, we would like to understand how the quality of that data affects the quality of the insight we derive from it. If we consider data cleaning measures, what costs do we need to pay to achieve which improvements in data quality and analytic-output quality? I believe these are pressing research issues; their complexity will keep the field busy for the coming years.
Do you agree with Serge Abiteboul that knowledge management will be the killer app of Big Data? Do you share Gerhard Weikum’s opinion that Big Text Data is the New Chocolate? Do you have ideas on how to achieve the “no valuable data left behind” mantra that Divesh Srivastava and Luna Dong evoke? Does your work marry the domains of AI and Big Data, as Oren Etzioni proposes? We would be delighted to hear your opinion and your latest contribution in this field! This year’s WebDB workshop, which will be co-located with ACM SIGMOD, will provide a premier venue for discussing issues of big data quality. Its theme is “Freshness, Correctness, Quality of Information and Knowledge on the Web”. This theme encompasses a wide range of research directions, from focused crawling and time-aware search and ranking, to information extraction and data integration, to management, alignment, curation, and integration of structured knowledge, and to information corroboration and provenance. However, papers on all aspects of the Web and databases are solicited. We are looking forward to your submissions, and to interesting discussions about whether the future will bring us not just big data, but also good data!
Blogger Profile: Julia Stoyanovich is an Assistant Professor of Computer Science at Drexel University. She was previously a postdoctoral researcher and a CIFellow at the University of Pennsylvania. Julia holds M.S. and Ph.D. degrees in Computer Science from Columbia University, and a B.S. in Computer Science and in Mathematics and Statistics from the University of Massachusetts at Amherst. After receiving her B.S., Julia went on to work for two start-ups and one real company in New York City, where she interacted with, and was puzzled by, a variety of massive datasets. Julia’s research focuses on developing novel information discovery approaches for large datasets in the presence of rich semantic and statistical structure. Her work has been supported by the NSF and by Google.
Blogger Profile: Fabian M. Suchanek is an associate professor at the Telecom ParisTech University in Paris. Fabian developed inter alia the YAGO-Ontology, one of the largest public knowledge bases on the Semantic Web, which earned him an honorable mention of the SIGMOD dissertation award. His interests include information extraction, automated reasoning, and knowledge bases. Fabian has published around 40 scientific articles, among others at ISWC, VLDB, SIGMOD, WWW, CIKM, ICDE, and SIGIR, and his work has been cited more than 3500 times.
Most professional fields, whether in business or academia, rely on data and have done so for centuries. In the digital age and with the emergence of Big Data, this dependency is growing dramatically – perhaps out of proportion to its current value given the concepts, tools, and techniques presently available. For example, how do you tell if the results of data-intensive analysis are correct and reliable and not weak or even spurious? Most data-intensive disciplines have statistical measures that attempt to calculate meaning or truth. Efficacy quantifies the strength of a relationship within a system, such as biology or business. For example, when researchers investigate a new drug, they compare its effectiveness to a placebo, using statistics to determine whether the drug worked. This approach, where data selection and processing are predicated on complex models rather than simple comparison, is a far cry from select-project-join queries.
Efficacy is the capacity to produce a desired result or effect. In medicine, it is the ability of an intervention or drug to produce an outcome. P-values have been a conventional empirical metric of efficacy for 100 years.
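As a small illustration of what such a statistical comparison looks like in code, here is a sketch of a one-sided permutation test for a drug-versus-placebo study. Everything specific here is an assumption for illustration: the outcome scores are invented, the function name `permutation_p_value` is hypothetical, and a permutation test is just one of several ways to obtain a p-value.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_p_value(
    treated: Sequence[float],
    control: Sequence[float],
    n_permutations: int = 5000,
    seed: int = 0,
) -> float:
    """One-sided permutation test on the difference of group means.

    Estimates how often a random relabeling of the pooled outcomes gives
    a treated-minus-control difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_treated]) - mean(pooled[n_treated:])
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Invented outcome scores (higher is better).
drug = [8.1, 7.4, 9.0, 8.6, 7.9, 8.8]
placebo = [6.2, 7.0, 6.5, 7.1, 6.8, 6.0]
print(permutation_p_value(drug, placebo))
```

Note how little of this resembles a select-project-join query: the answer is a probability under a model of chance, not a set of matching tuples.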
Moreover, the underlying data in these fields is complex, uncertain, and multimodal. Despite a large body of research on data management for science applications, there has been little adoption of relational techniques in the science disciplines. In this post, we examine two challenges. The first is modeling data around domain-specific efficacy rather than set theory. The second is supporting ensembles of data models to enable many perspectives on a single data set.
The big picture is compelling. Since the late 1980s, one of us has estimated in papers and keynotes that databases contain less than 10% of the world’s data, a share that is dropping fast as non-database data growth explodes. A corresponding fraction of the world’s applications – data and computation – are amenable to traditional databases. Modelling the 90% opens the door for the database community to the requirements of the rest of the world’s data and a new, vastly larger generation of database research and technology. This calls for a shift in our community commensurate with the profound changes introduced by Big Data.
Efficacy first, then efficiency
Since meaning and truth are relative to a system, efficacy measures are measures of accuracy, correctness, precision, and significance with respect to a context. That we can compute an answer efficiently – at lightning speed over massive data sets – is entirely irrelevant or even harmful if we cannot demonstrate that the answer is meaningful or at least approximately right in a given context. As fields develop and complexity increases, efficacy measures become increasingly sophisticated, refined, and debated. For example, p-values – the gold standard of empirical efficacy – have been questioned for decades, especially under the pressure of increasing irreproducibility in science. The same is true for precision and recall in information retrieval. In fact, since most fields that depend on data involve uncertainty, measures of efficacy are being questioned everywhere, with the notable exception of data management.
Big Data, broadly construed, is inherently multidisciplinary but often lacks the efficacy measures of its constituent disciplines – statistics, machine learning, empiricism, data mining, information retrieval, among others – let alone those of application domains such as finance, biology, clinical studies, high-energy physics, drug discovery, and astrophysics. One reason for this is that efficacy measures that have been developed in the small data world, based on statistics and other fields, do not necessarily hold true over massive data sets. Efficacy in this context is an important, open, and rich research challenge. The value and success of data-intensive discovery (Big Data) depend on achieving adequate means of evaluating the efficacy of its results. A notable exception is the Baylor-Watson result that focused first on efficacy, i.e., modeling, that then contributed to efficiency. But efficacy is one aspect of a larger challenge – modeling.
As data management evolved it distinguished itself from information retrieval by not requiring efficacy measures, since databases were bounded, discrete, and complied with well-defined models, e.g., schemas. In contrast, information retrieval (and later machine learning) searched data sets for complex correlations rather than rigidly defined predicates. Finding relationships like “select all pairs where their covariance is greater than x” is inherently iterative and compute-intensive. In contrast, the contents of a database either match a query or they do not – black or white. No need for estimating accuracy, confidence, or probabilities. Relational data is the servant of the data model and the query. This permitted massive performance improvements that led to the widespread adoption of databases in applications for which schemas made sense. If the data did not comply with the schema and the query within that, then the data was erroneous by definition and should be rejected or corrected. It was right to constrain data when we had a well-defined model. And we could always get the model right – right? Many courageous researchers over the past 50 years have studied this problem, including through probabilistic databases and fuzzy logic (and more), but none of these approaches has seen widespread adoption. Why?
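The contrast between a black-or-white predicate and the covariance query quoted above can be sketched in a few lines of Python. The column names and values are invented for illustration, and `pairs_with_covariance_above` is a hypothetical helper, not part of any database system.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def covariance(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sample covariance of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pairs_with_covariance_above(
    columns: Dict[str, List[float]], x: float
) -> List[Tuple[str, str, float]]:
    """'Select all pairs where their covariance is greater than x'."""
    result = []
    # Every pair of columns must be scanned in full: quadratic in the
    # number of columns, linear in the number of rows per pair.
    for a, b in combinations(sorted(columns), 2):
        c = covariance(columns[a], columns[b])
        if c > x:
            result.append((a, b, c))
    return result


# Invented measurements.
columns = {
    "temperature": [1.0, 2.0, 3.0, 4.0],
    "ice_cream_sales": [2.0, 4.0, 6.0, 8.0],
    "umbrella_sales": [4.0, 3.0, 2.0, 1.0],
}

# A relational-style predicate: each value either matches or it does not.
warm_days = [t for t in columns["temperature"] if t > 2.5]

# The statistical query: the answer depends on whole columns, and on how
# much we trust a covariance estimated from only four data points.
print(warm_days)
print(pairs_with_covariance_above(columns, x=1.0))
```

The predicate can be evaluated one tuple at a time and indexed. The covariance query cannot, and its result immediately raises an efficacy question that the predicate never does: is a covariance estimated from this sample meaningful?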
While the non-database world – life sciences, high-energy physics, astrophysics, finance – opened the door to Big Data and its possibilities, the data management world is aspiring to take ownership of their infrastructure – the storage, management, manipulation, querying, and searching of massive datasets. Currently much of this work is done in an ad-hoc manner using tools like R and Python. What is required for a more general solution? The non-database world is driven by applications – solving problems with real-world constraints – achieving efficacy within the models and definitions of their domain – often with 400 or 500 years of history.
In contrast, the database landscape is predominantly concerned with efficiency and has not dealt head on with efficacy yet. Some of these issues have been addressed in the database context in terms of specific models, languages, and design, but seldom have those concerns impacted the core database infrastructure, let alone gained adoption. Perhaps database researchers focused only on application domains that are well-behaved. While efficacy is a critical requirement – possibly the most critical requirement – in domains that make extensive use of data, it is part of the broader requirement for modelling unmet by database systems.
Data Models for the 90%
The database community, like many others, perhaps has not fully internalized the paradigm shift from small to big data. Big Data – or data per se – does not create the change nor is itself the change. Big Data opens the door to a revolution in thinking. One aspect involves data-driven methods. A profound shift involves viewing phenomena from multiple perspectives simultaneously.
A significant aspect of this shift is that every Big Data activity (small data activities also, but with less impact) requires measures of efficacy for each perspective or model. This is not simply owing to the reframing of corresponding principles from empirical science, but also to the multiple meanings of data, each of which requires mechanisms for addressing efficacy.
If the data management community is to provide solutions for this nascent challenge, then it will need to deal with efficacy. This essentially has to do with modelling, a chaotic and ad-hoc database topic that has been largely unsuccessful, again measured by adoption. The relational model has dominated databases for over 40 years largely owing to efficiency. The database community knows how to optimize anything expressed relationally. While the relational model has proven to be amazingly general, its adoption has been limited in many domains, especially the sciences.
A related limitation of the database world is the assumption of a single perspective, e.g., a single version of truth, one schema per database even with multiple views. For more than a decade physics, astrophysics, photonics, biology, indeed most physical sciences as well as statistics and machine learning have made the modest assumption that multiple perspectives may be more valuable than a single model.
In The Structure of Scientific Revolutions, Kuhn argued that science undergoes paradigm shifts only when there are rival theories about the fundamentals of a discipline. It is his position that rival paradigms are incommensurable, using entirely different concepts and evaluation metrics from one another. One such example is the wave and particle theories of light. Each has entirely different models and measures of efficacy. Understanding the big picture necessitates finding consistencies and anomalies in both theories.
Ensemble models are one approach to addressing this challenge. Let’s consider an example in evolutionary biology where researchers use a collection of models to learn about how the human genome has changed over time. Grossman et al. identified positive examples of natural selection in recent human populations. Their discoveries have two parts: the affected gene’s location and its (improved) mutation. By composing many signals of natural selection, the authors increase the resolution of their genomic map by up to 100x. This research computes genetic signals at many levels, from clustering genes that are likely to be inherited together to looking at the high-level geographic distribution of different mutations. In present database modeling, the former might be represented as a graph database, whereas the latter is more likely to fall into the purview of geospatial databases. How can we bring them together? Perhaps neither of these models is designed for computing how effective different genetic variations are at producing advantageous traits. This pattern repeats itself in meteorology, physics, and a myriad of other domains that mathematically model large, dynamic systems.
Ensemble models pose substantial challenges to the data management community. How do you simultaneously store, manage, query, and update this variety of models, applying to a single dataset with many, possibly conflicting schemas? Database folks may first be concerned about doing this efficiently. Nope – wrong question. The first step is to understand the problem, to ask the right questions, to get the model correct and only then to make it efficient. How do you support ensemble models and their requirements including efficacy?
This may be why application domains that use massive data sets have grown their own data management tools, such as Hadoop, ADAM, Wikidata, and Scientific Data Management Systems, let alone a plethora of such tools in most physical science communities that the database community has never heard of. It’s not just that their data does not fit the relational model; databases do not support ensemble models, efficacy, or many of the fundamental concepts used to understand data. Why would any application domain (e.g., physical sciences, clinical studies, drug discovery) or discipline (e.g., information retrieval, machine learning, statistics) want to partner with an infrastructure technology that did not support its basic principles?
The database community has developed amazing technology that has changed the world. Since the early 1990s it has extended its models to non-relational models, such as those for networks, text, graphs, arrays, and many more. But efficacy is not just an issue of expressing eScience applications relationally, as UDFs or in R, but of modeling and computing hypotheses under the complex contexts defined by domain experts, none of which map easily to set theory or other discrete mathematics. Stepping into the void of uncertainty, unboundedness, ensemble models, and open-ended model exploration is far harder and scarier. We call it Computing Reality.
Big Data is opening the door to a paradigm shift in many human endeavors. Machine learning was first through the door with real, albeit preliminary, results and it is already on to the next generation with deep learning. Analytics and other domains are riding the wave of machine learning. The database community is heading for the door now, but it will be challenging. We first have to understand the problem and get the requirements right. To paraphrase Ron Fagin, we need to focus on asking the right questions. The rest may be a breeze, but efficacy before efficiency!
So not only are we leaving the relational world that was dominated by one model or a class of discrete models, but we are leaving the world of a single model for each dataset and embarking on a journey into a world of ensemble models, including probabilistic, fuzzy, and even potentially the richest model of them all, a model-free approach that enables us to listen to the data. All at scale. This seems scary to us but also just what we need.
Are we crazy, naive? Isn’t it our mission to dig in this data goldmine, to contribute to accelerating scientific discovery? What do you think? We are all ears.
 Duggan, Jennie and Michael L. Brodie, Hephaestus: Virtual Experiments for Data-Intensive Science, In CIDR 2015 (to appear)
 Gomes, Lee. Machine-Learning Maestro Michael Jordan on the Delusions of Big Data and Other Huge Engineering Efforts, IEEE Spectrum, 20 Oct 2014
 Grossman, Sharon R., et al. “A composite of multiple signals distinguishes causal variants in regions of positive selection.” Science 327.5967 (2010): 883-886.
 National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013
 Scott Spangler, Angela D. Wilkins, Benjamin J. Bachman, Meena Nagarajan, Tajhal Dayaram, Peter Haas, Sam Regenbogen, Curtis R. Pickering, Austin Comer, Jeffrey N. Myers, Ioana Stanoi, Linda Kato, Ana Lelescu, Jacques J. Labrie, Neha Parikh, Andreas Martin Lisewski, Lawrence Donehower, Ying Chen, and Olivier Lichtarge. 2014. Automated hypothesis generation based on mining scientific literature. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ’14). ACM, New York, NY, USA, 1877-1886. DOI=10.1145/2623330.2623667 http://doi.acm.org/10.1145/2623330.2623667
 Kuhn, Thomas S. The structure of scientific revolutions. University of Chicago press, 2012.
Bloggers’ Profiles:
Dr. Brodie has over 40 years of experience in research and industrial practice in databases, distributed systems, integration, artificial intelligence, and multi-disciplinary problem solving. He is concerned with the Big Picture aspects of information ecosystems including business, economic, social, application, and technical. Dr. Brodie is a Research Scientist, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology; advises startups; serves on Advisory Boards of national and international research organizations; and is an adjunct professor at the National University of Ireland, Galway. For over 20 years he served as Chief Scientist of IT, Verizon, a Fortune 20 company, responsible for advanced technologies, architectures, and methodologies for Information Technology strategies and for guiding industrial scale deployments of emergent technologies. His current research and applied interests include Big Data, Data Science, and data curation at scale, and a related startup, Tamr.com. He has served on several National Academy of Science committees. Dr. Brodie holds a PhD in Databases from the University of Toronto and a Doctor of Science (honoris causa) from the National University of Ireland.
Jennie Duggan is a postdoctoral associate at MIT CSAIL working with Michael Stonebraker and an adjunct assistant professor at Northwestern University. She received her Ph.D. from Brown University in 2012 under the supervision of Ugur Cetintemel. Her research interests include scientific data management, database workload modeling, and cloud computing. She is especially focused on making data-driven science more accessible and scalable.
“Both theoretical and empirical research may be unnecessarily complicated by failure to recognize the effects of heterogeneity” – Vaupel & Yashin
Big Data is a daily topic of conversation among data analysts, with much said and written about its promises and pitfalls. The issue of heterogeneity, however, has received scant attention. This is unfortunate, since failing to take heterogeneity into account can easily derail the discoveries one makes using these data.
This issue, which some may recognize as an example of the ecological fallacy, first came to my attention via a paper elegantly titled “Heterogeneity’s Ruses: Some Surprising Effects of Selection on Population Dynamics” (Vaupel and Yashin, 1985). The authors discuss a variety of examples where the aggregated behavior of a heterogeneous population, composed of two homogeneous but differently behaving subpopulations, will differ from the behavior of any single individual. Consider the following example. It has been observed that the recidivism rate of convicts released from prison declines with time. A natural conclusion one may reach from this observation is that former convicts are less likely to commit crime as they age. However, this conclusion does not necessarily follow. In reality, there may be two groups of individuals, “reformed” and “incorrigible”, with constant – but different – recidivism rates. With time, there will be more “reformed” individuals left in the population, as the “incorrigibles” are sent back to prison, resulting in a decreasing recidivism rate for the population as a whole. This simple example shows that “the patterns observed [at population level] may be surprisingly different from the underlying patterns on the individual level. Researchers interested in uncovering these individual patterns, perhaps to help develop or test theories or to make predictions, might benefit from an understanding of heterogeneity’s ruses.” (Vaupel & Yashin)
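The arithmetic behind the recidivism example can be sketched in a few lines of Python. The group sizes (500 each) and the yearly re-offence probabilities (5% and 40%) are assumptions chosen purely for illustration; they are not taken from Vaupel and Yashin's paper, and the function name is hypothetical.

```python
def population_recidivism(n_reformed, n_incorrigible,
                          p_reformed, p_incorrigible, years):
    """Observed yearly recidivism rate of a mixed population.

    Each group keeps a CONSTANT yearly re-offence probability. Anyone who
    re-offends returns to prison and leaves the observed population.
    All numbers passed in are illustrative assumptions, not real data.
    """
    rates = []
    for _ in range(years):
        # Expected number of people who re-offend this year.
        offenders = n_reformed * p_reformed + n_incorrigible * p_incorrigible
        rates.append(offenders / (n_reformed + n_incorrigible))
        # Re-offenders leave; the mix shifts toward the "reformed" group.
        n_reformed *= 1 - p_reformed
        n_incorrigible *= 1 - p_incorrigible
    return rates


rates = population_recidivism(500.0, 500.0, 0.05, 0.40, years=6)
for year, rate in enumerate(rates, start=1):
    print(f"year {year}: population recidivism rate = {rate:.3f}")
```

Under these assumptions the population-level rate starts at 22.5% in the first year and falls every year afterwards, even though no individual's behavior changes: the rate is a weighted average of 5% and 40%, and the weight of the 40% group shrinks as its members leave the population.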
My colleagues and I have been tricked by heterogeneity time and again. As one example, our study of information spread on the follower graphs of Twitter and Digg revealed that it was surprisingly different from the simple epidemics that are often used to model information spread. In a simple epidemic, described, for example, by an independent cascade model, the probability of infection increases monotonically with the number of exposures to infected friends. This probability is measured by the exposure response function. Figure 1 shows the exposure response function we measured on Twitter: the probability of becoming infected by (i.e., retweeting) a piece of information (a URL) as a function of how many friends had previously tweeted it. In contrast to epidemics, it appears as though repeated exposure to information suppresses infection probability. We measured an even more pronounced suppression of infection on Digg [Ver Steeg et al, 2011], and a similar exposure response was observed for hashtag adoption following friends’ use of them [Romero et al, 2011].
It is easy to draw wrong conclusions from this finding. In “What stops social epidemics?” [Ver Steeg et al, 2011], we reported that information spread on Digg is quickly extinguished, and attributed this to the exposure response function. We speculated that initial exposures “inoculate” users to information, so that they will not become infected (i.e., propagate it) despite multiple exposures. Now we know this explanation was completely wrong.
The exposure response function, while aggregated over all users, does not describe the behavior of any individual Digg or Twitter user – even the hypothetical “typical” user. In fact, there is no “typical” Twitter (or Digg) user. Twitter users are extremely heterogeneous. Separating them into more homogeneous sub-populations reveals a more regular pattern. Figure 2 shows the exposure response function for different populations of Twitter users, separated according to the number of friends they follow (large fluctuations are the result of small sample size). Why number of friends? This is explained in more detail in our papers [Hodas & Lerman 2012, 2013], but in short, we found it useful to separate users according to their cognitive load, i.e., the volume of information they receive, which is (on average) proportional to the number of friends they follow [Hodas et al, 2013]. Now, the probability that a user within each population will become infected increases monotonically with the number of infected friends, very similar to the predictions of the independent cascade model.
Figure 2 has a different, more significant interpretation, with consequences for information diffusion. It suggests that highly connected users, i.e., those who follow many others, are less susceptible to becoming infected. Their decreased susceptibility in fact explains Figure 1: as one moves to the right of the exposure response curve, only the better connected, and less sensitive, users contribute to that portion of the response. However, despite their reduced susceptibility, highly connected users respond positively to repeated exposures, like all other users. You do not inhibit response by repeatedly exposing people to information. Instead, the reason that these users are less susceptible hinges on the human brain’s limited bandwidth. There are only so many tweets anyone can read: the more tweets you receive (on average proportional to the number of friends you follow), the less likely you are to see – and retweet – any specific tweet. Had we not recognized heterogeneity, we would not have found this far more interesting explanation.
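The way pooling can reverse a trend that holds within every sub-population can be reproduced with a toy calculation. The counts below are invented for illustration only; they are not measurements from the Twitter or Digg studies, and the two group labels and the function names are hypothetical. The sketch computes the exposure response separately for each group and then for the pooled population.

```python
from fractions import Fraction

# HYPOTHETICAL counts, invented for illustration (not data from any study).
# counts[group][n] = (users exposed n times, users among them who retweeted)
counts = {
    "few friends":  {1: (1000, 100), 2: (200, 30), 3: (20, 4)},
    "many friends": {1: (100, 1),    2: (800, 16), 3: (2000, 60)},
}


def response(table):
    """Exposure response: fraction of exposed users who retweeted, per n."""
    return {n: Fraction(hits, users) for n, (users, hits) in sorted(table.items())}


def pooled_response(groups):
    """Exposure response after pooling all groups into one population."""
    total = {}
    for table in groups.values():
        for n, (users, hits) in table.items():
            seen_users, seen_hits = total.get(n, (0, 0))
            total[n] = (seen_users + users, seen_hits + hits)
    return response(total)


per_group = {name: response(table) for name, table in counts.items()}
pooled = pooled_response(counts)

for name, resp in per_group.items():
    print(name, [round(float(p), 4) for p in resp.values()])
print("pooled", [round(float(p), 4) for p in pooled.values()])
```

With these made-up counts, the response rises with the number of exposures inside each group but falls in the pooled population, because users who follow many friends (and are less susceptible per exposure) make up most of the users at high exposure counts.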
Blogger’s Profile:
Kristina Lerman is a Project Leader at the University of Southern California Information Sciences Institute and holds a joint appointment as a Research Associate Professor in the USC Computer Science Department. After a brief stint as a theoretical roboticist, she found her calling in blending together methods from physics, computer science and social science to address problems in social computing and social media analysis. She writes many papers that are greatly enjoyed by all of their twenty readers.
Big Data should be Interesting Data!
There are various definitions of Big Data; most center around a number of V’s like volume, velocity, variety, veracity – in short: interesting data (interesting in at least one aspect). However, when you look into research papers on Big Data, in SIGMOD, VLDB, or ICDE, the data that you see in experimental studies is utterly boring. Performance and scalability experiments are often based on the TPC-H benchmark: completely synthetic data with a synthetic workload that has been beaten to death for the last twenty years. Data quality, data cleaning, and data integration studies are often based on bibliographic data from DBLP, usually old versions with fewer than a million publications, prolific authors, and curated records. I doubt that this is a real challenge for tasks like entity linkage or data cleaning. So where’s the – interesting – data in Big Data research?
Surely, companies have their own interesting data, and industrial labs have access to such data and real-life workloads. However, all this is proprietary and out of reach for academic research. Therefore, many researchers resort to the good old TPC-H benchmark and DBLP records and call it Big Data. Insights from TPC-H and DBLP are, however, usually not generalizable to interesting and truly challenging data and workloads. Yes, there are positive exceptions; I just refer to a general trend.
Looking Across the Fence: Experimental Data in other Research Communities
Now that I have alerted you, let me be constructive. I have also worked in research communities other than database systems: information retrieval, Web and Semantic Web, knowledge management (yes, a bit of AI), and recently also computational linguistics (a.k.a. NLP). These communities have a different mindset towards data resources and their use in experimental work. To them, data resources like Web corpora, annotated texts, or inter-linked knowledge bases are vital assets for conducting experiments and measuring the progress in the field. These are not static benchmarks that are defined once every ten years; rather, relevant resources are continuously crafted and their role in experiments is continuously re-thought. For example, the IR community has new experimental tasks and competitions in the TREC, INEX, and CLEF conferences each year. Computational linguistics has an established culture of including the availability of data resources and experimental data (such as detailed ground-truth annotations) in the evaluation of submissions to their top conferences like ACL, EMNLP, CoNLL, and LREC. Review forms capture this aspect as an important dimension for all papers, not just for a handful of specific papers tagged Experiments & Analyses.
Even the Semantic Web community has successfully created a huge dataset for experiments: the Web of Linked Data consisting of more than 30 Billion RDF triples from hundreds of data sources with entity-level sameAs linkage across sources. What an irony: ten years ago we database folks thought of Semantic Web people as living in the ivory tower, and now they have more data to play with than we (academic database folks) can dream of.
Towards a Culture Shift in Our Community
Does our community lack the creativity and agility that other communities exhibit? I don’t think so. Rather I believe the problem lies in our publication and experimental culture. Aspects of this topic were discussed in earlier posts on the SIGMOD blog, but I want to address a new angle. We have over-emphasized publications as achievements in themselves: our community’s currency is the paper count rather than the intellectual insight and re-usable contribution. Making re-usable software available is appreciated, but it’s a small point in the academic value system when it comes to hiring, tenure, or promotion decisions. Contributing data resources plays an even smaller role. We need to change this situation by rewarding work on interesting data resources (and equally on open-source software): compiling the data, making it available to the community, and using it in experiments.
There are plenty of good starting points. The Web of Linked Data, with general-purpose knowledge bases (DBpedia, Freebase, Yago) and a wealth of thematically focused high-quality sources (e.g., musicbrainz, geospecies, openstreetmap, etc.), is a great opportunity. This data is huge, structured but highly heterogeneous, and includes substantial parts of uncertain or incomplete nature. Internet archives and Web tables (embedded in HTML pages) are further examples; enormous amounts of interesting data are easily and legally available by crawling or download. Finally, in times when energy, traffic, environment, health, and general sustainability are key challenges on our planet, more and more data by public stakeholders is freely available. Large amounts of structured and statistical data can be accessed at organizations like OECD, WHO, Eurostat, and many others.
Merely pointing to these opportunities is not enough. We must create stronger incentives for papers to actually provide new, interesting data resources and open-source software. The least we can do is to extend review reports to include the contribution of novel data and software. A more far-reaching step is to make data and experiments an essential part of the academic currency: how many of your papers contributed data resources, how many contributed open-source software? This should matter in hiring, tenure, and promotion decisions. Needless to say, all this applies to non-trivial, value-adding data resource contributions. Merely converting a relational database into another format is not a big deal.
I believe that computational linguistics is a great role model for experimental culture and the value of data. Papers in premier conferences earn extra credit when accompanied by data resources, and there are highly reputed conferences like LREC which are dedicated to this theme. Moreover, papers of this kind or even the data resources themselves are frequently cited. Why don’t we, the database community, adopt this kind of culture and give data and data-driven experiments the role that they deserve in the era of Big Data?
Is the Grass Always Greener on the Other Side of the Fence?
Some people may argue that rapidly changing setups for data-driven experiments are not viable in our community. In the extreme, every paper could come with its own data resources, making it harder to ensure the reproducibility of experimental results. So, the argument goes, we had better stick to established benchmarks like TPC-H and DBLP author cleaning. This is the opponent’s argument. I think the argument that more data resources hinder repeatability is flawed and merely a cheap excuse. Rather, a higher rate of new data resources and experimental setups goes very well with calling upon the authors’ obligation to ensure reproducible results. The key is to make the publication of data and full details of experiments mandatory. This could be easily implemented in the form of supplementary material that accompanies paper submissions and, for accepted papers, would also be archived in the publication repository.
Another argument could be that Big Data is too big to effectively share. However, volume is only one of the criteria for making a dataset Big Data, that is, interesting for research. We can certainly make 100 Gigabytes available for download, and organizations like NIST (running TREC), LDC (hosting NLP data), and the Internet Archive prove that even Terabytes can be shared by asking interested teams to pay a few hundred dollars for shipping disks.
A caveat that is harder to counter is that real-life workloads are so business-critical that they cannot possibly be shared. Yes, there were small scandals about query-and-click logs from search engines as they were not properly anonymized. However, the fact that engineers did not do a good job in these cases does not mean that releasing logs and workloads is out of the question. Why would it be impossible to publish a small representative sample of analytic queries over Internet traffic data or advertisement data? Moreover, if we focus on public data hosted by public services, wouldn’t it be easy to share frequently posed queries?
Finally, a critical issue to ponder is the position of industrial research labs. In the SIGMOD repeatability discussion a few years ago, they made it a point that software cannot be disclosed. Making experimental data available is a totally different issue, and would actually avoid the problem with proprietary software. Unfortunately, we sometimes see papers from industrial labs that show impressive experiments, but give neither details nor data and leave zero chance for others to validate the papers’ findings. Such publications that crucially hinge on non-disclosed experiments violate a major principle of good science: the falsifiability of hypotheses, as formulated by the Austrian-British philosopher Karl Popper. So what should industrial research groups do (in my humble opinion)? They should use public data in experiments and/or make their data public (e.g., in anonymized or truncated form, but in the same form that is used in the experiments). Good examples in the past include the N-gram corpora that Microsoft and Google released. Papers may use proprietary data in addition, but when a paper’s contribution lives or dies with a large non-disclosed experiment, the paper cannot be properly reviewed by the open research community. For such papers, which can still be insightful, conferences have industrial tracks.
Last but not least, who could possibly act on this? Or is all this merely public whining, without addressing any stakeholders? An obvious answer is that the steering boards and program chairs of our conferences should reflect and discuss these points. It should not be a complex maneuver to extend the reviewing criteria for the research tracks of SIGMOD, VLDB, ICDE, etc. This would be a step in the right direction. Of course, truly changing the experimental culture in our community and influencing the scholarly currency in the academic world is a long-term process. It is a process that affects all of us, and should be driven by each of you. Give this some thought when writing your next paper with data-intensive experiments.
The above considerations are food for thought, not a recipe. If you prefer a concise set of tenets and recommendations at the risk of oversimplification, here is my bottom line:
Overall, we need a culture shift to encourage more work on interesting data for experimental research in the Big Data wave.
Blogger’s Profile:
Gerhard Weikum is a Research Director at the Max-Planck Institute for Informatics (MPII) in Saarbruecken, Germany, where he is leading the department on databases and information systems. He is also an adjunct professor in the Department of Computer Science of Saarland University in Saarbruecken, Germany, and he is a principal investigator of the Cluster of Excellence on Multimodal Computing and Interaction. Earlier he held positions at Saarland University in Saarbruecken, Germany, at ETH Zurich, Switzerland, at MCC in Austin, Texas, and he was a visiting senior researcher at Microsoft Research in Redmond, Washington. He received his diploma and doctoral degrees from the University of Darmstadt, Germany.
Big Data is the buzzword in the database community these days. Two of the first three blog entries of the SIGMOD blog are on Big Data. There was a plenary research session with invited talks at the 2012 SIGMOD Conference and there will be a panel at the 2012 VLDB Conference. Probably, everything has already been said that can be said. So, let me just add my own personal data point to the sea of existing opinions and leave it to the reader whether I am adding to the “signal” or adding to the “noise”. This blog entry is based on the talk that I gave at SIGMOD 2012 and the slides of that talk can be found at http://www.systems.ethz.ch/Talks .
Upfront, I would like to make clear that I am a believer. Stepping back, I am asking myself: why do I work on Big Data technologies? I came up with two potential reasons: to make the world a better place, and because we can.
In the following, I would like to explain my personal view on these two reasons.
Making the World a Better Place
The real question to ask is whether bigger = smarter. The simple answer is “yes”. The success of services like Google and Bing is evidence for the “bigger = smarter” principle. The more data you have and can process, the higher the statistical relevance of your analysis and the better answers you get. Furthermore, Big Data allows you to make statements about corner cases and the famous “long tail”. Putting it differently, “experience” is more valuable than “thinking”.
The more complicated answer to the question whether bigger is smarter is “I do not know”. My concern is that the bigger Big Data gets, the more difficult we make it for humans to get involved. Who wants to argue with Google or Bing? In the end, all we can do is trust the machine learning. However, Big Data analytics needs as much debugging as any other software we produce, and how can we help people to debug a data-driven experiment with 5 PB of data? Putting it differently, what do you make of an experiment that validates your hypothesis with 5 PB of data but does not validate your hypothesis with, say, 1 KB of data using the same piece of code? Should we just trust the “bigger = smarter” principle and use the results of the 5 PB experiment to claim victory?
The more fundamental problem is that Big Data technologies tempt us into doing experiments for which we have no ground truth. Often, the absence of a ground truth is the reason for using Big Data: If we knew the answer already, we would not need Big Data. Despite all the mathematical and statistical tools that are available today, however, debugging a program without knowing what the program should be doing is difficult. To give an example: Let us assume that a Big Data study revealed that the leftmost lane is the fastest lane in a traffic jam. What does this result mean? Does it mean that we should all be driving in the left lane? Does it mean that people in the left lane are more aggressive? Or does it mean that people in the left lane just believe that they are faster? This example combines all the problems of discovering facts without a ground truth: By asking the question, you are biasing the result. And by getting a result, you might be biasing the future result, too. (And, of course, if you had done the same study only looking at data from Great Britain, you might have come to the opposite conclusion that the rightmost lane is the fastest.)
Google Translate is a counterexample and clearly a Big Data success story: Here, we do know the ground truth and Google developers are able to debug and improve Google Translate based on that ground truth – at least as long as we trust our own language skills more than we trust Google. (When it comes to spelling, I actually already trust Google and Bing more than I trust myself.)
Maybe all I am trying to say is that we need to be more careful in what we promise and not forget to keep the human in the loop. I trust statisticians that “bigger is smarter”, but I also believe that humans are even smarter and the combination is what is needed, thereby letting each party do what it is best at.
Because We Can
Unfortunately, we cannot make humans become smarter (and we should not even try), but we can try to make Big Data bigger. Even though I argued in the previous section that it is not always clear that bigger Big Data makes the world a better or smarter place, we as a data management community should be constantly pushing to make Big Data bigger. That is, we should build data management tools that scale, perform well, and are cost effective and get continuously better in all regards. Honestly, I do not know how that will make the world a better place, but I am optimistic that it will: History teaches that good things will happen if you do good work. Also, we should not be shy to make big promises such as processing 100 PB of heterogeneous data in real-time – if that is what our customers want and are willing to pay for. We should also continue to encourage people to collect all the data and then later think about what to do with it. If there are risks in doing all that (e.g., privacy risks), we need to look at those, too, and find ways to reduce those risks and still become better at our core business of becoming bigger, faster, and cheaper. We might not be
There are two things that we need to change, however. First, we need to build systems that are explicit about the utility / cost tradeoff of Big Data. Mariposa pioneered this idea in the Nineties; in Mariposa, utility was defined as response time (the faster the higher the utility), but now things get more complicated: With Big Data, utility may include data quality, data diversity, and other statistical metrics of the data. We need tools and abstractions that allow users to explicitly specify and control these metrics.
Second, we need to package our tools in the right way so that users can use them. There is a reason why Hadoop is so successful even though it has so many performance problems. In my opinion, one of the reasons is that it is not a database system. Yet, it can be a database system if combined with other tools of the Hadoop eco-system. For instance, it can be a transactional database system if combined with HDFS, Zookeeper, and HBase. However, it can also become a logging system to help customer support if combined with HDFS and SOLR. And, of course, it can
Blogger’s Profile:
Donald Kossmann is a professor in the Systems Group of the Department of Computer Science at ETH Zurich (Switzerland). He received his MS in 1991 from the University of Karlsruhe and completed his PhD in 1995 at the Technical University of Aachen. After that, he held positions at the University of Maryland, the IBM Almaden Research Center, the University of Passau, the Technical University of Munich, and the University of Heidelberg. He is a former associate editor of ACM Transactions on Databases and ACM Transactions on Internet Technology. He was a member of the board of trustees of the VLDB endowment from 2006 until 2011, and he was the program committee chair of the ACM SIGMOD Conf., 2009 and PC co-chair of VLDB 2004. He is an ACM Fellow. He has been a co-founder of three start-ups in the areas of Web data management and cloud computing.