Most professional fields, whether in business or academia, rely on data and have done so for centuries. In the digital age and with the emergence of Big Data, this dependency is growing dramatically – perhaps out of proportion to its current value given the concepts, tools, and techniques presently available. For example, how do you tell if the results of data-intensive analysis are correct and reliable and not weak or even spurious? Most data-intensive disciplines have statistical measures that attempt to calculate meaning or truth. Efficacy quantifies the strength of a relationship within a system, such as biology or business. For example, when researchers investigate a new drug, they compare its effectiveness to a placebo, using statistics to determine whether the drug worked. This approach, where data selection and processing is predicated on complex models rather than simple comparison, is a far cry from select-project-join queries.
Efficacy is the capacity to produce a desired result or effect. In medicine, it is the ability of an intervention or drug to produce an outcome. P-values have been a conventional empirical metric of efficacy for 100 years.
Moreover, the underlying data in these fields is complex, uncertain, and multimodal. Despite a large body of research data management for science applications, there has been little adoption of relational techniques in the science disciplines. In this post, we examine two challenges. First, modeling data around domain-specific efficacy rather than set theory. Second, support for ensembles of data models to enable many perspectives on a single data set.
The big picture is compelling. Since the late 1980’s one of us estimated in papers and keynotes that databases contain less than 10% of the world’s data and dropping fast as non-database data growth exploded. A corresponding fraction of the world’s applications – data and computation – are amenable to traditional databases. Modelling the 90% opens the door for the database community to the requirements of the rest of the world’s data and a new, vastly larger generation of database research and technology. This calls for a shift in our community commensurate with the profound changes introduced by Big Data.
Efficacy first, then efficiency
Since meaning and truth are relative to a system, efficacy measures are of accuracy, correctness, precision, and significance with respect to a context. That we can compute an answer efficiently – at lightning speed over massive data sets – is entirely irrelevant or even harmful if we cannot demonstrate that the answer is meaningful or at least approximately right in a given context. As fields develop and complexity increases, efficacy measures become increasingly sophisticated, refined, and debated. For example, p-values – the gold standard of empirical efficacy – have been questioned for decades, especially under the pressure of increasing irreproducibility in science. The same is true for precision and recall in information retrieval. In fact, since most fields that depend on data involve uncertainty, measures of efficacy are being questioned everywhere, with the notable exception of data management.
Big Data, broadly construed, is inherently multidisciplinary but often lacks the efficacy measures of its constituent disciplines – statistics, machine learning, empiricism, data mining, information retrieval, among others – let alone those of application domains such as finance, biology, clinical studies, high-energy physics, drug discovery, and astrophysics. One reason for this is that efficacy measures that have been developed in the small data world, based on statistics and other fields, do not necessarily hold true over massive data sets . Efficacy in this context is an important, open, and rich research challenge. The value and success of data-intensive discovery (Big Data) depends on achieving adequate means of evaluating the efficacy of its results. A notable exception is the Baylor-Watson result  that focused first on efficacy, i.e., modeling, that then contributed to efficiency. But efficacy is one aspect of a larger challenge – modeling.
Relational data is the servant of the data model and the query. It was right to constrain data when we had a well-defined model. And we could always get the model right – right?
As data management evolved it distinguished itself from information retrieval by not requiring efficacy measures since databases were bounded, discrete, and complied with well-defined models, e.g., schemas. In contrast, information retrieval (and later machine learning) searched data sets for complex correlations rather than rigidly defined predicates. Finding relationships like “select all pairs where their covariance is greater than x” are inherently iterative and compute-intensive. In contrast, the contents of a database either match a query or they do not – black or white. No need for estimating accuracy, confidence, or probabilities. Relational data is the servant of the data model and the query. This permitted massive performance improvements that led to the widespread adoption of databases in applications for which schemas made sense. If the data did not comply with the schema and the query within that, then the data was erroneous by definition and should be rejected or corrected. It was right to constrain data when we had a well-defined model. And we could always get the model right – right? Many courageous researchers over the past 50 years have studied this problem, including probabilistic databases and fuzzy logic, (and more ) but none has seen widespread adoption. Why?
While the non-database world – life sciences, high-energy physics, astrophysics, finance – opened the door to Big Data and its possibilities, the data management world is aspiring to take ownership of their infrastructure – the storage, management, manipulation, querying, and searching of massive datasets. Currently much of this work is done in an ad-hoc manner using tools like R and Python. What is required for a more general solution? The non-database world is driven by applications – solving problems with real-world constraints – achieving efficacy within the models and definitions of their domain – often with 400 or 500 years of history.
In contrast, the database landscape is predominantly concerned with efficiency and has not dealt head on with efficacy yet. Some of these issues have been addressed in the database context in terms of specific models, languages, and design, but seldom have those concerns impacted the core database infrastructure, let alone gained adoption. Perhaps database researchers focused only on application domains that are well-behaved. While efficacy is a critical requirement – possibly the most critical requirement – in domains that make extensive use of data, it is part of the broader requirement for modelling unmet by database systems.
For more than a decade physics, astrophysics, photonics, biology, indeed most physical sciences as well as statistics and machine learning have made the modest assumption that multiple perspectives may be more valuable than a single model.
Data Models for the 90%
The database community, like many others, perhaps has not fully internalized the paradigm shift from small to big data. Big Data – or data per se – does not create the change nor is itself the change. Big Data opens the door to a revolution in thinking. One aspect involves data-driven methods. A profound shift involves viewing phenomena from multiple perspectives simultaneously.
A significant aspect of this shift is that every Big Data activity (small data activities also, but with less impact) requires measures of efficacy for each perspective or model. This is not simply owing to the reframing of corresponding principles from empirical science, but also to the multiple meanings of data, each of which requires mechanisms for addressing efficacy.
If the data management community is about to provide solutions for this nascent challenge, then it will need to deal with efficacy. This essentially has to do with modelling, a chaotic and ad-hoc database topic that has been largely unsuccessful, again measured by adoption. The relational model has dominated databases for over 40 years largely owing to efficiency. The database community knows how to optimize anything expressed relationally. While the relational model has proven to be amazingly general, its adoption has been limited in many domains, especially the sciences.
A related limitation of the database world is the assumption of a single perspective, e.g., a single version of truth, one schema per database even with multiple views. For more than a decade physics, astrophysics, photonics, biology, indeed most physical sciences as well as statistics and machine learning have made the modest assumption that multiple perspectives may be more valuable than a single model.
In  the author argued that science undergoes paradigm shifts only when there are rival theories about the fundamentals of a discipline. It is his position that rival paradigms are incommensurable using entirely different concepts and evaluation metrics from one another. One such example was the wave and particle theories of light. Each has entirely different models and measures of efficacy. Understanding the big picture necessitates finding consistencies and anomalies in both theories.
Ensemble models are one approach to addressing this challenge. Let’s consider an example in evolutionary biology where researchers use a collection of models to learn about how the human genome has changed over time. In  the authors identified positive examples of natural selection in recent human populations. Their discoveries have two parts: the affected gene’s location and its (improved) mutation. By composing many signals of natural selection, the authors increase the resolution of their genomic map by up to 100x. This research computes genetic signals at many levels, from clustering genes that are likely to be inherited together to looking at the high-level geographic distribution of different mutations. In present database modeling, the former might be represented as a graph database, whereas the latter is more likely to fall into the purview of geospatial databases. How can we bring them together? Perhaps neither of these models is designed for computing how effective different genetic variations are at producing advantageous traits. This pattern repeats itself in meteorology, physics, and a myriad of other domains that mathematically model large, dynamic systems.
Stepping into the void of uncertainty, unboundedness, ensemble models, and open-ended model exploration is far harder and scarier. We call it Computing Reality
Ensemble models pose substantial challenges to the data management community. How do you simultaneously store, manage, query, and update this variety of models, applying to a single dataset with many, possibly conflicting schemas? Database folks may first be concerned about doing this efficiently. Nope – wrong question. The first step is to understand the problem, to ask the right questions, to get the model correct and only then to make it efficient. How do you support ensemble models and their requirements including efficacy?
This may be why application domains that use massive data sets have grown their own data management tools, such as Hadoop, ADAM, Wikidata, and Scientific Data Management Systems, let alone a plethora of such tools in most physical science communities that the database community has never heard of. It’s not just that their data does not fit the relational model; databases do not support ensemble models, efficacy, or many of the fundamental concepts used to understand data. Why would any application domain (e.g., physical sciences, clinical studies, drug discovery) or discipline (e.g., information retrieval, machine learning, statistics) want to partner with an infrastructure technology that did not support its basic principles?
The database community has developed amazing technology that has changed the world. Since the early 1990’s it has extended its models to non-relational models such for networks, text, graphs, arrays, and many more. But efficacy is not just an issue of expressing eScience applications relationally, as UDFs or in R, but modeling and computing hypotheses under the complex contexts defined by domain experts, none of which map easily to set theory or other discrete mathematics. Stepping into the void of uncertainty, unboundedness, ensemble models, and open-ended model exploration is far harder and scarier. We call it Computing Reality .
Big Data is opening the door to a paradigm shift in many human endeavors. Machine learning was first through the door with real, albeit preliminary, results and it is already on to the next generation with deep learning . Analytics and other domains are riding the wave of machine learning. The database community is heading for the door now, but it will be challenging. We first have to understand the problem and get the requirements right. To paraphrase Ron Fagin, we need to focus on asking the right questions. The rest may be a breeze but efficacy before efficiency!
So not only are we leaving the relational world that was dominated one model or a class of discrete models, but we are leaving the world of a single model for each dataset and embarking on a journey into a world of ensemble models of including probabilistic, fuzzy, and even potentially the richest model of them all, a model-free approach that enables us to listen to the data. All at scale. This seems scary to us but also just what we need.
Are we crazy, naive? Isn’t it our mission to dig in this data goldmine, to contribute to accelerating scientific discovery? What do you think? We are all ears.
 Duggan, Jennie and Michael L. Brodie, Hephaestus: Virtual Experiments for Data-Intensive Science, In CIDR 2015 (to appear)
 Gomes, Lee. Machine-Learning Maestro Michael Jordan on the Delusions of Big Data and Other Huge Engineering Efforts, IEEE Spectrum, 20 Oct 2014
 Grossman, Sharon R., et al. “A composite of multiple signals distinguishes causal variants in regions of positive selection.” Science 327.5967 (2010): 883-886.
 National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013
 Scott Spangler, Angela D. Wilkins, Benjamin J. Bachman, Meena Nagarajan, Tajhal Dayaram, Peter Haas, Sam Regenbogen, Curtis R. Pickering, Austin Comer, Jeffrey N. Myers, Ioana Stanoi, Linda Kato, Ana Lelescu, Jacques J. Labrie, Neha Parikh, Andreas Martin Lisewski, Lawrence Donehower, Ying Chen, and Olivier Lichtarge. 2014. Automated hypothesis generation based on mining scientific literature. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ’14). ACM, New York, NY, USA, 1877-1886. DOI=10.1145/2623330.2623667 http://doi.acm.org/10.1145/2623330.2623667
 Kuhn, Thomas S. The structure of scientific revolutions. University of Chicago press, 2012.
| Bloggers’ Profiles:
Dr. Brodie has over 40 years experience in research and industrial practice in databases, distributed systems, integration, artificial intelligence, and multi-disciplinary problem solving. He is concerned with the Big Picture aspects of information ecosystems including business, economic, social, application, and technical. Dr. Brodie is a Research Scientist, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology; advises startups; serves on Advisory Boards of national and international research organizations; and is an adjunct professor at the National University of Ireland, Galway. For over 20 years he served as Chief Scientist of IT, Verizon, a Fortune 20 company, responsible for advanced technologies, architectures, and methodologies for Information Technology strategies and for guiding industrial scale deployments of emergent technologies. His current research and applied interests include Big Data, Data Science, and data curation at scale and a related start up Tamr.com. He has served on several National Academy of Science committees. Dr. Brodie holds a PhD in Databases from the University of Toronto and a Doctor of Science (honoris causa) from the National University of Ireland.
Jennie Duggan is a postdoctoral associate at MIT CSAIL working with Michael Stonebraker and an adjunct assistant professor at Northwestern University. She received her Ph.D. from Brown University in 2012 under the supervision of Ugur Cetintemel. Her research interests include scientific data management, database workload modeling, and cloud computing. She is especially focused on making data-driven science more accessible and scalable.
Crowdsourcing is a powerful paradigm that democratizes the creation, collection, analysis, curation, and dissemination of data, contributed by millions of individuals, who are called workers. Early crowdsourcing platforms in the context of citizen reporting and citizen sciences were dedicated to specific tasks. Today, many other platforms that expect workers with different expertise exist. Among them, we can find Turkit, Mob4hire, uTest, Freelancer, eLance, oDesk, Guru, Topcoder, Trada, 99design, Innocentive, CloudCrowd, and CloudFlower. The most popular one is Amazon Mechanical Turk (AMT) that is generic and allows any requester to post virtually any kind of task in which any worker can participate via self-appointment.
Tasks can be as simple as binary requests such as recognizing a landmark in a city, or more sophisticated ones such as meta-data enrichment (picture and video tagging in Flickr or YouTube, food description in Open Food Facts), opinion solicitation (restaurants reviews), collaborative intelligence (city map alignment in http://maps.nypl.org/warper/, identifying stars in GalaxyZoo, character recognition in Recaptcha), and knowledge-intensive crowdsourcing with tasks such as collaborative editing (Wikipedia) and idea generation. The database community has been offering its own platforms (e.g., Qurk on top of AMT and Deco for query processing) and today, scientists are resorting to crowdsourcing for a particular kind of task, that is, quality experiments to validate their findings.
There are several challenges one faces in crowd-based quality experiments. Some are inherent to using a crowd. For example, worker filtering to make sure only qualified ones participate, has to be done by task requesters. Those qualification tests are often difficult to design and do not always lead to flawless experiments. Data collection to build worker profiles is also a challenge (e.g., workers are asked to rate at least 30 movies to build a profile). Other challenges depend on the actual platform. The availability of a large worker pool and the ability to easily set up virtually any task, make AMT the platform of choice for most research experiments. In fact, AMT is used by most researchers in the DB community, and tasks range from answering simple binary questions such as determining whether a tweet expresses a positive or a negative sentiment, to more sophisticated tasks that require some domain knowledge such as collaborative journalism. However, the closed nature of AMT (and of all crowdsourcing platforms we know of today) forces task requesters to find workarounds outside the system in order to select the right pool of workers and assign to them the right set of tasks (AMT is often used as a hiring platform only and workers are redirected to another website). Moreover, monetary compensation, as is the case in AMT, is not always the best choice for incentivizing workers (think open-source contributions) and tends to attract more cheaters when the reward is too high. To address that, post-quality control is enforced via majority voting or by monitoring task completion time. The bottom line is that in a volatile environment, enforcing some level of correctness is difficult, time-consuming and costly.
I have used AMT in several qualitative analyses. The first experiment was in 2009  to evaluate travel itineraries extracted from Flickr against carefully handcrafted ones. It was a large-scale experiment “to validate the wisdom of the crowd”. About 450 workers participated in the experiment and evaluated itineraries in 5 popular and geographically distributed cities (San Francisco, New York City, London, Paris and Barcelona). Half of the workers were involved in an independent study that evaluated the usefulness of our itineraries, the points of interests they contained and the landmark visit and transit times. The other half participated in a comparative study where our itineraries were put side-by-side with handcrafted ones and were preferred in a large majority of the cases. It worked so well that the Wall Street Journal, a prominent newspaper in the US, talked about it . In this experiment, special attention was dedicated to the design of qualification tests to filter out workers. We had to think carefully about a test for which answers are not readily available with simple Web search. Below is one of the tests we used.
The second time we used AMT  relied on input from 100 workers to evaluate how well different rating aggregation functions (least misery function, majority voting, pairwise disagreement, and global average) to compute movie recommendations from user groups with different characteristics (small/large groups, similar/dissimilar groups). We found several insightful and sometimes surprising results including that least misery, i.e., optimizing for the lowest rating in a group, performed better than averaging ratings and is the best aggregation function for small groups formed by similar users. Pairwise disagreement on the other hand, did very well with large groups of dissimilar users. In the first round of that experiment, we had inconsistent findings. We then looked carefully in AMT logs and realized that there were cheaters whose task completion time was unrealistic.
The latest work we used AMT for was crowdsourcing the validation of articles produced collaboratively by a crowd to another crowd . Using the crowd to validate the crowd is in fact a common practice and has shown to work well in many applications such as sentence translation by non-experts. Participating workers were instructed to join a group and together write a short summary of current world events on some topic. The qualification test included answering factual questions on those events. Of course, cheaters could use their favorite search engine to find answers to the qualification test.
I hope that the examples above illustrate both the advantage of having a generic platform that enables reaching out to a large worker pool but also showcases the challenges of using crowds for qualitative analysis. Using crowds bears similarities to running surveys and to using cohorts in medical drug trials. While there are practices on how to select such cohorts and ensure result quality (e.g., controlled/test groups, independently chosen subjects, double blind tests in medical trials), none exist for running a crowd-based quality experiments.
With my colleagues Beatrice Valeri and Shady El Bassuoni, we examined 4395 papers published in SIGMOD, PVLDB, ICDE, EDBT, ICDM, KDD, ICWSM, RecSys, and Hypertext between 2010 and 2014 (2014 statistics are partial) and found that 60 of them used crowd-based quality experiments. Among those, all of them use AMT or CrowdFlower, a layer on top of AMT. Most tasks are about collecting golden data used in evaluation and among those, labeling data is the most common. 48 out of the 60 papers have at least one author in the USA. European authors contributed to a total of 14 papers followed by Israel (6 papers), China (4 papers), Qatar and Canada (3 papers each), Singapore and Japan (2 papers each). As shown below, the number of publications that crowdsource experiments has been increasing with a worker base ranging from around 20 to 500 per experiment.
Most papers use workers’ acceptance rate (recorded in AMT based on previous tasks completed by those workers) and only a few use a thorough qualification test. Those tests are followed by a post-processing step where majority voting, manual checking (by involving another crowd), task completion time, and answer justification, are used to check workers’ input. There goes design efficiency.
What is our opportunity today?
I do not expect our community to suddenly change the course of things. I myself still run experiments on AMT and will continue to do it. Of course, there are questions related to where to start and how to maintain the platform. Crowd4U (crowd4u.org) is a good place to start. It’s an all-academic generic platform that is being developed at the Univ. of Tsukuba and that is open to the rest of the academic world. Here in Grenoble, we ran a campaign on campus where we advertised crowdsourcing and gave out cookies and drinks (not totally free!) and passers-by performed over 1600 tasks to label tweets in less than 3 hours.
You can start your own platform or join Crowd4U but keep it open! In , we argue that human factors such as worker skills, worker availability, expected wage, and ability to work together, could be accounted for to make better use of the crowd. Such parameters can only be tested in an open and generic crowdsourcing system where task assignment and skill learning algorithms can be deployed for virtually any task.
So let’s take this opportunity, have some wisdom, and build and nurture our own crowdsourcing platform(s).
 Senjuti Basu Roy, Sihem Amer-Yahia, Ashish Chawla, Gautam Das, Cong Yu: Space efficiency in group recommendation. VLDB J. 19(6): 877-900 (2010)
 Senjuti Basu Roy, Ioanna Lykourentzou, Saravanan Thirumuruganathan, Sihem Amer-Yahia, Gautam Das: Optimization in Knowledge-Intensive Crowdsourcing. CoRR abs/1401.1302 (2014)
| Blogger’s Profile:
Sihem Amer-Yahia is a 1st class CNRS (Centre National de Recherche Scientifique) Research Director at LIG (Laboratoire d’Informatique de Grenoble) in France. Sihem heads the SLIDE group (ScaLable Information Discovery and Exploitation) that sits at the intersection of large-scale data management and Web data analytics with an emphasis on the social and the semantic Web. Until July 2012, Sihem was Principal Scientist at the Qatar Computing Research Institute (QCRI) where she led a group in Social Computing and worked with local Universities on student mentoring and with Al Jazeera Online on news traffic analytics. From 2006 to 2011, she was Senior Scientist at Yahoo! Research and worked on revisiting relevance models and scalable top-k processing algorithms for Delicious, Yahoo! Travel and Personals, and Flickr. Before that, she spent 7 years at at&t Labs in New Jersey, working on XML query optimization and XML full-text search in conjunction with the W3C. Sihem has served on the SIGMOD Executive Board, is a member of the VLDB and the EDBT Endowments. She serves on the editorial boards of ACM TODS, the VLDB Journal and the Information Systems Journal. She was track chair of SIGIR 2013 and of PVLDB 2013. She is PC chair of EDBT 2014 and will be PC chair of BDA 2015 (French DB conference) and PC co-chair of SIGMOD Industrial 2015. Sihem received her Ph.D. in Computer Science from Paris-Orsay and INRIA in 1999, and her engineering degree from ESI/INI, Algeria.
Exploratory search has gained prominence in recent years. There is an increased interest from several scientific communities, from the information retrieval and database to human-computer interaction and visualization communities, in moving beyond the traditional query-browse-refine model supported by web search engines and database systems alike, and towards support for human intelligence amplification and information understanding. Exploratory search systems should inherently support symbiotic human-machine relationships that provide guidance in exploring unfamiliar information landscapes [Whi 09]. In this post, four researchers set to explore exploratory search from different angles and study the emerging needs and objectives as well as the challenges and problems that need to be tackled.
Exploratory Search in Structured Data: Data Warehouses and OLAP to the Rescue?
Looking more closely at what problems the field of exploratory search encompasses, Marchionini has established two main aspects of exploratory search that go beyond the classical problem of looking up information using classical search mechanisms [Mar 06]. More specifically, exploratory search has the general goal of (i) learning, thus acquiring new knowledge and (ii) investigating to possibly reveal new facts. Sub-tasks to achieve these goals include:
Clearly, exploratory search applies to both structured and unstructured data, as we want to leverage as much data as possible in the above tasks. However, if we focus on structured data, we notice that quite a few of the tasks mentioned above resemble those for which data warehouses and online analytical processing (OLAP) have been developed. One example of a learning task supported by data warehouses is integration, as one of the main purposes of data warehouses is to integrate data from distributed and heterogeneous sources in a single repository. Consequently, data comparison is easily supported as well. As for supporting investigation tasks, OLAP and data mining algorithms readily support analysis of data stored in a data warehouse, and reports synthesize such data. Finally, data warehouses are deployed to support decision making, such as planning and forecasting.
So are the problems of exploratory search already solved for structured data through data warehouses and OLAP?
To some extent yes but there are still many areas of exploratory search that data warehouses do not cover, such as socialization, comprehension and interpretation, and discovery. In the following, I will highlight three main areas where I believe exploratory search will require novel approaches to reach its full potential.
Discovery. Data warehouses are designed based on a predefined information demand, meaning that the analytical queries they support are fixed in advance and for sure cannot go beyond queries that the schema of the data warehouse supports. Essentially, the data warehouse is designed to efficiently answer predefined queries. Opposed to that, in exploratory search, the information demand is unknown a priori, the goal of discovery being to also reveal things that users of the system did not think about during design time and thus are not covered by the rigid schema. Hence, system design for exploratory search needs to plan for the unplanned. Making an analogy to farming, we can say that whereas data warehouses and OLAP are useful for harvesting what has been planted, exploratory search is all about exploring and laboring new land.
Adaptation. Given the rigid cage schemas impose on a data warehouse, these systems typically have limited capabilities for changing or evolving information needs. But these are the essence of exploratory search, as learning is an iterative process that requires a search-refine-expand paradigm. That is, whereas data warehouses allow us to freely move in a cage, exploratory search systems need to be able to adapt to a new and rapidly changing environment.
Data and Users. As a final distinction here, we have limited the discussion above to structured data, but obviously, a wealth of information resides in unstructured data that exploratory search should natively support. This brings us back to a long going discussion on integrating databases and information retrieval. Additionally, many limitations of the data warehouse come from the rigid schema, so could maybe NoSQL databases be of help here? Finally, data warehouses and OLAP are targeted towards expert users that know their domain, whereas exploratory search is for the masses and human-computer interaction will be key to success.
To conclude, we have seen that data warehouses may support to some extent exploratory search, but they have serious limitations when it comes to discovery, adaptability, and the support of all types of relevant data and users.
Exploratory Search and Web searching
Web searching is probably the most frequent user task in the Web, during which users typically get back a linear list of hits. In fact, web search engines are very good in ranking, and there is a plethora of ranking methods for capturing various aspects (topic relevance, authority, popularity, credibility and personalization). However, ranking is not enough. Ranking is adequate for focalized (else called precision-oriented) information needs, however the majority of information needs are recall-oriented (according to various user studies). For recall-oriented information needs, the user wants to find and understand more than one search hit. Examples include bibliographic survey writing, medical information seeking, patent search, hotel booking, and car buying. In general, recall-oriented information needs have an exploratory nature and aim at decision making (based on one or more criteria). To support such information needs, it is beneficial to provide methods that:
The last years we, in the University of Crete and FORTH-ICS, have been investigating methods that can complement the classical Web searching with functionality for exploratory search. At first, we developed the web search engine (WSE) Mitos, a system which offers faceted search over the results of the submitted queries. Specifically, it supports facets corresponding to metadata attributes of the web pages (static metadata), as well as facets corresponding to the outcome of snippet-based clustering algorithms (a kind of dynamic metadata). The user can then restrict her focus gradually, by interacting with the resulting multidimensional structure through simple clicks.
The system IOS [Faf 12] was developed next, providing analogous features but during query typing (type-ahead search). This functionality is offered for the frequent queries and takes advantage of novel partitioned trie-based indexes for reducing the increased main memory requirements. Subsequently, through XSearch [Faf 12b] we investigated and showed how the available Linked Open Data (LOD) can be exploited for offering facets based on the outcome of entity mining techniques. Again, this approach is applied at query time over the textual snippets of the search hits, where LOD provides information about the names of the entities of interest. This approach has been proved successful for professional search, specifically in the domain of patent search and marine search. With X-ENS [Faf 12c] we focused on configurability, i.e., allowing the end user to configure the entities of interest, and also browse the information that LOD provides for these entities.
However, to offer entity mining not over the snippets, but over the full contents of the search hits, more than one machines are needed for downloading and analyzing the full contents of the hits. A recent work shows how MapReduce can be used for this purpose [Kit 14].
Subsequently, we investigated how we could enable the user to easily express preferences, simple and complicated ones, over the elements the user interacts with. Such preferences can affect the ordering of the different aspects of the multidimensional (and hierarchically organized) information space. We have developed the prototype system Hippalus that realized this approach over a proposed preference framework [Pap 14].
More information about the aforementioned systems is available here.
Based on our experience, we could identify four main topics that we believe current IR/Web search does not cover, and exploratory search will require to reach its full potential:
Ubiquity. Faceted browsing of search results and gradual restriction should be possible for any kind of query, for any domain, and with no predetermined facets. In other words, methods that bypass the need for explicit configuration (regarding facets, entity types, LOD sources) are required.
Fusion of Structured and Unstructured Content. The exploitation of LOD in the exploratory search process is promising, e.g. for Named Entity Recognition and disambiguation. However, the fusion of structured and unstructured content requires more work.
User Control. Explicit, user-provided, and controllable preference management is beneficial for supporting a transparent decision making process. We believe that the framework supported by the Hippalus system is a first step towards this direction.
Evaluation. We need easy-to-follow methods for evaluating the effectiveness of exploratory search methods, and easily reproducible evaluation results. Although the classical IR has well established methodologies for evaluation, things are not so clear and straightforward in interactive IR (IIR).
Exploratory Search and Multimedia Data
By its very nature, multimedia data exploration shares the 3V challenges ([V]olume, [V]elocity, and[V]ariety ) of the so called “Big Data” applications. Systems supporting multimedia data exploration, however, must tackle additional, more specific, challenges, including those posed by the [H]igh-dimensional, [M]ulti-modal (temporal, spatial, hierarchical, and graph-structured), and inter-[L]inked nature of most multimedia data as well as the [I]mprecision of the media features and [S]parsity of the observations in the real-world.
Moreover, since the end-users for most multimedia data exploration tasks are us (i.e., humans), we need to consider fundamental constraints posed by [H]uman beings, from the difficulties they face in providing unambiguous specifications of interest or preference, subjectivity in their interpretations of results, and their limitations in perception and memory. Last, but not the least, since a large portion of multimedia data is human-centered, we also need to account for the users’ (and others’) needs for [P]rivacy.
Multimedia exploration (on data with the above characteristics and by users with the above limitations) is an inherently dynamic process, and systems for multimedia exploration must be able to support, efficiently and effectively, a continuous exploration cycle involving four key steps:
Naturally, each and every step of this multimedia exploration cycle poses significant challenges. While it would be impossible to enumerate all the challenges, I would like to identify the following five “core” challenges:
Unfortunately, while as a community we have done great advances in tackling these core challenges, we probably have to admit that we are still quite far from addressing any of these issues satisfactorily, especially within the context of large multimedia data collections. In fact, at least in the short- to medium-term future, these five issues will continue to form the core challenges in multimedia exploration and retrieval.
If there is one thing that is becoming more urgent, however, it is that the ever-increasing scale and the speed of the data implies that to support the above media exploration cycle, our emphasis must shift towards development of integrated data platforms that can support, in an optimized and scalable manner, both media analysis (feature extraction, clustering, partitioning aggregation, summarization, classification, latent analysis) and data manipulation (filtering, integration, personalized and task-oriented retrieval) operations.
Personalizing Data Exploration: Exploring the Past
Personal data is now pervasive as digital devices are capturing every part of our lives. Data is constantly collected and saved by users, either voluntarily in files, emails, social media interactions, multimedia objects, calendar items, contacts, etc., or passively by various applications such as GPS tracking of mobile devices, records of utility usage, financial transactions, or quantified self sensors. A typical user will have data recorded on many devices, cloud services, and proprietary systems; access to the data can be difficult because of the data fragmentation, security issues, or practical concerns.
Companies and organizations have famously been able to take advantage of the wealth of information produced by individuals, learning patterns and visualizing trends on a large number of data points produced by many users. Yet, individual uses do not have accessible tools to retrieve, manage and analyze their own data.
Leveraging personal data is critical to many data exploration tasks:
Exploring our past. A specific type of data exploration is re-finding [Tee 04], [Ber 08], whose goal is to find information that has been created, received, or seen by the user. Unlike traditional web search or exploration tasks whose focus is usually on discovering new information/documents, a re-finding exploration task has a target object it is attempting to recover.
Fragmentation of data is a main challenge. Users rarely own and store their personal data, with the exception of personal files. Most personal information is stored in the cloud by commercial companies that may offer some limited access, usually via a web browser or an API, to a user’s personal data. Attempting to retrieve and cross-reference personal information then leads to a tedious, often maddening, process of individually accessing all the relevant sources of data and manually linking their information. For example, checking that an insurance claim was correctly processed may require looking at a calendar application to find the doctor’s appointment date, checking the claim status on the insurance web site, and consulting a bank web portal to confirm that the payment was received.
The future of personal data exploration lies in the creation of personal data exploration tools, in the spirit of Dataspaces [Blu 07], [Hal 06], that will integrate personal data from a variety of sources, and allow users to visualize and search through their digital memories based on any piece of information they remember, following threads of information to navigate from one memory to the next.
Exploring our social data. Personal data is not limited to a user’s own data production. In an increasingly connected world, what other users in our social network share with us is also relevant to our data exploration. Users looking for book recommendations may want to explore their social network connections for books they have read or discussed. Some desktop search systems, such as deskWeb [Zer 10], have integrated the user’s social network graph, expanding the searched data set to include information available in the social network.
Personalizing our data exploration. Users have individual habits and patterns, as well as different types of data, context and information needs. By inferring knowledge from their past personal information and their interactions with their data, we can personalize data exploration and fit it to the users’ individual needs.
Social network information can be leveraged to learn contextual information about a user and guide data exploration. For instance, “Student Center” has different meaning for different groups of users. Knowing which social group the user belongs to can help us identify entities, e.g., observing that a majority of the user’s friends go to Rutgers University and attend events at “Busch Campus Student Center, Rutgers University”, increases the likelihood that the phrase “Student Center” in the user’s data and queries refers to that particular student center. Similarly, past queries can be used to learn some information about the type of the entities present in the personal data, or about the personal context.
This type of personalized data exploration is gaining traction in prospective memory applications, such as Google Now, which focus on reminding users of their appointments and to-do items.
Overall, understanding and leveraging personal data is crucial for many data exploration tasks, either because the exploration is on the personal information itself, or because personal information can give significant insights and directions for traditional data exploration needs.
| Blogger Profiles:
Melanie Herschel is an Associate Professor in the Computer Science Department of the University of Paris South and is also member of the INRIA Oak project. Before this, after receiving her Ph.D. in Computer Science from Humboldt-University Berlin (2008), she was a postdoctoral researcher at IBM Research – Almaden (2008 – 2009) and at the University of Tübingen, Germany (2009 – 2011). During her research career, she has focused on topics related to data integration, data quality, data provenance, and, more recently, on processing large amounts of Web Data. Melanie participates in several nationally funded research projects and is an active member of the database research community, as she has served on numerous program committees, is a frequent journal referee, and has organized international workshops.
Yannis Tzitzikas is Assistant Professor in the Department of Computer Science of the University of Crete and Research Associate of FORTH-ICS. His research interests fall in the intersection of the following areas: Information Systems, Information Indexing and Retrieval, Conceptual Modeling, Knowledge Representation and Reasoning. He is one of the editors (and author of several chapters) of the book “Dynamic Taxonomies and Faceted Search” (Springer 2009), he is currently involved in the ongoing EU projects i-Marine, SCIDIP-ES, Aparsen NoW (WP leader), and he is national contact of the MUMIA (Multilingual and Multifaceted Interactive Information Access) Cost Action. His current research revolves around Faceted Interactive Search, and lately on methods that can bridge the world of documents with the world of data at search time. He has published more than 80 papers in refereed international conferences and journals (including ACM Transactions on the Web, VLDB Journal, Journal on Data Semantics, International Semantic Web Conference (winner of best paper award), ESWC, and many others).
K. Selçuk Candan is a Professor of Computer Science and Engineering at the Arizona State University. He has published over 170 journal and peer-reviewed conference articles, one book on multimedia retrieval, and 16 book chapters. He has 9 patents. Prof. Candan served as an associate editor of one of the most respected database journals, the Very Large Databases (VLDB) journal. He is also in the editorial board of the IEEE Transactions on Multimedia and the Journal of Multimedia. He has served in the organization and program committees of various conferences. In 2006, he served as an organization committee member for SIGMOD’06, the flagship database conference of the ACM. In 2008, he served as a PC Chair for another leading, flagship conference of the ACM, this time focusing on multimedia research (MM’08). More recently, he served as a program committee group leader for ACM SIGMOD’10. He also serves in the review board of the Proceedings of the VLDB Endowment (PVLDB). In 2011, he served as a general co-chair for the ACM MM’11 conference. In 2012 he served as a general co-chair for ACM SIGMOD’12. He has successfully served as the PI or co-PI of numerous grants, including from the National Science Foundation, Air Force Office of Research, Army Research Office, Mellon Foundation, HP Labs, and JCI. He also served as a Visiting Research Scientist at NEC Laboratories America for over 10 years. He is a member of the Executive Committee of ACM SIGMOD and an ACM Distinguished Scientist.
Amélie Marian is an Associate Professor in the Computer Science Department at Rutgers University. Her research interests are in Personal Information Management, Ranked Query Processing, Semi-structured data and Web data Management. Amélie received her Ph.D. in Computer Science from Columbia University in 2005. From March 1999 to August 2000, she was a member of the VERSO project at INRIA-Rocquencourt. She is the recipient of a Microsoft Live Labs Award, three Google Research Awards, and an NSF CAREER award.
Is Query Optimization a “solved” problem? If not, are we attacking the “right” problems? How should we identify the “right” problems to solve?
I asked these same questions almost exactly 25 years ago, in an extended abstract for a Workshop on Database Query Optimization that was organized by the then-Professor Goetz Graefe at the Oregon Graduate Center [Grae 89a]. Remarkably and quite regrettably, most of the issues and critical unsolved problems I identified in that brief rant remain true today. Researchers continue to attack the wrong problems, IMHO: they attack the ones that they can, i.e., that they have ideas for, rather than the ones that they should, i.e., that are critical to successfully modeling the true cost of plans and choosing a good one. Perhaps more importantly, that will avoid choosing a disastrous plan! At the risk of repeating myself, I’d like to re-visit these issues, because I’m disappointed that few in the research community have taken up my earlier challenge.
The root of all evil, the Achilles Heel of query optimization, is the estimation of the size of intermediate results, known as cardinalities. Everything in cost estimation depends upon how many rows will be processed, so the entire cost model is predicated upon the cardinality model. In my experience, the cost model may introduce errors of at most 30% for a given cardinality, but the cardinality model can quite easily introduce errors of many orders of magnitude! I’ll give a real-world example in a moment. With such errors, the wonder isn’t “Why did the optimizer pick a bad plan?” Rather, the wonder is “Why would the optimizer ever pick a decent plan?”
“Well,” you say, “we’ve seen lots of improvements in histograms, wavelets, and other statistics since 1989. Surely we do better now.” There’s been no shortage of such papers, it’s true, but the wealth of such papers precisely illustrates my point. Developing new histograms that improve selectivity estimation for individual local predicates of the form “Age BETWEEN 47 AND 63” by a few percent doesn’t really matter, when other, much larger errors that are introduced elsewhere in cardinality estimation dwarf those minor improvements. It’s simple engineering, folks. If I have to review one more such paper on improved histograms for local predicates, I’ll scream (and reject it)! It just doesn’t matter! What we have now is good enough.
What still introduces the most error in cardinality estimation is (a) host variables and parameter markers, (b) the selectivity of join predicates, and, even more significantly, (c) how we combine selectivities to estimate the cardinality. Amazingly, these three topics also have enjoyed the least research attention, or at least the fewest number of papers attempting to solve them, unless I’ve missed some major contributions lately. I’ll visit each of these topics in turn, describing the fundamental causes of errors, and why those errors can easily reach disastrous proportions, illustrated by war stories from real customer situations.
Host variables and parameter markers
One of our major ISVs retrofitted a table with a field that identified which subsystem each row came from. It had 6 distinct values, but 99.99% of the rows had the value of the founding subsystem, i.e., when there was only one. A predicate on this subsystem column was added to every query, with the value being passed as a host variable. Not knowing that value a priori, DB2’s optimizer used the average value of 1/|distinct values| = 0.167, though that predicate’s true selectivity was usually 0.9999 (not selective at all) and once in a blue moon was 0.0001 (extremely selective).
There has been some work on this so-called Parametric Query Optimization (PQO), though it’s sometimes attacking the problem of other parameters unknown at compilation time (e.g. the number of buffer pages available) or limited to discrete values [Ioan 97]. One of my favorites is a fascinating empirical study by Reddy and Haritsa [Redd 05] of plan spaces for several commercial query optimizers as the selectivity of multiple local predicates are varied. It demonstrated (quite colorfully!) that regions in which a particular plan is optimal may not be convex and may even be disconnected! Graefe suggested keeping a different plan for each possible value of each host variable [Grae 89b], but with multiple host variables and a large number of possible values, Graefe’s scheme quickly gets impractical to optimize, store, and decide at run-time among the large cross-product of possible plans, without grouping them into regions having the same plan [Stoy 08].
Version 5 of DB2 for OS/390 (shipped June 1997) developed a practical approach to force re-optimization for host variables, parameter markers, and special registers by adding new SQL bind options REOPT(ALWAYS) and REOPT(ONCE). The latter re-optimizes the first time that the statement is executed with actual values for the parameters, and assumes that these values will be “typical”, whereas the former forces re-optimization each time the statement is run. Later, a REOPT(AUTO) option was added to autonomically determine if re-optimization is needed, based upon the change in the estimated filter factors from the last re-optimization’s plan.
Selectivity of join predicates
For example, suppose a fact table of Transactions has dimensions for the Products, Stores, and Dates of each transaction. Though current methods provide accurate selectivity estimates for predicates local to each dimension, e.g., ProductName = ‘Dockers’ and StoreName = ‘San Jose’ and Date = ’23-Feb-2013’, it is impossible to determine the effect of the intersection of these predicates on the fact table. Perhaps the San Jose store had a loss-leader sale on Dockers that day that expired the next day, and a similar sale on some other product the next day, so that the individual selectivities for each day, store, and product appear identical, but the actual sales of Dockers on the two days would be significantly different. It is the interaction of these predicates, through join predicates in this case, that research to date doesn’t address. This leads naturally to the final and most challenging problem in our trifecta.
Correlation of columns
My favorite example, which occurred in a customer database, is Make = ‘Honda’ and Model = ‘Accord’. To simplify somewhat, suppose there are 10 Makes and 100 Models. Then the independence (and uniformity) assumption gives us a selectivity of 1/10 * 1/100 = 0.001. But since only Honda makes Accords, by trademark law, the real selectivity is 0.01. So we will under-estimate the cardinality by an order of magnitude. Such optimistic errors are much worse than pessimistic over-estimation errors, because they cause the optimizer to think that certain operations will be cheaper than they really are, causing nasty surprises at run time. The only way to avoid such errors is for the database administrator to be aware of the semantic relationship (a functional dependency, in this case) between those two columns and its consequences, and to collect column group statistics, as DB2 and other database products now allow.
To identify these landmines in the schema automatically, Stillger et al. [Stil 01] developed the LEarning Optimizer (LEO), which opportunistically and automatically compared run-time actual cardinalities to optimizer estimates, to identify column combinations exhibiting such correlation errors. Ilyas et al. [Ilya 04] attacked the problem more pro-actively in CORDS (CORrelation Detection by Sampling), searching somewhat exhaustively for such correlations between any two columns in samples from the data before running any queries. And Markl and colleagues [Mark 05], [Mark 07] have made ground-breaking advances on a consistent way to combine the selectivities of conjuncts in partial results.
All great progress on this problem, but none yet solves the problem of redundant predicates that can be inadvertently introduced by the query writer who typically believes that “more is better”, that providing more predicates helps the DBMS do its job better – it’s American as Apple Pie! Let me illustrate with one of my favorite war stories.
At a meeting of the International DB2 User’s Group, a chief database administrator for a major U.S. insurance company whom I’d helped with occasional bad plans asked me to conduct a class on-site. I suggested it include an exercise on a real problem, unrehearsed. After my class, she obliged me by presenting two 1-inch stacks of paper, each detailing the EXPLAIN of a plan for a query. I feared I was going to embarrass myself and fail miserably under the gun. The queries differed in only one predicate, she said, but the original ran in seconds whereas the one with the extra predicate took over an hour.
I instinctively examined first the cardinality estimates for the results of the two, and the slower one had a cardinality estimate 7 orders of magnitude less than the fast one. When asked what column the extra predicate was on, my host explained that it was a composite key constructed of the first four letters of the policy-holder’s last name, the first and middle initials, the zip code, and last four digits of his/her Social Security Number. Did the original query have predicates on all those columns? Of course! And how many rows were there in the table? Ten million. Bingo! I explained that that predicate was completely redundant of the others, and its selectivity, 1/107, when multiplied by the others, underestimated the cardinality by 7 orders of magnitude, wreaking havoc with the plan. It took me maybe 5 minutes to figure this out, and I was immediately dubbed a “genius”, but it really was straightforward: the added predicate might help the run-time, especially if they had an index on that column, but it totally threw off the optimizer, which couldn’t possibly have detected that redundancy without LEO or CORDS.
So c’mon, folks, let’s attack problems that really matter, those that account for optimizer disasters, and stop polishing the round ball.
Disclaimer: The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.
| Blogger’s Profile:
Dr. Guy M. Lohman is Manager of Disruptive Information Management Architectures in the Advanced Information Management Department at IBM Research Division’s Almaden Research Center in San Jose, California, where he has worked for 32 years. He currently manages the Blink research project, which contributed BLU Acceleration to DB2 for Linux, UNIX, and Windows (LUW) 10.5 (GA’d 2013) and the query engine of the IBM Smart Analytics Optimizer for DB2 for z/OS V1.1 and the Informix Warehouse Accelerator products (2007-2010). Dr. Lohman was the architect of the Query Optimizer of DB2 LUW and was responsible for its development from 1992 to 1997 (versions 2 – 5), as well as the invention and prototyping of Visual Explain, efficient sampling, the DB2 Index Advisor, and optimization of XQuery queries in DB2. Dr. Lohman was elected to the IBM Academy of Technology in 2002 and made an IBM Master Inventor in 2011. He was the General Chair for ACM’s 2013 Symposium on Cloud Computing and is the General Co-Chair of the 2015 IEEE International Conference on Data Engineering (ICDE). Previously, he was on the editorial boards of the “Very Large Data Bases Journal” and “Distributed and Parallel Databases”. He is the author of over 75 papers in the refereed academic literature, and has been awarded 39 U.S. patents. His current research interests involve disruptive machine architectures for Business Intelligence, advanced data analytics, query optimization, self-managing database systems, information management appliances, database compression, and autonomic problem determination.
After hanging out exclusively with the database community for 35+ years, I’ve recently become more involved with the systems research community. I have a few observations and recommendations to share.Much of the work published in systems conferences covers topics that would have a natural home in database conferences. For example, transactions and data streams are currently in vogue in the systems field. Here are some recent Best-Paper-award topics: data streams at SOSP 2013, data-parallel query processing at OSDI 2008, and transactions OSDI 2012 and SOSP 2007. Although this topic overlap is increasing lately, it is not just a recent phenomenon. In the last 10+ years, the systems community has taken an interest in data replication, fault tolerance, key-value stores, data-parallel processing (i.e., map-reduce), query processing in sensor networks, and record caching. Some of this work has had a major impact on the database field, e.g., map-reduce and multi-master replication.Yet despite this overlap of topics, there has been remarkably little attendance of database researchers at systems conferences or of systems researchers at database conferences. Not a good thing. And until 2013, I’ve been part of the problem.
To help break down this barrier, SIGMOD and SIGOPS have sponsored the annual Symposium on Cloud Computing (SoCC) since 2010. Personally, I’ve found attending SoCC’s to be time well spent, and I had a good experience attending ICDCS 2013 in July. So I decided to attend my first SOSP in October, 2013. It was great. The fraction of that program of direct interest to me was as large as any database conference, including work on transactions, data streams, replication, caching, fault tolerance, and scalability. I knew there wouldn’t be a huge number of database attendees, but I wasn’t prepared to be such an odd duck. At most a dozen attendees out of 600 were card-carrying members of the database community. With few database friends to hang out with, I got to meet a lot more new people than I would at a database event and hear about projects in my areas that I’d wouldn’t otherwise have known about. Plus, I could advertise my own work to another group of folks.
If you do systems-oriented database research, then how about picking one systems conference to attend this year? NSDI in Seattle in April, ICDCS in Madrid in late June, or OSDI in Colorado in October. And if you see a systems researcher looking lonely at a database conference, then strike up a conversation. If they feel welcome, perhaps they’ll tell others and we’ll see more systems attendees at database events.
Given the small overlap of attendees in systems and database conferences, we shouldn’t be surprised that the communities have developed different styles of research papers. Since systems papers typically describe mechanisms lower on the stack, they often focus on broader usage scenarios. They value the “ities” more than database papers, e.g., scalability, availability, manageability, and security. They favor simple ideas that work robustly over complex techniques that report improvements for some inputs. They expect a paper to work through the system details in a credible prototype, and to report on lessons learned that are more broadly applicable. They expect more micro-benchmarks that explain the source of performance behavior that’s observed, rather than just benchmarks that model usage scenarios, such as TPC.
From attending systems conferences, serving on SoCC program committees (PC’s), and submitting an ill-fated paper to a systems conference, I learned a few things about systems conferences that we database folks could learn from:
1. More of their conferences are single-track, and they leave more time for Q&A. This leads to more-polished presentations. When you’re presenting a paper to 600 attendees, doing it badly is a career-limiter.
I’ve always liked CIDR, not only because of its system-building orientation, but also because it’s single-track. I’m forced to attend sessions on topics I normally would ignore, and hence I learn more. I think we should try making one of our big conferences single-track: SIGMOD, VLDB or ICDE. One might argue that too many excellent papers are submitted, so it’s impractical. I disagree. Here’s one way to do it: Have the same acceptance rate as in the past, and all papers are published as usual. However, the PC selects only one-track’s worth of papers for presentation slots. The other papers are presented in poster sessions that have no competing parallel sessions. The single-track enables us to learn about more topics, the density of great presentations is higher, and the poster sessions enable us to dig deep on papers of interest to us (as some of our DB conferences already enable us to do). Unless our field stops growing, we really have to do something like this. Our conferences have already gotten out of hand with as many as seven parallel sessions.
2. When describing related work, systems papers cast a wider net. Their goal is often not simply to demonstrate that the paper’s contributions are novel, but also to educate the reader about lines of work that are loosely related to the paper’s topic. To help enable this breadth, they’ve recently settled on a formula where references don’t count toward the paper’s maximum page count. We should adopt this view in the database community. One self-serving benefit is that papers would cite more references, which would increase all of our citation counts.
3. There’s a widespread belief that systems PC’s write longer, more constructive reviews. I think DB PC’s have been improving in this respect, but on average we’re not up to the systems’ standard.
4. They usually shepherd all papers, to ensure that authors incorporate recommended changes. This also gives the authors someone to ask about ambiguous recommendations. We should do this too.
5. They typically require 10pt font. As a courtesy to those of us over a certain age, whose eyes aren’t what they used to be, we should do this, and increase the page count to maintain the same paper length.
6. At SOSP and OSDI, they post all slide decks and videos of all presentations. Obviously worthwhile. We sometimes post slide decks and videos only for plenaries. Let’s do better.
There are a few aspects of systems conferences that I’m less enthusiastic about.
1. Systems conferences favor live PC meetings. They do have benefits. It’s educational for PC members to hear discussions of papers they didn’t review and for young researchers to see how decisions are reached. Since all PC members hear every paper discussion, non-reviewers can bring up points that neutralize inappropriate criticisms or raise issues that escaped reviewers’ attention, which reduces the randomness of decisions. However, there are also negatives. In borderline cases, the most articulate, quick-thinking, and extroverted PC members have an inappropriate advantage in getting their way. And inevitably, some PC members can’t attend the meeting, due to a schedule conflict or the travel expense, so the papers they reviewed get short shrift.
On balance, I don’t think the decisions produced by live PC meetings are enough better than on-line discussions to be worth what they cost. Perhaps we can get some of the benefits of a PC meeting by allowing PC members to see the reviews and discussions of all papers for which they don’t have a conflict, late in the discussion period (which we did some years ago). With large PC’s, this might constitute public disclosure, in which case patents would have to be filed before submission, which will delay the publication of some work. But if we think a broader vetting of papers is beneficial, this might be worth trying.
2. During the reviewing process of a systems conference, if a paper seems borderline, then the PC chairs solicit more reviews. These reviews do offer a different perspective on the interest-value of the work. But they are usually not as detailed as the initial ones and may be by less-expert reviewers. It’s common to receive 5 or 6 reviews of a paper submitted to a systems conference. As an author it’s nice to get this feedback. And it might reduce the randomness of decisions. But it significantly increases the reviewing load on PC’s. In database PC’s, we rarely, if ever, do this. In my opinion, it’s usually better to hold reviewers’ feet to the fire and make them decide. Still, we’d benefit from getting 4 or 5 reviews more often than never, but not as often as systems conferences.
3. Systems conferences publish very few industry papers and rarely have an industry track. In my experience, an industry paper is judged just like a research paper, but with a somewhat lower quality bar. This is unfortunate. Researchers and practitioners benefit from reading about the functionality and internals of state-of-the-art products, even if they are only modestly innovative, especially if they are widely used. The systems community would serve their audience better by publishing more such papers.
4. The reviewing load for systems PC’s is nearly double that of DB PC’s, so systems PC’s are proportionally smaller. E.g., SOSP 2013 had 160 submissions and 28 PC members. If each paper had three reviews, that’s 17 reviews per PC member. With five reviews/paper, that’s 29 reviews per PC member. If DB conferences cranked up the reviewing load, then we’d probably participate in fewer PC’s, so the workload on each of us might be unchanged. A PC member would see more of the submissions in each area of expertise, which might help reduce the randomness of decisions … maybe. Overall, I don’t see a compelling argument to change, but it’s debatable.
5. Like some DB conferences, many systems conferences have adopted double-blind reviewing. I am not a fan. For papers describing a system project, I find it hard to review and, in some cases, impossible to write a paper for a double-blind conference. I have chosen not to submit some papers to double-blind conferences. I know I am not alone in this.
In summary, the systems and DB areas have become much closer in recent years. Each community has something to learn from the other area’s technical contributions and point-of-view and from its approach to conferences and publications. We really would benefit from talking to each other a lot more than we do.
Acknowledgments: My thanks to Gustavo Alonso, Surajit Chaudhuri, Sudipto Das, Sameh Elnikety, Mike Franklin, and Sergey Melnik for contributing some of the ideas in this blog post.
| Blogger’s Profile:
Philip A. Bernstein is a Distinguished Scientist at Microsoft Research. Over the past 35 years, he has been a product architect at Microsoft and Digital Equipment Corp., a professor at Harvard University and Wang Institute of Graduate Studies, and a VP Software at Sequoia Systems. During that time, he has published papers and two books on the theory and implementation of database systems, especially on transaction processing and data integration, which are still the major focus of his research. He is an ACM Fellow, a winner of the ACM SIGMOD Innovations Award, a member of the Washington State Academy of Sciences and a member of the National Academy of Engineering. He received a B.S. degree from Cornell and M.Sc. and Ph.D. from University of Toronto.
In 2011, I was traveling on the Kolkata Metro when I made an interesting observation. Nearly every passenger around me was using a computing device of some sort. Some were playing with their smartphones, some with their tablets, some listening to music on their iPod Touches. Each was a consumer of structured data of some sort; interacting with their email, Facebook, or music catalog. Weeks later, while on a flight, this pattern came up again — every seat in the plane was equipped with a touch-driven entertainment system, some used while sifting through book pages on e-Readers. Clearly, the notion of a “computer” for the future generation looks very different from the past!
Over time, we seem to be converging to a vision where we are surrounded by computing devices of varying capacities, with users performing both simple and complex tasks, with a common attribute — these devices do not possess keyboards. In 2011, smartphones and tablets outsold workstations and portable computers by 1.5x. In 2013, this ratio is projected to reach 4x. Based on this trend, non-keyboard interaction (typically in the form of gestures) is on-track to be the dominant mode of data interaction.
Gesture-driven applications pose a very different workload to underlying databases than traditional ones. Consider the idea of scrolling through a page of text, which is typically implemented as a
Another example of an interesting gestural interaction is that of the Interactive Join. As part of the GestureDB system we are developing in my group, the GestureQuery multitouch interface lets you compose two relations into an equijoin by simply dragging them together on the specific attributes:
In the interactive join, assuming all attributes are of the same data type, there could be M x N possible
The following interactive demo lets you play around with a real multitouch interaction trace from one of our user studies, collected from interactions with our GestureQuery iPad prototype2. The subject was asked to perform an equijoin
This interaction is an great example of query intent transition. As the gesture is being articulated, the distribution changes rapidly. The “Proximity Score” is the score for
Just like the previous example, challenges arise here. Can a database provide join results at such low latency for such a diverse number of queries being fired at the same time, given such rapidly changing probabilities? It is not practical for the database to execute them all in parallel. Given that we know the domain of possible queries, is there scope for multiquery optimization? Second, given an interactive threshold of say 100ms for the results, is it feasible (or even useful) to generate full results for all queries? Should we consider partial query execution? Or can we simply sample the tables to provide interactive (albeit inaccurate) results? Third, feedback for a query is useful only for a fleeting moment during the gesture articulation. How could we show parts (i.e. rank tuples) of the result that are most useful?
Clearly, when working with gesture-driven interfaces and application layers, the gestural workloads of database queries pose a smorgasbord of challenges, that might require rethinking of the entire database stack. As these interfaces evolve, the underlying database systems will have to keep up with changes in user expectations and demands. With the explosion of new and novel gestural interfaces, natural interfaces and casual computing lately, it’s an exciting time for database research!
| Blogger’s Profile:
Arnab Nandi is an Assistant Professor of Computer Science & Engineering at The Ohio State University. Arnab’s research is in the area of database systems, focusing on challenges in big data analytics and data interaction. The goal of his group’s work is to allow end-users to explore and interact with semi-structured and structured data. Prior to joining Ohio State, Arnab did his Ph.D. at the University of Michigan, Ann Arbor, with stints at Google, Yahoo! Research, and Microsoft Research.
“Both theoretical and empirical research may be unnecessarily complicated by failure to recognize the effects of heterogeneity” – Vaupel & Yashin
Big Data is daily topic of conversation among data analysts, with much said and written about its promises and pitfalls. The issue of heterogeneity, however, has received scant attention. This is unfortunate, since failing to take heterogeneity into account can easily derail the discoveries one makes using these data.
This issue, which some may recognize as an example of ecological fallacy, first came to my attention via a paper elegantly titled “Heterogeneity’s ruses: some Surprising Effects of Selection on Population Dynamics” (Vaupel and Yashin, 1985). Authors discuss a variety of examples where the aggregated behavior of a heterogeneous population, composed of two homogeneous but differently behaving subpopulations, will differ from the behavior of any single individual. Consider the following example. It has been observed that the recidivism rate of convicts released from prison declines with time. A natural conclusion one may reach from this observation is that former convicts are less likely to commit crime as they age. However, this is false. In reality, there may be two groups of individuals “reformed” and “incorrigible” with constant – but different – recidivism rates. With time, there will be more “reformed” individuals left in the population, as the “incorrigibles” are sent back to prison, resulting in decreasing recidivism rate for the population as a whole. This simple example shows that “the patterns observed [at population level] may be surprisingly different from the underlying patterns on the individual level. Researchers interested in uncovering these individual patterns, perhaps to help develop or test theories or to make predictions, might benefit from an “understanding of heterogeneity’s ruses.” (Vaupel & Yashin)
My colleagues and I have been tricked by heterogeneity time and again. As one example, our study of information spread on the follower graphs of Twitter and Digg revealed that it was surprisingly different from the simple epidemics that are often used to model information spread. In a simple epidemic, described, for example, as an independent cascade model, the probability of infection increases monotonically with the number of exposures to infected friends. This probability is measured by the exposure response function. The figure below shows the exposure response function we measured on Twitter: the probability for becoming infected (i.e., retweet) information (a URL) on Twitter as a function of how many friends had previously tweeted this information. In contrast to epidemics, it appears as though repeated exposure to information suppresses infection probability. We measured an even more pronounced suppression of infection on Digg [Ver Steeg et al, 2011], and a similar exposure response was observed for hashtag adoption following friends’ use of them [Romero et al, 2011].
It is easy to draw wrong conclusions from this finding. In “What stops social epidemics?” [Ver Steeg et al, 2011], we reported that information spread on Digg is quickly extinguished, and attributed this to the exposure response function. We speculated that initial exposures “inoculate” users to information, so that they will not become infected (i.e., propagate it) despite multiple exposures. Now we know this explanation was completely wrong.
The exposure response function, while aggregated over all users, does not describe the behavior of any individual Digg or Twitter user – even the hypothetical “typical” user. In fact, there is no “typical” Twitter (or Digg) user. Twitter users are extremely heterogeneous. Separating them into more homogeneous sub-populations reveals a more regular pattern. Figure 2 shows the exposure response function for different populations of Twitter users, separated according to the number of friends they follow (large fluctuations are the result of small sample size). Why number of friends? This is explained in more detail in our papers [Hodas & Lerman 2012, 2013], but in short, we found it useful to separate users according to their cognitive load, i.e., the volume of information they receive, which is (on average) proportional to the number of friends they follow [Hodas et al, 2013]. Now, the probability that a user within each population will become infected increases monotonically with the number of infected, very similar to the predictions of the independent cascade model.
Figure 2 has a different, more significant interpretation, with consequences for information diffusion. It suggests that highly connected users, i.e., those who follow many others, are less susceptible to becoming infected. Their decreased susceptibility in fact explains Figure 1: as one moves to the right of the exposure response curve, only the better connected, and less sensitive, users contribute to that portion of the response. However, despite their reduced susceptibility, highly connected users respond positively to repeated exposures, like all other users. You do not inhibit response by repeatedly exposing people to information. Instead, the reason that these users are less susceptible hinges on the human brain’s limited bandwidth. There are only so many tweets any one can read, the more tweets you receive (on average proportional to the number of friends you follow), the less likely you are to see – and retweet – any specific tweet. If it was not for recognizing heterogeneity, we would not have found this far more interesting explanation.
| Blogger’s Profile:
Kristina Lerman is a Project Leader at the University of Southern California Information Sciences Institute and holds a joint appointment as a Research Associate Professor in the USC Computer Science Department. After a brief stint as a theoretical roboticist, she found her calling in blending together methods from physics, computer science and social science to address problems in social computing and social media analysis. She writes many papers that are greatly enjoyed by all of their twenty readers.
The title I chose is triggered by the lack of concern shown by computer scientists when public disclosures that companies like Apple, Google, HP own pay amazingly low taxes. These companies are derided in the press [look here for references]. The critiques provide little background, and often just repeat assertions that it’s all perfectly legal. Why? Corporate taxes are levied on reported US income, and their income is a share of the sales of software and devices incorporating software. That share can be greatly reduced by showing that the rights to the intellectual property (IP), embodied in the products, are held offshore. All multinational computer enterprises can and actually do avoid much taxation by having transferred a large fraction of the rights to their IP to shell companies located in taxhavens, typically subsidiaries of their offshore operations. The true value of such transfers is hard to assess, especially if the IP transfers are made before the products are offered for sale.
Since the IP in their products comprises most of the value of the revenues, most of the profits from selling or exploiting software now flow into taxhavens. Other costs of production, as assembling hardware, duplicating CDs, or shipping products over the Internet are commodity operations, adding little value. At the same time, the actual intellectual property, and the employees that create and improve that IP are praised, as in Apple’s label “Designed in California”, while its valuation is left to accountants, without input by the creators.
It is true that the value of intellectual capital, the creators and the IP they generate, is hard to measure. But being hard has not stopped innovators from trying to solve a problem, and the principles are simple, the financial value of any good is what people are willing to pay for it in the future, discounted to today.
The low financial value assessed to IP transfers to taxhavens mirrors what bookkeepers do. The reported book value of a company is obtained by adding up all visible assets and the financial capital resources of a company, both in the US and held offshore. Investors know better. Shareholders value the company at a market capitalization, the product of the share price and the number of shares in the market. Those market values are 3 to 6 times more than the book values of modern companies. To a company’s shareholders it may not matter that the financial capital and IP rights are held in offshore mailbox accounts, they care about future income. The total amount of capital held offshore by high-technology multinationals is now over 10% of the annual U.S. GDP. Some large companies hold enough that just the taxes they avoided could pay for a year of NSF and DARPA grants combined.
Many computer scientists believe that software – their output – should be a free and common good. But only a fraction of them work pro bono or for organizations supported by grants. The remainder have not convinced the companies they work for to forego income. Avoiding taxes not only a US issue, taxes on IP-based income are avoided worldwide. Operations of multinationals in India or Ireland do not generate profits due to the intellectual capital employed locally either, those rights are also allocated to taxhaven accounts. For instance the low 12.5% Irish tax rate is only applied to a, say 5%, commodity markup of products assembled there. Having good jobs also in countries that want to move up the economic ladder is, of course, a complementary and worthwhile objective.
Should computer professionals care about the exploitation of their work? Other creative disciplines are aware of the value and dispersion of their work, certainly architects, automobile designers, etc. Those professionals do not expect that the results of their work should be free. Nor do they expect that that their employers should escape taxation when the outcome of their work is sold.
Computer scientists do lobby for better and wider education, for computers for every student, for research funding for their projects – all to be paid by taxes on everything and everybody except their products. Government initiatives to help innovation are now hard to motivate, since the models used by economists ignore the contributions of intellectual capital, while focusing on interest rates and incentives that were meaningful 50 years ago.
The SIGMOD blog is not intended for any ongoing discussions, but comments and feedback will be welcome, and indicate that some of us care.
| Blogger’s Profile:
Gio Wiederhold is professor emeritus of Computer Science at Stanford University, with courtesy Appointments in Electrical Engineering and Medicine. He has over 400 publications in these fields. Gio holds a PhD from the University of California, San Francisco and an honorary D.Sc. from the National University of Ireland, Galway. He has been elected as a fellow of the ACM, the IEEE, and the ACMI.
Gio was born in Italy and lived in Germany and The Netherlands. He emigrated to the United States in 1958. He worked sixteen years in industry as a programmer, software designer, manager, and division director. He became a professor at Stanford in 1976, advising PhD students in several departments, integrating concepts from multiple disciplines. His students now have responsible positions in finance, academia, and industry. He currently teaches two courses at Stanford: Business on the Internet and Software Economics.
Breaks from Stanford include tours as visiting professor at IIT Kanpur in India, at EPFL in Switzerland, representing UNDP at projects in Asia, as a researcher at IBM Germany, and as program manager at DARPA of the US Department of Defense, where he initiated the Intelligent Integration of Information (I3) program. I3 supported innovative programs. as the Digital Library, leading to several significant Internet applications and companies, including Google. For the last 10 years Gio consulted exclusively for the US Treasury on technical and business aspects of valuation of intellectual property. He recently published a book: Valuing Intellectual Capital, Multinationals and Taxhavens; series Management for Professionals, Springer Verlag, New York , August 2013, integrating technical, business, and economic concepts to allow consistent valuation of IP while demonstrating the problems due to inadequate valuation of rights to high-technology exports, as described in an earlier paper with Amar Gupta, and Erich Neuhold: “Offshoring and Transfer of Intellectual Property”; Information Resources Management Journal (IRMJ) Vol.23 No.1, January-March 2010, pp.74-93.
Like all computer science and engineering departments, at Arizona State University (ASU), we regularly re-assess the impacts of our research and educational programs based on feedback from users of the technologies we develop and companies that hire our graduates.
Last January, as part of the most recent of these efforts, we (in collaboration with IBM) organized a Data Services and Analytics Roundtable at ASU to discuss the challenges, approaches, and needs of the industry with respect to data services and analytics. The roundtable brought together executives, researchers, and engineers representing various industry sectors (including big data producers, consumers, and technology and service providers). Participants included representatives from the aerospace and defense industry (e.g. AeroJet), local government (e.g. City of Phoenix), technology representatives that regularly handle big data sets (e.g. Avnet, GoDaddy, and IO Data Centers), and data solution developers (e.g. IBM and Oracle).
In order to help contextualize the day’s discussions and to have a baseline understanding of the participants view of the state of the art, as the very first task of the day, we presented the participants with 35 data services and analytics knowledge competency areas we had identified in advance and asked the participants to rate these areas in terms of:
We also asked the participants whether they are currently involved in technology development in any of these areas (the list of the areas knowledge competency areas we considered is included below).
Needlessly to say, the list of knowledge competency areas we presented to the participants was unavoidably imperfect and incomplete. Moreover, the discussions and results that have emerged from the participants’ ratings represented the biases and perceptions of the specific group of industry representatives that participated in the roundtable. Nevertheless, I think the list of six most critical knowledge competency groups (in terms of the value gap – i.e., the difference between current and desired states of the knowledge area) that emerged out of these ratings is quite interesting.
Six most critical knowledge competency groups
1. temporal and spatial data analyses,
2. summarization, cleaning, visualization, anomaly detection,
3. real-time processing for streaming data,
4. representations and fusion for unstructured/structured data, semantic Web,
5. graph-based models, social networks, and
6. performance and scalability, distributed architectures.
First of all, while we were expecting to see temporal and spatial data analyses somewhere in the short list, it was rather surprising to have these two at the very top of the list. Given that spatial and temporal data analyses are probably among the two most well-studied areas with long histories of innovation, it was interesting to see that the folks at this roundtable thought that the state-of-the-art is still way off where it should be by now.
It was also interesting that, for this group of representatives, “performance” and “scalability” (if not the need for “real-time” processing of data in motion) were somewhat less critical than the other key issues, such as summarization, cleaning, visualization, and integration. It looked liked the message we were being given was “yes the data scale is increasingly critical — but not just because it is too difficult to process all these data, but primarily because users are still not satisfied with the tools that are available to make sense out of the data.”
After this, the roundtable continued with sessions where groups of participants further discussed the challenges and opportunities in data services and analytics. Below is a very brief summary of the outcomes that emerged from these discussion sessions:
The amount of data being generated is massive. This necessitates new data architectures with lots of processing power and tools (including collaboration tools) that can match the scale of the data and support split second decision making, through data fusion and integration and analysis and forecasting algorithms, to help non-data-experts (both government and commercial) make decisions and generate value.
Data in motion
While batch processing techniques are acceptable for “data at rest”, for many business needs, the architectures have to support on-line and pipelined data processing to match the speed of the application and business. In particular real-time of processing of temporally and spatially distributed observations for “data in motion” is needed by many applications, including those that include human behavior modeling at individual and population scales. Credit card authentication, fraud detection systems financial services, retail, mobile/location-aware services, and healthcare are just a few examples of applications with this critical need. Tools that support entity analytics, (social and other) network analytics, text analytics, and media analytics are in increasing demand, not only for traditional applications like monitoring and security, but also for emerging applications, including enabling interest detection for retail/advertisement, social media, and SmartTV. Privacy and security are also key properties of many of these applications.
Federated data storage, analysis, and modeling
Federated data storage, analysis, and modeling are still critical for many businesses. Given that many government agencies and business opt to deploy common IT infrastructures and that most incoming data is unstructured (or at least differently structured), frameworks that can make unstructured data queriable, prioritize and rank data, correlate and identify the gaps in the data, highlight what is normal and not normal, and automate the ingest of the data into these common infrastructures can also have significant impact. These data architectures need to also support reducing the size or dimensionality of the data to make data amenable to analysis. The novel learning algorithms need to be able to take into account for known models, but also adapt the models to new emerging patterns. Underlying systems should also support going back in history to validate models and going forward into future to support forecasting and if-then hypothesis testing.
A new crop of data scientists
The above clearly necessitates a new crop of data scientists with solid algorithmic and mathematical background, complemented with excellent data management, programming, and system development/integration skills. Data visualization and media information extraction are also important technical skills. These data scientists should also be able to identify data that is important, restructure data to make it useful, interpret data, formulate observation strategies and relevant data queries, and ask new questions based on the observations and results – including “what happened?”, “why did it happen?”, and “what happens next?”.
No more teaching-one-DBMS paradigm
Increasingly no one single data management and analysis technology is able to address the needs of all applications; moreover technologies (both in application side as well as in the data management space) evolve rapidly. Therefore, a program educating these data scientists need to shy away from teaching one DBMS or one data management paradigm (e.g. row-stores, column-stores), but need to teach fundamentals for the graduates to pick and deploy the appropriate DBMS with the suitable structured or unstructured data model for the particular task and business needs. The program should not only teach the currently dominant technologies, but also introduce up-and-coming as well as legacy technologies and provide students with the skills necessary to keep up with the technology development. In particular, the program should address to bridge the ever increasing knowledge gap due to the proliferation of data management, processing, and analysis systems (including commercial and open-source) and help data scientists to make informed architectural decisions based on a good understanding on how available technologies differ and complement each other.
Multi-disciplinary, business-aware program
A key skill for these new data scientists, beyond their technical excellence, is an overall awareness of how businesses operate and appreciation for the impact of the data on the businesses. Moreover, to maximize impact, these data scientists need to have the necessary skills to communicate with non-technical co-workers, including business executives. Such a program needs to be inherently multi-disciplinary and include research, case studies, and representatives from the industry and government to provide a holistic perspective in research and teaching and have direct local and national impact.
Many of these observations are of course not surprising for those who are well-tuned with the needs of the industry and other user groups. Nevertheless, it was quite instructive that very consistent outcomes emerged from participant groups with diverse backgrounds. It was also interesting (both from educational as well as technological points of view) that participants highlighted that non-technical skills of data scientists and engineers, such as “collaboration” and “technical communication (including with non-technical co-workers)” are almost as critical as their technical skills. After all, many organizations are as complex and the usage contexts of the data are as diverse as the data themselves. Moreover, in a complex and layered organization, one’s learned result is nothing but the other’s raw data. Thus, an individual’s or group’s ability to make sense out of data may not be all that useful for a business if that individual or group does not also have the necessary skills and supporting tools to contextualize and communicate the learned results to the others in the organization.
Naturally, the outcomes of this roundtable will serve as a guide as we continue to develop our data management and analytics research and educational programs at ASU.
However, equally naturally, we are aware that no one single roundtable or workshop can unearth all the challenges and opportunities in a domain as diverse and fast developing as data management, services, and analytics. Thus, I (and I am sure many others following this blog) would love to hear your views and opinions.
| Blogger’s Profile:
K. Selçuk Candan is a Professor of Computer Science and Engineering at the Arizona State University. His primary research interests are in the area of efficient and scalable management of heterogeneous (such as sensed and streaming media, web and social media, scientific, and enterprise) data. He has published over 160 journal and peer-reviewed conference articles, one book, and 16 book chapters. He has 9 patents. He served as an associate editor of the Very Large Databases (VLDB) journal.
He is also in the editorial board of the IEEE Transactions on Multimedia and the Journal of Multimedia. He has served in the organization and program committees of various conferences. In 2006, he served as an organization committee member for SIGMOD’06. In 2008, he served as a PC Chair for ACM Multimedia (MM’08). In 2010, he served as a program committee group leader for ACM SIGMOD’10 and a PC Chair for the ACM Int. Conf. on Image and Video Retrieval’10. In 2011, he served as a general co-chair for the ACM MM’11 conference. In 2012 he served as a general co-chair for ACM SIGMOD’12. He has successfully served as the PI or co-PI of numerous grants, including from the National Science Foundation, Air Force Office of Research, Army Research Office, Mellon Foundation, and HP Labs. He also served as a Visiting Research Scientist at NEC Laboratories America for over 10 years. He is a member of the Executive Committee of ACM SIGMOD and an ACM Distinguished Scientist.
Figure 1 shows a portion of a relational table contained in a real, large information system. The table concerns the customers of an organization, where each row stores data about a single customer. The first column contains her code (if the code is negative, then the record refers to a special customer, called “fictitious”), columns 2 and 3 specify the time interval of validity for the record, ID_GROUP indicates the group the customer belongs to (if the value of FLAG_CP is “S”, then the customer is the leader of the group, and if FLAG_CF is “S”, then the customer is the controller of the group), FATTURATO is the annual turnover (but the value is valid only if FLAG_FATT is “S”). Obviously, each notion mentioned above (like “fictitious”, “group”, “leader”, etc.) has a specific meaning in the organization, and understanding such meaning is crucial if one wants to correctly manage the data in the table and extract information out of it. Similar rules hold for the other 47 columns that, for lack of space, are not shown in the figure.
Figure 1: A portion of the Customer table in a database of a large organization.
Those who have experience of large databases, or databases that are part of large information systems will not be surprised to see such complexity in a single data structure. Now, think of a database with many tables of this kind, and try to imagine a poor final user accessing such tables to extract useful information. The problem is even more severe if one considers that information systems in the real world use different (often many) heterogeneous data sources, both internal and external to the organization. Finally, if we add to the picture the (inevitable) need of dealing with big data, and consider in particular the two v’s of “volume” and “velocity”, we can easily understand why effectively accessing, integrating and managing data in complex organizations is still one of the main issues faced by IT industry nowadays.
Issues in governing complex information systems
The above is a simple example motivating the claim that governing the resources (data, meta-data, services, processes, etc.) of modern information systems is still an outstanding problem. I would like to go in more detail on three important aspects related to this issue.
Accessing and querying data. Although the initial design of a collection of data sources might be adequate, corrective maintenance actions tend to re-shape them into a form that often diverges from the original structure. Also, they are often subject to changes so as to adapt to specific, application-dependent needs. Analogously, applications are continuously modified for accommodating new requirements, and guaranteeing their seamless usage within the organization is costly. The result is that the data stored in different sources and the processes operating over them tend to be redundant, mutually inconsistent, and obscure for large classes of users. So, query formulation often requires interacting with IT experts who know where the data are and what they mean in the various contexts, and can therefore translate the information need expressed by the user into appropriate queries. It is not rare to see organizations where this process requires domain experts to send a request to the data management staff and wait for several days (or even weeks, at least in some Public Administrations in Italy…) before they receive a (possibly inappropriate) query in response. In summary, it is often exceedingly difficult for end users to single out exactly the data that are relevant for them, even though they are perfectly able to describe their requirement in terms of business concepts.
Data quality. It is often claimed that data quality is one of the most important factors in delivering high value information services. However, the above-mentioned scenario poses several obstacles to the goal of even checking data quality, let alone achieving a good level of quality in information delivery. How can we possibly specify data quality requirements, if we do not have a clear understanding of the semantics that data should bring? The problem is sharpened by the need of connecting to external data, originating, for example, from business partners, suppliers, clients, or even public sources. Again, judging about the quality of external data, and deciding whether to reconcile possible inconsistencies or simply adding such data as different views, cannot be done without a deep understanding of their meaning. Note that data quality is also crucial for opening data to external organizations. The demand of greater openness is irresistible nowadays. Private companies are pushed to open their resources to third parties, so as to favor collaborations and new business opportunities. In public administrations, opening up data is underpinning public service reforms in several countries, by offering people informed choices that simply have not existed before, thus driving towards improvements in the services to the citizens.
Process and service specification. Information systems are crucial artifacts for running organizations, and organizations rely not only on data, but also, for instance, on processes and services. Designing, documenting, managing, and executing processes is an important aspect of information systems. However, specifying what a process/service does, or which characteristics it is supposed to have, cannot be done correctly and comprehensively without a clear specification of which data the process will access, and how it will possibly change such data. The difficulties of doing that in a satisfactory way come from various factors, including the lack of modeling languages and tools for describing process and data holistically. However, the problems related to the semantics of data that we discussed above undoubtedly make the task even harder.
A (new?) paradigm
In the last five years, I (and my group in Roma) have been working on a new paradigm addressing these issues, based on the use of knowledge representation and reasoning techniques, and I want to share my excitement about it with the readers of this blog. The paradigm is called “Ontology-based Data Management” (OBDM), and requires structuring the information system into four layers.
The distinguishing feature of the whole approach is that users of the system will be freed from all the details of how to use the resources, as they will express their needs in the terms of the DKB. The system will reason about the DKB and the mappings, and will reformulate the needs in terms of appropriate calls to services provided by resources. Thus, for instance, a user query will be formulated over the domain ontology, and the system will reason upon the ontology and the mappings to call suitable queries over data sources that will compute the answers to the original user query.
As you can see, the heart of the approach is the DKB, and the core of the DKB is the ontology. So, what is new? Indeed, I can almost hear many of you saying: what is the difference with data integration (where the global schema plays the role of the ontology)? And what is the difference with conceptual modeling (where the conceptual schema plays the role of the ontology)? And what about Knowledge Representation in AI (where the axioms of the knowledge base play the role of the DKB)? The answer is simple: almost none. Indeed, OBDA builds on all the above disciplines (and others), but with the goal of going beyond what they currently provide for solving the problems that people encounter in the governance of complex information systems. At the same time, there are a few (crucial, at least for me) facts that make OBDA a novel paradigm to experiment and study. Here is a list of the most important ones:
A lot more to do
A few research groups are experimenting OBDM in practice (see, for example, the Optique IP project, financed by the Seventh Framework Program (FP7) of the European Commission). In Rome, we are involved in applied projects both with Public Administrations, and with private companies. One of the experiences we are carrying out is with the Department of Treasury of the Italian Ministry of Economy and Finance. In this project, three ontology experts from our department worked with three domain experts for six months, and built an ontology of 800 elements, with 3000 DL-Lite axioms, and 800 mapping assertions to about 80 relational tables. The ontology is now used as a common framework for all the applications, and will constitute the main document specifying the requirement for the restructuring of the information system that will be carried out in the next future. We are actually lucky to live in Rome, not only because it is a magnificent city, but also because Italian Public Administrations, many of which are located in the Eternal City, provide perfect examples of all the problems that make OBDM interesting and potentially useful…
The first experiences we have conducted are very promising, but OBDM is a young paradigm, and therefore it needs attention and care. This means that there are many issues to be addressed to make it really effectively work in practice. Let me briefly illustrate some of them. One big issue is how to build and maintain the ontology (and, more generally, the DKB). I know that this is one of the most important criticisms to all the approaches requiring a considerable modeling effort. My answer to these is that all modeling efforts are investments, and when we judge about investments we should talk not only about costs, but also about benefits. Also, take into account that OBDM works in a “pay-as-you-go” fashion: users have interesting advantages even with a very incomplete domain description, as the system can reason about an incomplete specification, and try to get the best out of it. Another important issue is evolution. Evolution in OBDM concerns not only the data at the sources (updates), but also the ontology, and the mappings. Indeed, both the domain description, and the resources continue to evolve, and all the components of the system should keep up with these modifications. Not surprisingly, this is one issue where more research is still desperately needed. Overall, the DKB and the mappings constitute the meta-data of the OBDM system, and in complex organizations, such meta-data can be huge and difficult to control and organize. Talking in terms of a fashionable terminology, with OBDM we face not only the problem of Big Data, but also the problem of Big Meta-Data. Another issue that needs to be further studied and explored is the relationship between the static aspects and the dynamic aspects of the DKB, together with the problem of mapping processes and services specified at the conceptual level to computational resources in applications.
I really hope that this blog have somehow triggered your attention to OBDM, and that you will consider looking at it more closely, for example for carrying out some experiments, or for doing research on at least some of its many open problems that still remain to be studied.
| Blogger’s Profile:
Maurizio Lenzerini is a full professor in Computer Science and Engineering at the University of Rome La Sapienza, where he is leading a research group on Databases and Artificial Intelligence. His main research interests are in database theory, data and service and integration, ontology languages, knowledge representation and reasoning, and component-based software development. He is a former Chair and a current member of the Executive Committee of ACM PODS (Principles of Database Systems). He is an ACM Fellow, an ECAI (European Coordinating Committee for Artificial Intelligence) fellow, and a member of the Academia Europaea – The Academy of Europe.