Graph data management has seen a resurgence in recent years, because of an increasing realization that querying and reasoning about the structure of the interconnections between entities can lead to interesting and deep insights into a variety of phenomena. The application domains where graph or network analytics are regularly applied include social media, finance, communication networks, biological networks, and many others. Despite much work on the topic, graph data management is still a nascent topic with many open questions. At the same time, I feel that the research in the database community is fragmented and somewhat disconnected from application domains, and many important questions are not being investigated in our community. This blog post is an attempt to summarize some of my thoughts on this topic, and what exciting and important research problems I think are still open.
At its simplest, graph data management is about managing, querying, and analyzing a set of entities (nodes) and interconnections (edges) between them, both of which may have attributes associated with them. Although much of the research has focused on homogeneous graphs, most real-world graphs are heterogeneous, and the entities and the edges can usually be grouped into a small number of well-defined classes.
Graph processing tasks can be broadly divided into a few categories. (1) First, we may to want execute standard SQL queries, especially aggregations, by treating the node and edge classes as relations. (2) Second, we may have queries focused on the interconnection structure and its properties; examples include subgraph pattern matching (and variants), keyword proximity search, reachability queries, counting or aggregating over patterns (e.g., triangle/motif counting), grouping nodes based on their interconnection structures, path queries, and others. (3) Third, there is usually a need to execute basic or advanced graph algorithms on the graphs or their subgraphs, e.g., bipartite matching, spanning trees, network flow, shortest paths, traversals, finding cliques or dense subgraphs, graph bisection/partitioning, etc. (4) Fourth, there are “network science” or “graph mining” tasks where the goal is to understand the interconnection network, build predictive models for it, and/or identify interesting events or different types of structures; examples of such tasks include community detection, centrality analysis, influence propagation, ego-centric analysis, modeling evolution over time, link prediction, frequent subgraph mining, and many others [New10]. There is much research still being done on developing new such techniques; however, there is also increasing interest in applying the more mature techniques to very large graphs and doing so in real-time. (5) Finally, many general-purpose machine learning and optimization algorithms (e.g., logistic regression, stochastic gradient descent, ADMM) can be cast as graph processing tasks in appropriately constructed graphs, allowing us to solve problems like topic modeling, recommendations, matrix factorization, etc., on very large inputs [Low12].
Prior work on graph data management could itself be roughly divided into work on specialized graph databases and on large-scale graph analytics, which have largely evolved separately from each other; the former has considered end-to-end data management issues including storage representations, transactions, and query languages, whereas the latter work has typically focused on processing specific tasks or types of tasks over large volumes of data. I will discuss those separately, focusing on whether we need “new” systems for graph data management and on open problems.
Graph Databases and Querying
The first question I am usually asked when I mention graph databases is whether we really need a separate database system for graphs, or whether relational databases suffice. Personally I believe that graph databases provide a significant value for a large class of applications and will emerge as another vertical.
If the goal is to support some simple graph algorithms or graph queries on data that is stored in an RDBMS, then it is often possible to do those using SQL and user-defined functions and aggregations. However, for more complex queries, a specialized graph database engine is likely to be much more user-friendly and likely to provide significant performance advantages. Many of the queries listed above either cannot be mapped to SQL (e.g., flexible subgraph pattern matching, keyword proximity search) or the equivalent SQL is complex and hard to understand or debug. An abstraction layer that converts queries from a graph query language to SQL could address some of these shortcomings, but that will likely only cover a small fraction of the queries mentioned above. More importantly, graph databases provide efficient programmatic access to the graph, allowing one to write arbitrary algorithms against them if needed. Since there is usually a need to execute some graph algorithms or network science tasks in the application domains discussed above, that feature alone makes a graph database very appealing. Most graph data models also support flexible schemas — although an orthogonal issue, new deployments may choose a graph database for that reason.
Whether a specialized graph database provides significant performance advantages over RDBMSs for the functionality common to both is somewhat less clear. For many graph queries, the equivalent SQL, if one exists, can involve many joins and unions and it is unlikely the RDBMS query optimizer could optimize those queries well (especially given the higher use of self-joins). It may also be difficult to choose among different ways to map a graph query into SQL. Queries that require recursion (e.g., reachability) are difficult to execute in a relational database, but are natural for graph databases. Graph databases can also employ specific optimizations geared towards graph queries and traversals. For example, graph databases typically store all the edges for a node with the node to avoid joins, and such denormalization can significantly help with traversal queries, especially queries that traverse multiple types of edges simultaneously (e.g., for subgraph pattern matching). Replicating node information with neighbors can reduce the number of cache misses and distributed traversals for most graph queries (at the expense of increased storage and update costs). Similarly, min cut-based graph partitioning techniques help in reducing the number of distributed queries or transactions, and similar optimizations can be effective in multi-core environments as well. On the other hand, there is less work on query optimization in graph databases, and for simple queries (especially simple subgraph pattern matching queries), the query optimizer in relational databases may make better decisions than many of today’s graph databases.
I think exploring such optimizations and understanding the tradeoffs better are rich topics for further research. For example, how the graph is laid out, both in persistent storage and in memory, can have a significant impact on the performance, especially in multi-core environments. We also need to better understand the common access patterns that are induced by different types of queries or tasks, and the impact of different storage representations on the performance of those access patterns. Another key challenge for graph querying is developing a practical query language. There is much theoretical work on this problem [Woo12] and several languages are currently used in practice, including SPARQL, Cypher, Gremlin, and Datalog. Among those, SPARQL and Cypher are based primarily on subgraph pattern matching and can handle a limited set of queries, whereas Gremlin is a somewhat low-level language and may not be easy to optimize. Datalog (used in LogicBlox and Datomic) perhaps strikes the best balance, but is not as user-friendly for graph querying and may need standardization of some of the advanced constructs, especially aggregates.
Unfortunately I see little work on these problems, and on end-to-end graph databases in general, in our community. There are quite a few graph data management systems being actively built in the industry, including Neo4j, Titan, OrientDB, Datomic, DEX, to name a few, where these issues are being explored. Much of the work in our community, on the other hand, is more narrowly focused on developing search algorithms and indexing techniques for specific types of queries; while that work has resulted in many innovative techniques, for wide applicability and impact, it is also important to understand how those fit into a general-purpose graph data management system.
Large-scale Graph Analytics
Unlike the above scenario, the case of new systems for graph analysis tasks, broadly defined to include graph algorithms and network science and graph mining tasks, is more persuasive. Batch analytics systems like relational data warehouses and MapReduce-based systems are not a good fit for graph analytics as is. From the usability perspective, it is not natural to write graph analysis tasks using those programming frameworks. Some graph programming models (e.g., the vertex-centric programming model [Mal10]) can be supported in a relational database through use of UDFs and UDAs [Jin14]; however it is not clear if the richer programming frameworks (discussed below) can also be efficiently supported. Further, many graph analysis tasks are inherently iterative, with many iterations and very little work per vertex per iteration, and thus the overheads of
those systems may start dominating and may be hard to amortize away.
A more critical question, in my opinion, is whether the popular vertex-centric programming model is really a good model for graph analytics. To briefly recap, in this model, users write vertex-level compute programs, that are then executed iteratively by the framework in either a bulk synchronous fashion or asynchronous fashion using message passing or shared memory. This model is well-suited for some graph processing tasks like computing PageRank or connected components, and also for several distributed machine learning and optimization tasks that can be mapped to message passing algorithms in appropriately constructed graphs [Low12]. Originally introduced in this context in Google’s Pregel system [Mal10], several graph analytics systems are built around this model (e.g., Giraph, Hama, GraphLab, PowerGraph, GRACE, GPS, GraphX).
However, most graph analysis tasks (e.g., the popular modularity optimization algorithm for community detection, betweenness centralities) or graph algorithms (e.g., matching, partitioning) cannot be written using the vertex-centric programming model while permitting efficient execution. The model limits the compute program’s access to a single vertex’s state and so the overall computation needs to be decomposed into smaller local tasks that can be (largely) independently executed; it is not clear how to do this for most of the computations discussed above, without requiring a large number of iterations. Even local neighborhood-centric analysis tasks (e.g., counting motifs, identifying social circles, computing local clustering coefficients) are inefficient to execute using this model; one could execute such a task by constructing multi-hop neighborhoods in each node’s local state by exchanging neighbor lists, but the memory required to hold that state can quickly make it infeasible [Qua14]. I believe these limitations are the main reason why most of the papers about this model focus on a small set of tasks like PageRank, and also why we don’t see broad adoption of this model for graph analysis tasks, unlike the MapReduce framework, which was very quickly and widely adopted.
Some of the alternative, and more expressive, programming models proposed in recent years include distributed Datalog-based framework used by Socialite [Seo13], data-centric programming models of Ligra [Shu13] and Galois [Ngu13], Green-Marl DSL [Hon12], and NScale framework from our recent work [Qua14]. Unlike the vertex-centric frameworks, however, distributed data-parallel execution is not straightforward for these frameworks, and investigating the trade-offs between expressiveness, ability to parallelize the computations, and ease-of-use remains a crucial challenge. The equivalence between graph analytics and matrix operations [Mat13], and whether that leads to better graph analysis systems, also need to be explored in more depth.
On a related note, the need for distributed execution of graph processing tasks is often taken as a given. However, graphs with 10′s to 100′s of billions of edges can be loaded onto a single powerful machine today, depending on the amount of information per node that needs to be processed (Ligra reports experiments on a graph with 12.9 billion edges with 256GB memory [Shu13]; in our recent work, we were able to execute a large number of streaming aggregate queries over a graph with 300 million edges on a 64GB machine [Mon14a]). Given the difficulty in distributing many graph analysis/querying tasks, it may be better to investigate approaches that eliminate the need to execute any single query or task in a distributed fashion (e.g., through aggressive compression, careful encoding of adjacency lists, or careful staging to disk or SSD (a la GraphChi)), while parallelizing independent queries/tasks across different machines.
Other Open questions
Despite much work, there are many important and hard problems that remain open in graph data management, in addition to the ones discussed above; more challenges are likely to come up as graph querying and analytics are broadly adopted.
Need for a Benchmark: Given the complex tradeoffs, many of the questions discussed above would be hard to answer without some representative workloads and benchmarks, especially because the performance of a system may be quite sensitive to the skew in the degree distribution and the graph diameter. Some of issues, e.g., storage representation, have been studied in depth in the context of RDF triple-stores, but the benchmarks established there appear to focus on scale and do not feature sufficient variety in the queries. A benchmark covering a variety of graph analysis tasks would also help significantly towards evaluating and comparing the expressive power and the performance of different frameworks and systems. Benchmarks would also help reconcile some of the recent conflicting empirical comparisons, and would help shed light on specific design decisions that impact performance significantly.
Temporal and real-time analytics: Most real-world graphs are highly dynamic in nature and often generate large volumes of data at a very rapid rate. Temporal analytics or querying over historical traces can lead to deeper insights into various phenomena, especially those related to evolution or change. One of the key challenges here is how to store the historical trace compactly while still enabling efficient execution of point queries and global or neighborhood-centric analysis tasks [Khu13]. Key differences from temporal databases, a topic that has seen much work, appear to be the scale of data, focus on distributed and in-memory environments, and the need to support global analysis tasks (which usually require loading entire historical snapshots into memory). Similarly real-time querying and analytics, especially anomaly detection, present several unique challenges not encountered in relational data stream processing [Mon14b].
Graph extraction: Another interesting, and practically important, question is how to efficiently extract a graph, or a collection of graphs, from non-graph data stores. Most graph analytics systems assume that the graph is provided explicitly. However, in many cases, the graph may have to be constructed by joining and combining information spread across a set of relations or files or key-value stores. A general framework that allows one to specify what graph or graphs need to be constructed for analysis, and how they map to the data stored in the persistent data stores would significantly simplify the end-to-end process of graph analytics. Even if the data is stored in a graph data store, often we only need to load a set of subgraphs of that graph for further analysis [Qua14], and similar framework would be needed to specify what subgraphs are to be extracted.
The above list is naturally somewhat skewed towards the problems we are working on in our group at Maryland, and towards developing general-purpose graph data management systems. In addition, there is also much work that needs to be done in the application areas like social media, finance, cybersecurity, etc.; in developing graph analytics techniques that lead to meaningful insights in those domains; in understanding what types of query workloads are typical; and in handling those over large volumes of data.
|Blogger Profile: Amol Deshpande is an Associate Professor in the Department of Computer Science at the University of Maryland with a joint appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS). He received his Ph.D. from University of California at Berkeley in 2004. His research interests include uncertain data management, graph analytics, adaptive query processing, data streams, and sensor networks.|
Most professional fields, whether in business or academia, rely on data and have done so for centuries. In the digital age and with the emergence of Big Data, this dependency is growing dramatically – perhaps out of proportion to its current value given the concepts, tools, and techniques presently available. For example, how do you tell if the results of data-intensive analysis are correct and reliable and not weak or even spurious? Most data-intensive disciplines have statistical measures that attempt to calculate meaning or truth. Efficacy quantifies the strength of a relationship within a system, such as biology or business. For example, when researchers investigate a new drug, they compare its effectiveness to a placebo, using statistics to determine whether the drug worked. This approach, where data selection and processing is predicated on complex models rather than simple comparison, is a far cry from select-project-join queries.
Efficacy is the capacity to produce a desired result or effect. In medicine, it is the ability of an intervention or drug to produce an outcome. P-values have been a conventional empirical metric of efficacy for 100 years.
Moreover, the underlying data in these fields is complex, uncertain, and multimodal. Despite a large body of research data management for science applications, there has been little adoption of relational techniques in the science disciplines. In this post, we examine two challenges. First, modeling data around domain-specific efficacy rather than set theory. Second, support for ensembles of data models to enable many perspectives on a single data set.
The big picture is compelling. Since the late 1980’s one of us estimated in papers and keynotes that databases contain less than 10% of the world’s data and dropping fast as non-database data growth exploded. A corresponding fraction of the world’s applications – data and computation – are amenable to traditional databases. Modelling the 90% opens the door for the database community to the requirements of the rest of the world’s data and a new, vastly larger generation of database research and technology. This calls for a shift in our community commensurate with the profound changes introduced by Big Data.
Efficacy first, then efficiency
Since meaning and truth are relative to a system, efficacy measures are of accuracy, correctness, precision, and significance with respect to a context. That we can compute an answer efficiently – at lightning speed over massive data sets – is entirely irrelevant or even harmful if we cannot demonstrate that the answer is meaningful or at least approximately right in a given context. As fields develop and complexity increases, efficacy measures become increasingly sophisticated, refined, and debated. For example, p-values – the gold standard of empirical efficacy – have been questioned for decades, especially under the pressure of increasing irreproducibility in science. The same is true for precision and recall in information retrieval. In fact, since most fields that depend on data involve uncertainty, measures of efficacy are being questioned everywhere, with the notable exception of data management.
Big Data, broadly construed, is inherently multidisciplinary but often lacks the efficacy measures of its constituent disciplines – statistics, machine learning, empiricism, data mining, information retrieval, among others – let alone those of application domains such as finance, biology, clinical studies, high-energy physics, drug discovery, and astrophysics. One reason for this is that efficacy measures that have been developed in the small data world, based on statistics and other fields, do not necessarily hold true over massive data sets . Efficacy in this context is an important, open, and rich research challenge. The value and success of data-intensive discovery (Big Data) depends on achieving adequate means of evaluating the efficacy of its results. A notable exception is the Baylor-Watson result  that focused first on efficacy, i.e., modeling, that then contributed to efficiency. But efficacy is one aspect of a larger challenge – modeling.
Relational data is the servant of the data model and the query. It was right to constrain data when we had a well-defined model. And we could always get the model right – right?
As data management evolved it distinguished itself from information retrieval by not requiring efficacy measures since databases were bounded, discrete, and complied with well-defined models, e.g., schemas. In contrast, information retrieval (and later machine learning) searched data sets for complex correlations rather than rigidly defined predicates. Finding relationships like “select all pairs where their covariance is greater than x” are inherently iterative and compute-intensive. In contrast, the contents of a database either match a query or they do not – black or white. No need for estimating accuracy, confidence, or probabilities. Relational data is the servant of the data model and the query. This permitted massive performance improvements that led to the widespread adoption of databases in applications for which schemas made sense. If the data did not comply with the schema and the query within that, then the data was erroneous by definition and should be rejected or corrected. It was right to constrain data when we had a well-defined model. And we could always get the model right – right? Many courageous researchers over the past 50 years have studied this problem, including probabilistic databases and fuzzy logic, (and more ) but none has seen widespread adoption. Why?
While the non-database world – life sciences, high-energy physics, astrophysics, finance – opened the door to Big Data and its possibilities, the data management world is aspiring to take ownership of their infrastructure – the storage, management, manipulation, querying, and searching of massive datasets. Currently much of this work is done in an ad-hoc manner using tools like R and Python. What is required for a more general solution? The non-database world is driven by applications – solving problems with real-world constraints – achieving efficacy within the models and definitions of their domain – often with 400 or 500 years of history.
In contrast, the database landscape is predominantly concerned with efficiency and has not dealt head on with efficacy yet. Some of these issues have been addressed in the database context in terms of specific models, languages, and design, but seldom have those concerns impacted the core database infrastructure, let alone gained adoption. Perhaps database researchers focused only on application domains that are well-behaved. While efficacy is a critical requirement – possibly the most critical requirement – in domains that make extensive use of data, it is part of the broader requirement for modelling unmet by database systems.
For more than a decade physics, astrophysics, photonics, biology, indeed most physical sciences as well as statistics and machine learning have made the modest assumption that multiple perspectives may be more valuable than a single model.
Data Models for the 90%
The database community, like many others, perhaps has not fully internalized the paradigm shift from small to big data. Big Data – or data per se – does not create the change nor is itself the change. Big Data opens the door to a revolution in thinking. One aspect involves data-driven methods. A profound shift involves viewing phenomena from multiple perspectives simultaneously.
A significant aspect of this shift is that every Big Data activity (small data activities also, but with less impact) requires measures of efficacy for each perspective or model. This is not simply owing to the reframing of corresponding principles from empirical science, but also to the multiple meanings of data, each of which requires mechanisms for addressing efficacy.
If the data management community is about to provide solutions for this nascent challenge, then it will need to deal with efficacy. This essentially has to do with modelling, a chaotic and ad-hoc database topic that has been largely unsuccessful, again measured by adoption. The relational model has dominated databases for over 40 years largely owing to efficiency. The database community knows how to optimize anything expressed relationally. While the relational model has proven to be amazingly general, its adoption has been limited in many domains, especially the sciences.
A related limitation of the database world is the assumption of a single perspective, e.g., a single version of truth, one schema per database even with multiple views. For more than a decade physics, astrophysics, photonics, biology, indeed most physical sciences as well as statistics and machine learning have made the modest assumption that multiple perspectives may be more valuable than a single model.
In  the author argued that science undergoes paradigm shifts only when there are rival theories about the fundamentals of a discipline. It is his position that rival paradigms are incommensurable using entirely different concepts and evaluation metrics from one another. One such example was the wave and particle theories of light. Each has entirely different models and measures of efficacy. Understanding the big picture necessitates finding consistencies and anomalies in both theories.
Ensemble models are one approach to addressing this challenge. Let’s consider an example in evolutionary biology where researchers use a collection of models to learn about how the human genome has changed over time. In  the authors identified positive examples of natural selection in recent human populations. Their discoveries have two parts: the affected gene’s location and its (improved) mutation. By composing many signals of natural selection, the authors increase the resolution of their genomic map by up to 100x. This research computes genetic signals at many levels, from clustering genes that are likely to be inherited together to looking at the high-level geographic distribution of different mutations. In present database modeling, the former might be represented as a graph database, whereas the latter is more likely to fall into the purview of geospatial databases. How can we bring them together? Perhaps neither of these models is designed for computing how effective different genetic variations are at producing advantageous traits. This pattern repeats itself in meteorology, physics, and a myriad of other domains that mathematically model large, dynamic systems.
Stepping into the void of uncertainty, unboundedness, ensemble models, and open-ended model exploration is far harder and scarier. We call it Computing Reality
Ensemble models pose substantial challenges to the data management community. How do you simultaneously store, manage, query, and update this variety of models, applying to a single dataset with many, possibly conflicting schemas? Database folks may first be concerned about doing this efficiently. Nope – wrong question. The first step is to understand the problem, to ask the right questions, to get the model correct and only then to make it efficient. How do you support ensemble models and their requirements including efficacy?
This may be why application domains that use massive data sets have grown their own data management tools, such as Hadoop, ADAM, Wikidata, and Scientific Data Management Systems, let alone a plethora of such tools in most physical science communities that the database community has never heard of. It’s not just that their data does not fit the relational model; databases do not support ensemble models, efficacy, or many of the fundamental concepts used to understand data. Why would any application domain (e.g., physical sciences, clinical studies, drug discovery) or discipline (e.g., information retrieval, machine learning, statistics) want to partner with an infrastructure technology that did not support its basic principles?
The database community has developed amazing technology that has changed the world. Since the early 1990’s it has extended its models to non-relational models such for networks, text, graphs, arrays, and many more. But efficacy is not just an issue of expressing eScience applications relationally, as UDFs or in R, but modeling and computing hypotheses under the complex contexts defined by domain experts, none of which map easily to set theory or other discrete mathematics. Stepping into the void of uncertainty, unboundedness, ensemble models, and open-ended model exploration is far harder and scarier. We call it Computing Reality .
Big Data is opening the door to a paradigm shift in many human endeavors. Machine learning was first through the door with real, albeit preliminary, results and it is already on to the next generation with deep learning . Analytics and other domains are riding the wave of machine learning. The database community is heading for the door now, but it will be challenging. We first have to understand the problem and get the requirements right. To paraphrase Ron Fagin, we need to focus on asking the right questions. The rest may be a breeze but efficacy before efficiency!
So not only are we leaving the relational world that was dominated one model or a class of discrete models, but we are leaving the world of a single model for each dataset and embarking on a journey into a world of ensemble models of including probabilistic, fuzzy, and even potentially the richest model of them all, a model-free approach that enables us to listen to the data. All at scale. This seems scary to us but also just what we need.
Are we crazy, naive? Isn’t it our mission to dig in this data goldmine, to contribute to accelerating scientific discovery? What do you think? We are all ears.
 Duggan, Jennie and Michael L. Brodie, Hephaestus: Virtual Experiments for Data-Intensive Science, In CIDR 2015 (to appear)
 Gomes, Lee. Machine-Learning Maestro Michael Jordan on the Delusions of Big Data and Other Huge Engineering Efforts, IEEE Spectrum, 20 Oct 2014
 Grossman, Sharon R., et al. “A composite of multiple signals distinguishes causal variants in regions of positive selection.” Science 327.5967 (2010): 883-886.
 National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013
 Scott Spangler, Angela D. Wilkins, Benjamin J. Bachman, Meena Nagarajan, Tajhal Dayaram, Peter Haas, Sam Regenbogen, Curtis R. Pickering, Austin Comer, Jeffrey N. Myers, Ioana Stanoi, Linda Kato, Ana Lelescu, Jacques J. Labrie, Neha Parikh, Andreas Martin Lisewski, Lawrence Donehower, Ying Chen, and Olivier Lichtarge. 2014. Automated hypothesis generation based on mining scientific literature. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ’14). ACM, New York, NY, USA, 1877-1886. DOI=10.1145/2623330.2623667 http://doi.acm.org/10.1145/2623330.2623667
 Kuhn, Thomas S. The structure of scientific revolutions. University of Chicago press, 2012.
| Bloggers’ Profiles:
Dr. Brodie has over 40 years experience in research and industrial practice in databases, distributed systems, integration, artificial intelligence, and multi-disciplinary problem solving. He is concerned with the Big Picture aspects of information ecosystems including business, economic, social, application, and technical. Dr. Brodie is a Research Scientist, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology; advises startups; serves on Advisory Boards of national and international research organizations; and is an adjunct professor at the National University of Ireland, Galway. For over 20 years he served as Chief Scientist of IT, Verizon, a Fortune 20 company, responsible for advanced technologies, architectures, and methodologies for Information Technology strategies and for guiding industrial scale deployments of emergent technologies. His current research and applied interests include Big Data, Data Science, and data curation at scale and a related start up Tamr.com. He has served on several National Academy of Science committees. Dr. Brodie holds a PhD in Databases from the University of Toronto and a Doctor of Science (honoris causa) from the National University of Ireland.
Jennie Duggan is a postdoctoral associate at MIT CSAIL working with Michael Stonebraker and an adjunct assistant professor at Northwestern University. She received her Ph.D. from Brown University in 2012 under the supervision of Ugur Cetintemel. Her research interests include scientific data management, database workload modeling, and cloud computing. She is especially focused on making data-driven science more accessible and scalable.
Crowdsourcing is a powerful paradigm that democratizes the creation, collection, analysis, curation, and dissemination of data, contributed by millions of individuals, who are called workers. Early crowdsourcing platforms in the context of citizen reporting and citizen sciences were dedicated to specific tasks. Today, many other platforms that expect workers with different expertise exist. Among them, we can find Turkit, Mob4hire, uTest, Freelancer, eLance, oDesk, Guru, Topcoder, Trada, 99design, Innocentive, CloudCrowd, and CloudFlower. The most popular one is Amazon Mechanical Turk (AMT) that is generic and allows any requester to post virtually any kind of task in which any worker can participate via self-appointment.
Tasks can be as simple as binary requests such as recognizing a landmark in a city, or more sophisticated ones such as meta-data enrichment (picture and video tagging in Flickr or YouTube, food description in Open Food Facts), opinion solicitation (restaurants reviews), collaborative intelligence (city map alignment in http://maps.nypl.org/warper/, identifying stars in GalaxyZoo, character recognition in Recaptcha), and knowledge-intensive crowdsourcing with tasks such as collaborative editing (Wikipedia) and idea generation. The database community has been offering its own platforms (e.g., Qurk on top of AMT and Deco for query processing) and today, scientists are resorting to crowdsourcing for a particular kind of task, that is, quality experiments to validate their findings.
There are several challenges one faces in crowd-based quality experiments. Some are inherent to using a crowd. For example, worker filtering to make sure only qualified ones participate, has to be done by task requesters. Those qualification tests are often difficult to design and do not always lead to flawless experiments. Data collection to build worker profiles is also a challenge (e.g., workers are asked to rate at least 30 movies to build a profile). Other challenges depend on the actual platform. The availability of a large worker pool and the ability to easily set up virtually any task, make AMT the platform of choice for most research experiments. In fact, AMT is used by most researchers in the DB community, and tasks range from answering simple binary questions such as determining whether a tweet expresses a positive or a negative sentiment, to more sophisticated tasks that require some domain knowledge such as collaborative journalism. However, the closed nature of AMT (and of all crowdsourcing platforms we know of today) forces task requesters to find workarounds outside the system in order to select the right pool of workers and assign to them the right set of tasks (AMT is often used as a hiring platform only and workers are redirected to another website). Moreover, monetary compensation, as is the case in AMT, is not always the best choice for incentivizing workers (think open-source contributions) and tends to attract more cheaters when the reward is too high. To address that, post-quality control is enforced via majority voting or by monitoring task completion time. The bottom line is that in a volatile environment, enforcing some level of correctness is difficult, time-consuming and costly.
I have used AMT in several qualitative analyses. The first experiment was in 2009  to evaluate travel itineraries extracted from Flickr against carefully handcrafted ones. It was a large-scale experiment “to validate the wisdom of the crowd”. About 450 workers participated in the experiment and evaluated itineraries in 5 popular and geographically distributed cities (San Francisco, New York City, London, Paris and Barcelona). Half of the workers were involved in an independent study that evaluated the usefulness of our itineraries, the points of interests they contained and the landmark visit and transit times. The other half participated in a comparative study where our itineraries were put side-by-side with handcrafted ones and were preferred in a large majority of the cases. It worked so well that the Wall Street Journal, a prominent newspaper in the US, talked about it . In this experiment, special attention was dedicated to the design of qualification tests to filter out workers. We had to think carefully about a test for which answers are not readily available with simple Web search. Below is one of the tests we used.
The second time we used AMT  relied on input from 100 workers to evaluate how well different rating aggregation functions (least misery function, majority voting, pairwise disagreement, and global average) to compute movie recommendations from user groups with different characteristics (small/large groups, similar/dissimilar groups). We found several insightful and sometimes surprising results including that least misery, i.e., optimizing for the lowest rating in a group, performed better than averaging ratings and is the best aggregation function for small groups formed by similar users. Pairwise disagreement on the other hand, did very well with large groups of dissimilar users. In the first round of that experiment, we had inconsistent findings. We then looked carefully in AMT logs and realized that there were cheaters whose task completion time was unrealistic.
The latest work we used AMT for was crowdsourcing the validation of articles produced collaboratively by a crowd to another crowd . Using the crowd to validate the crowd is in fact a common practice and has shown to work well in many applications such as sentence translation by non-experts. Participating workers were instructed to join a group and together write a short summary of current world events on some topic. The qualification test included answering factual questions on those events. Of course, cheaters could use their favorite search engine to find answers to the qualification test.
I hope that the examples above illustrate both the advantage of having a generic platform that enables reaching out to a large worker pool but also showcases the challenges of using crowds for qualitative analysis. Using crowds bears similarities to running surveys and to using cohorts in medical drug trials. While there are practices on how to select such cohorts and ensure result quality (e.g., controlled/test groups, independently chosen subjects, double blind tests in medical trials), none exist for running a crowd-based quality experiments.
With my colleagues Beatrice Valeri and Shady El Bassuoni, we examined 4395 papers published in SIGMOD, PVLDB, ICDE, EDBT, ICDM, KDD, ICWSM, RecSys, and Hypertext between 2010 and 2014 (2014 statistics are partial) and found that 60 of them used crowd-based quality experiments. Among those, all of them use AMT or CrowdFlower, a layer on top of AMT. Most tasks are about collecting golden data used in evaluation and among those, labeling data is the most common. 48 out of the 60 papers have at least one author in the USA. European authors contributed to a total of 14 papers followed by Israel (6 papers), China (4 papers), Qatar and Canada (3 papers each), Singapore and Japan (2 papers each). As shown below, the number of publications that crowdsource experiments has been increasing with a worker base ranging from around 20 to 500 per experiment.
Most papers use workers’ acceptance rate (recorded in AMT based on previous tasks completed by those workers) and only a few use a thorough qualification test. Those tests are followed by a post-processing step where majority voting, manual checking (by involving another crowd), task completion time, and answer justification, are used to check workers’ input. There goes design efficiency.
What is our opportunity today?
I do not expect our community to suddenly change the course of things. I myself still run experiments on AMT and will continue to do it. Of course, there are questions related to where to start and how to maintain the platform. Crowd4U (crowd4u.org) is a good place to start. It’s an all-academic generic platform that is being developed at the Univ. of Tsukuba and that is open to the rest of the academic world. Here in Grenoble, we ran a campaign on campus where we advertised crowdsourcing and gave out cookies and drinks (not totally free!) and passers-by performed over 1600 tasks to label tweets in less than 3 hours.
You can start your own platform or join Crowd4U but keep it open! In , we argue that human factors such as worker skills, worker availability, expected wage, and ability to work together, could be accounted for to make better use of the crowd. Such parameters can only be tested in an open and generic crowdsourcing system where task assignment and skill learning algorithms can be deployed for virtually any task.
So let’s take this opportunity, have some wisdom, and build and nurture our own crowdsourcing platform(s).
 Senjuti Basu Roy, Sihem Amer-Yahia, Ashish Chawla, Gautam Das, Cong Yu: Space efficiency in group recommendation. VLDB J. 19(6): 877-900 (2010)
 Senjuti Basu Roy, Ioanna Lykourentzou, Saravanan Thirumuruganathan, Sihem Amer-Yahia, Gautam Das: Optimization in Knowledge-Intensive Crowdsourcing. CoRR abs/1401.1302 (2014)
| Blogger’s Profile:
Sihem Amer-Yahia is a 1st class CNRS (Centre National de Recherche Scientifique) Research Director at LIG (Laboratoire d’Informatique de Grenoble) in France. Sihem heads the SLIDE group (ScaLable Information Discovery and Exploitation) that sits at the intersection of large-scale data management and Web data analytics with an emphasis on the social and the semantic Web. Until July 2012, Sihem was Principal Scientist at the Qatar Computing Research Institute (QCRI) where she led a group in Social Computing and worked with local Universities on student mentoring and with Al Jazeera Online on news traffic analytics. From 2006 to 2011, she was Senior Scientist at Yahoo! Research and worked on revisiting relevance models and scalable top-k processing algorithms for Delicious, Yahoo! Travel and Personals, and Flickr. Before that, she spent 7 years at at&t Labs in New Jersey, working on XML query optimization and XML full-text search in conjunction with the W3C. Sihem has served on the SIGMOD Executive Board, is a member of the VLDB and the EDBT Endowments. She serves on the editorial boards of ACM TODS, the VLDB Journal and the Information Systems Journal. She was track chair of SIGIR 2013 and of PVLDB 2013. She is PC chair of EDBT 2014 and will be PC chair of BDA 2015 (French DB conference) and PC co-chair of SIGMOD Industrial 2015. Sihem received her Ph.D. in Computer Science from Paris-Orsay and INRIA in 1999, and her engineering degree from ESI/INI, Algeria.
Exploratory search has gained prominence in recent years. There is an increased interest from several scientific communities, from the information retrieval and database to human-computer interaction and visualization communities, in moving beyond the traditional query-browse-refine model supported by web search engines and database systems alike, and towards support for human intelligence amplification and information understanding. Exploratory search systems should inherently support symbiotic human-machine relationships that provide guidance in exploring unfamiliar information landscapes [Whi 09]. In this post, four researchers set to explore exploratory search from different angles and study the emerging needs and objectives as well as the challenges and problems that need to be tackled.
Exploratory Search in Structured Data: Data Warehouses and OLAP to the Rescue?
Looking more closely at what problems the field of exploratory search encompasses, Marchionini has established two main aspects of exploratory search that go beyond the classical problem of looking up information using classical search mechanisms [Mar 06]. More specifically, exploratory search has the general goal of (i) learning, thus acquiring new knowledge and (ii) investigating to possibly reveal new facts. Sub-tasks to achieve these goals include:
Clearly, exploratory search applies to both structured and unstructured data, as we want to leverage as much data as possible in the above tasks. However, if we focus on structured data, we notice that quite a few of the tasks mentioned above resemble those for which data warehouses and online analytical processing (OLAP) have been developed. One example of a learning task supported by data warehouses is integration, as one of the main purposes of data warehouses is to integrate data from distributed and heterogeneous sources in a single repository. Consequently, data comparison is easily supported as well. As for supporting investigation tasks, OLAP and data mining algorithms readily support analysis of data stored in a data warehouse, and reports synthesize such data. Finally, data warehouses are deployed to support decision making, such as planning and forecasting.
So are the problems of exploratory search already solved for structured data through data warehouses and OLAP?
To some extent yes but there are still many areas of exploratory search that data warehouses do not cover, such as socialization, comprehension and interpretation, and discovery. In the following, I will highlight three main areas where I believe exploratory search will require novel approaches to reach its full potential.
Discovery. Data warehouses are designed based on a predefined information demand, meaning that the analytical queries they support are fixed in advance and for sure cannot go beyond queries that the schema of the data warehouse supports. Essentially, the data warehouse is designed to efficiently answer predefined queries. Opposed to that, in exploratory search, the information demand is unknown a priori, the goal of discovery being to also reveal things that users of the system did not think about during design time and thus are not covered by the rigid schema. Hence, system design for exploratory search needs to plan for the unplanned. Making an analogy to farming, we can say that whereas data warehouses and OLAP are useful for harvesting what has been planted, exploratory search is all about exploring and laboring new land.
Adaptation. Given the rigid cage schemas impose on a data warehouse, these systems typically have limited capabilities for changing or evolving information needs. But these are the essence of exploratory search, as learning is an iterative process that requires a search-refine-expand paradigm. That is, whereas data warehouses allow us to freely move in a cage, exploratory search systems need to be able to adapt to a new and rapidly changing environment.
Data and Users. As a final distinction here, we have limited the discussion above to structured data, but obviously, a wealth of information resides in unstructured data that exploratory search should natively support. This brings us back to a long going discussion on integrating databases and information retrieval. Additionally, many limitations of the data warehouse come from the rigid schema, so could maybe NoSQL databases be of help here? Finally, data warehouses and OLAP are targeted towards expert users that know their domain, whereas exploratory search is for the masses and human-computer interaction will be key to success.
To conclude, we have seen that data warehouses may support to some extent exploratory search, but they have serious limitations when it comes to discovery, adaptability, and the support of all types of relevant data and users.
Exploratory Search and Web searching
Web searching is probably the most frequent user task in the Web, during which users typically get back a linear list of hits. In fact, web search engines are very good in ranking, and there is a plethora of ranking methods for capturing various aspects (topic relevance, authority, popularity, credibility and personalization). However, ranking is not enough. Ranking is adequate for focalized (else called precision-oriented) information needs, however the majority of information needs are recall-oriented (according to various user studies). For recall-oriented information needs, the user wants to find and understand more than one search hit. Examples include bibliographic survey writing, medical information seeking, patent search, hotel booking, and car buying. In general, recall-oriented information needs have an exploratory nature and aim at decision making (based on one or more criteria). To support such information needs, it is beneficial to provide methods that:
The last years we, in the University of Crete and FORTH-ICS, have been investigating methods that can complement the classical Web searching with functionality for exploratory search. At first, we developed the web search engine (WSE) Mitos, a system which offers faceted search over the results of the submitted queries. Specifically, it supports facets corresponding to metadata attributes of the web pages (static metadata), as well as facets corresponding to the outcome of snippet-based clustering algorithms (a kind of dynamic metadata). The user can then restrict her focus gradually, by interacting with the resulting multidimensional structure through simple clicks.
The system IOS [Faf 12] was developed next, providing analogous features but during query typing (type-ahead search). This functionality is offered for the frequent queries and takes advantage of novel partitioned trie-based indexes for reducing the increased main memory requirements. Subsequently, through XSearch [Faf 12b] we investigated and showed how the available Linked Open Data (LOD) can be exploited for offering facets based on the outcome of entity mining techniques. Again, this approach is applied at query time over the textual snippets of the search hits, where LOD provides information about the names of the entities of interest. This approach has been proved successful for professional search, specifically in the domain of patent search and marine search. With X-ENS [Faf 12c] we focused on configurability, i.e., allowing the end user to configure the entities of interest, and also browse the information that LOD provides for these entities.
However, to offer entity mining not over the snippets, but over the full contents of the search hits, more than one machines are needed for downloading and analyzing the full contents of the hits. A recent work shows how MapReduce can be used for this purpose [Kit 14].
Subsequently, we investigated how we could enable the user to easily express preferences, simple and complicated ones, over the elements the user interacts with. Such preferences can affect the ordering of the different aspects of the multidimensional (and hierarchically organized) information space. We have developed the prototype system Hippalus that realized this approach over a proposed preference framework [Pap 14].
More information about the aforementioned systems is available here.
Based on our experience, we could identify four main topics that we believe current IR/Web search does not cover, and exploratory search will require to reach its full potential:
Ubiquity. Faceted browsing of search results and gradual restriction should be possible for any kind of query, for any domain, and with no predetermined facets. In other words, methods that bypass the need for explicit configuration (regarding facets, entity types, LOD sources) are required.
Fusion of Structured and Unstructured Content. The exploitation of LOD in the exploratory search process is promising, e.g. for Named Entity Recognition and disambiguation. However, the fusion of structured and unstructured content requires more work.
User Control. Explicit, user-provided, and controllable preference management is beneficial for supporting a transparent decision making process. We believe that the framework supported by the Hippalus system is a first step towards this direction.
Evaluation. We need easy-to-follow methods for evaluating the effectiveness of exploratory search methods, and easily reproducible evaluation results. Although the classical IR has well established methodologies for evaluation, things are not so clear and straightforward in interactive IR (IIR).
Exploratory Search and Multimedia Data
By its very nature, multimedia data exploration shares the 3V challenges ([V]olume, [V]elocity, and[V]ariety ) of the so called “Big Data” applications. Systems supporting multimedia data exploration, however, must tackle additional, more specific, challenges, including those posed by the [H]igh-dimensional, [M]ulti-modal (temporal, spatial, hierarchical, and graph-structured), and inter-[L]inked nature of most multimedia data as well as the [I]mprecision of the media features and [S]parsity of the observations in the real-world.
Moreover, since the end-users for most multimedia data exploration tasks are us (i.e., humans), we need to consider fundamental constraints posed by [H]uman beings, from the difficulties they face in providing unambiguous specifications of interest or preference, subjectivity in their interpretations of results, and their limitations in perception and memory. Last, but not the least, since a large portion of multimedia data is human-centered, we also need to account for the users’ (and others’) needs for [P]rivacy.
Multimedia exploration (on data with the above characteristics and by users with the above limitations) is an inherently dynamic process, and systems for multimedia exploration must be able to support, efficiently and effectively, a continuous exploration cycle involving four key steps:
Naturally, each and every step of this multimedia exploration cycle poses significant challenges. While it would be impossible to enumerate all the challenges, I would like to identify the following five “core” challenges:
Unfortunately, while as a community we have done great advances in tackling these core challenges, we probably have to admit that we are still quite far from addressing any of these issues satisfactorily, especially within the context of large multimedia data collections. In fact, at least in the short- to medium-term future, these five issues will continue to form the core challenges in multimedia exploration and retrieval.
If there is one thing that is becoming more urgent, however, it is that the ever-increasing scale and the speed of the data implies that to support the above media exploration cycle, our emphasis must shift towards development of integrated data platforms that can support, in an optimized and scalable manner, both media analysis (feature extraction, clustering, partitioning aggregation, summarization, classification, latent analysis) and data manipulation (filtering, integration, personalized and task-oriented retrieval) operations.
Personalizing Data Exploration: Exploring the Past
Personal data is now pervasive as digital devices are capturing every part of our lives. Data is constantly collected and saved by users, either voluntarily in files, emails, social media interactions, multimedia objects, calendar items, contacts, etc., or passively by various applications such as GPS tracking of mobile devices, records of utility usage, financial transactions, or quantified self sensors. A typical user will have data recorded on many devices, cloud services, and proprietary systems; access to the data can be difficult because of the data fragmentation, security issues, or practical concerns.
Companies and organizations have famously been able to take advantage of the wealth of information produced by individuals, learning patterns and visualizing trends on a large number of data points produced by many users. Yet, individual uses do not have accessible tools to retrieve, manage and analyze their own data.
Leveraging personal data is critical to many data exploration tasks:
Exploring our past. A specific type of data exploration is re-finding [Tee 04], [Ber 08], whose goal is to find information that has been created, received, or seen by the user. Unlike traditional web search or exploration tasks whose focus is usually on discovering new information/documents, a re-finding exploration task has a target object it is attempting to recover.
Fragmentation of data is a main challenge. Users rarely own and store their personal data, with the exception of personal files. Most personal information is stored in the cloud by commercial companies that may offer some limited access, usually via a web browser or an API, to a user’s personal data. Attempting to retrieve and cross-reference personal information then leads to a tedious, often maddening, process of individually accessing all the relevant sources of data and manually linking their information. For example, checking that an insurance claim was correctly processed may require looking at a calendar application to find the doctor’s appointment date, checking the claim status on the insurance web site, and consulting a bank web portal to confirm that the payment was received.
The future of personal data exploration lies in the creation of personal data exploration tools, in the spirit of Dataspaces [Blu 07], [Hal 06], that will integrate personal data from a variety of sources, and allow users to visualize and search through their digital memories based on any piece of information they remember, following threads of information to navigate from one memory to the next.
Exploring our social data. Personal data is not limited to a user’s own data production. In an increasingly connected world, what other users in our social network share with us is also relevant to our data exploration. Users looking for book recommendations may want to explore their social network connections for books they have read or discussed. Some desktop search systems, such as deskWeb [Zer 10], have integrated the user’s social network graph, expanding the searched data set to include information available in the social network.
Personalizing our data exploration. Users have individual habits and patterns, as well as different types of data, context and information needs. By inferring knowledge from their past personal information and their interactions with their data, we can personalize data exploration and fit it to the users’ individual needs.
Social network information can be leveraged to learn contextual information about a user and guide data exploration. For instance, “Student Center” has different meaning for different groups of users. Knowing which social group the user belongs to can help us identify entities, e.g., observing that a majority of the user’s friends go to Rutgers University and attend events at “Busch Campus Student Center, Rutgers University”, increases the likelihood that the phrase “Student Center” in the user’s data and queries refers to that particular student center. Similarly, past queries can be used to learn some information about the type of the entities present in the personal data, or about the personal context.
This type of personalized data exploration is gaining traction in prospective memory applications, such as Google Now, which focus on reminding users of their appointments and to-do items.
Overall, understanding and leveraging personal data is crucial for many data exploration tasks, either because the exploration is on the personal information itself, or because personal information can give significant insights and directions for traditional data exploration needs.
| Blogger Profiles:
Melanie Herschel is an Associate Professor in the Computer Science Department of the University of Paris South and is also member of the INRIA Oak project. Before this, after receiving her Ph.D. in Computer Science from Humboldt-University Berlin (2008), she was a postdoctoral researcher at IBM Research – Almaden (2008 – 2009) and at the University of Tübingen, Germany (2009 – 2011). During her research career, she has focused on topics related to data integration, data quality, data provenance, and, more recently, on processing large amounts of Web Data. Melanie participates in several nationally funded research projects and is an active member of the database research community, as she has served on numerous program committees, is a frequent journal referee, and has organized international workshops.
Yannis Tzitzikas is Assistant Professor in the Department of Computer Science of the University of Crete and Research Associate of FORTH-ICS. His research interests fall in the intersection of the following areas: Information Systems, Information Indexing and Retrieval, Conceptual Modeling, Knowledge Representation and Reasoning. He is one of the editors (and author of several chapters) of the book “Dynamic Taxonomies and Faceted Search” (Springer 2009), he is currently involved in the ongoing EU projects i-Marine, SCIDIP-ES, Aparsen NoW (WP leader), and he is national contact of the MUMIA (Multilingual and Multifaceted Interactive Information Access) Cost Action. His current research revolves around Faceted Interactive Search, and lately on methods that can bridge the world of documents with the world of data at search time. He has published more than 80 papers in refereed international conferences and journals (including ACM Transactions on the Web, VLDB Journal, Journal on Data Semantics, International Semantic Web Conference (winner of best paper award), ESWC, and many others).
K. Selçuk Candan is a Professor of Computer Science and Engineering at the Arizona State University. He has published over 170 journal and peer-reviewed conference articles, one book on multimedia retrieval, and 16 book chapters. He has 9 patents. Prof. Candan served as an associate editor of one of the most respected database journals, the Very Large Databases (VLDB) journal. He is also in the editorial board of the IEEE Transactions on Multimedia and the Journal of Multimedia. He has served in the organization and program committees of various conferences. In 2006, he served as an organization committee member for SIGMOD’06, the flagship database conference of the ACM. In 2008, he served as a PC Chair for another leading, flagship conference of the ACM, this time focusing on multimedia research (MM’08). More recently, he served as a program committee group leader for ACM SIGMOD’10. He also serves in the review board of the Proceedings of the VLDB Endowment (PVLDB). In 2011, he served as a general co-chair for the ACM MM’11 conference. In 2012 he served as a general co-chair for ACM SIGMOD’12. He has successfully served as the PI or co-PI of numerous grants, including from the National Science Foundation, Air Force Office of Research, Army Research Office, Mellon Foundation, HP Labs, and JCI. He also served as a Visiting Research Scientist at NEC Laboratories America for over 10 years. He is a member of the Executive Committee of ACM SIGMOD and an ACM Distinguished Scientist.
Amélie Marian is an Associate Professor in the Computer Science Department at Rutgers University. Her research interests are in Personal Information Management, Ranked Query Processing, Semi-structured data and Web data Management. Amélie received her Ph.D. in Computer Science from Columbia University in 2005. From March 1999 to August 2000, she was a member of the VERSO project at INRIA-Rocquencourt. She is the recipient of a Microsoft Live Labs Award, three Google Research Awards, and an NSF CAREER award.