December 9, 2014
Graph data management has seen a resurgence in recent years, because of an increasing realization that querying and reasoning about the structure of the interconnections between entities can lead to interesting and deep insights into a variety of phenomena. The application domains where graph or network analytics are regularly applied include social media, finance, communication networks, biological networks, and many others. Despite much work on the topic, graph data management is still a nascent topic with many open questions. At the same time, I feel that the research in the database community is fragmented and somewhat disconnected from application domains, and many important questions are not being investigated in our community. This blog post is an attempt to summarize some of my thoughts on this topic, and what exciting and important research problems I think are still open.
At its simplest, graph data management is about managing, querying, and analyzing a set of entities (nodes) and interconnections (edges) between them, both of which may have attributes associated with them. Although much of the research has focused on homogeneous graphs, most real-world graphs are heterogeneous, and the entities and the edges can usually be grouped into a small number of well-defined classes.
Graph processing tasks can be broadly divided into a few categories. (1) First, we may to want execute standard SQL queries, especially aggregations, by treating the node and edge classes as relations. (2) Second, we may have queries focused on the interconnection structure and its properties; examples include subgraph pattern matching (and variants), keyword proximity search, reachability queries, counting or aggregating over patterns (e.g., triangle/motif counting), grouping nodes based on their interconnection structures, path queries, and others. (3) Third, there is usually a need to execute basic or advanced graph algorithms on the graphs or their subgraphs, e.g., bipartite matching, spanning trees, network flow, shortest paths, traversals, finding cliques or dense subgraphs, graph bisection/partitioning, etc. (4) Fourth, there are “network science” or “graph mining” tasks where the goal is to understand the interconnection network, build predictive models for it, and/or identify interesting events or different types of structures; examples of such tasks include community detection, centrality analysis, influence propagation, ego-centric analysis, modeling evolution over time, link prediction, frequent subgraph mining, and many others [New10]. There is much research still being done on developing new such techniques; however, there is also increasing interest in applying the more mature techniques to very large graphs and doing so in real-time. (5) Finally, many general-purpose machine learning and optimization algorithms (e.g., logistic regression, stochastic gradient descent, ADMM) can be cast as graph processing tasks in appropriately constructed graphs, allowing us to solve problems like topic modeling, recommendations, matrix factorization, etc., on very large inputs [Low12].
Prior work on graph data management could itself be roughly divided into work on specialized graph databases and on large-scale graph analytics, which have largely evolved separately from each other; the former has considered end-to-end data management issues including storage representations, transactions, and query languages, whereas the latter work has typically focused on processing specific tasks or types of tasks over large volumes of data. I will discuss those separately, focusing on whether we need “new” systems for graph data management and on open problems.
Graph Databases and Querying
The first question I am usually asked when I mention graph databases is whether we really need a separate database system for graphs, or whether relational databases suffice. Personally I believe that graph databases provide a significant value for a large class of applications and will emerge as another vertical.
If the goal is to support some simple graph algorithms or graph queries on data that is stored in an RDBMS, then it is often possible to do those using SQL and user-defined functions and aggregations. However, for more complex queries, a specialized graph database engine is likely to be much more user-friendly and likely to provide significant performance advantages. Many of the queries listed above either cannot be mapped to SQL (e.g., flexible subgraph pattern matching, keyword proximity search) or the equivalent SQL is complex and hard to understand or debug. An abstraction layer that converts queries from a graph query language to SQL could address some of these shortcomings, but that will likely only cover a small fraction of the queries mentioned above. More importantly, graph databases provide efficient programmatic access to the graph, allowing one to write arbitrary algorithms against them if needed. Since there is usually a need to execute some graph algorithms or network science tasks in the application domains discussed above, that feature alone makes a graph database very appealing. Most graph data models also support flexible schemas — although an orthogonal issue, new deployments may choose a graph database for that reason.
Whether a specialized graph database provides significant performance advantages over RDBMSs for the functionality common to both is somewhat less clear. For many graph queries, the equivalent SQL, if one exists, can involve many joins and unions and it is unlikely the RDBMS query optimizer could optimize those queries well (especially given the higher use of self-joins). It may also be difficult to choose among different ways to map a graph query into SQL. Queries that require recursion (e.g., reachability) are difficult to execute in a relational database, but are natural for graph databases. Graph databases can also employ specific optimizations geared towards graph queries and traversals. For example, graph databases typically store all the edges for a node with the node to avoid joins, and such denormalization can significantly help with traversal queries, especially queries that traverse multiple types of edges simultaneously (e.g., for subgraph pattern matching). Replicating node information with neighbors can reduce the number of cache misses and distributed traversals for most graph queries (at the expense of increased storage and update costs). Similarly, min cut-based graph partitioning techniques help in reducing the number of distributed queries or transactions, and similar optimizations can be effective in multi-core environments as well. On the other hand, there is less work on query optimization in graph databases, and for simple queries (especially simple subgraph pattern matching queries), the query optimizer in relational databases may make better decisions than many of today’s graph databases.
I think exploring such optimizations and understanding the tradeoffs better are rich topics for further research. For example, how the graph is laid out, both in persistent storage and in memory, can have a significant impact on the performance, especially in multi-core environments. We also need to better understand the common access patterns that are induced by different types of queries or tasks, and the impact of different storage representations on the performance of those access patterns. Another key challenge for graph querying is developing a practical query language. There is much theoretical work on this problem [Woo12] and several languages are currently used in practice, including SPARQL, Cypher, Gremlin, and Datalog. Among those, SPARQL and Cypher are based primarily on subgraph pattern matching and can handle a limited set of queries, whereas Gremlin is a somewhat low-level language and may not be easy to optimize. Datalog (used in LogicBlox and Datomic) perhaps strikes the best balance, but is not as user-friendly for graph querying and may need standardization of some of the advanced constructs, especially aggregates.
Unfortunately I see little work on these problems, and on end-to-end graph databases in general, in our community. There are quite a few graph data management systems being actively built in the industry, including Neo4j, Titan, OrientDB, Datomic, DEX, to name a few, where these issues are being explored. Much of the work in our community, on the other hand, is more narrowly focused on developing search algorithms and indexing techniques for specific types of queries; while that work has resulted in many innovative techniques, for wide applicability and impact, it is also important to understand how those fit into a general-purpose graph data management system.
Large-scale Graph Analytics
Unlike the above scenario, the case of new systems for graph analysis tasks, broadly defined to include graph algorithms and network science and graph mining tasks, is more persuasive. Batch analytics systems like relational data warehouses and MapReduce-based systems are not a good fit for graph analytics as is. From the usability perspective, it is not natural to write graph analysis tasks using those programming frameworks. Some graph programming models (e.g., the vertex-centric programming model [Mal10]) can be supported in a relational database through use of UDFs and UDAs [Jin14]; however it is not clear if the richer programming frameworks (discussed below) can also be efficiently supported. Further, many graph analysis tasks are inherently iterative, with many iterations and very little work per vertex per iteration, and thus the overheads of
those systems may start dominating and may be hard to amortize away.
A more critical question, in my opinion, is whether the popular vertex-centric programming model is really a good model for graph analytics. To briefly recap, in this model, users write vertex-level compute programs, that are then executed iteratively by the framework in either a bulk synchronous fashion or asynchronous fashion using message passing or shared memory. This model is well-suited for some graph processing tasks like computing PageRank or connected components, and also for several distributed machine learning and optimization tasks that can be mapped to message passing algorithms in appropriately constructed graphs [Low12]. Originally introduced in this context in Google’s Pregel system [Mal10], several graph analytics systems are built around this model (e.g., Giraph, Hama, GraphLab, PowerGraph, GRACE, GPS, GraphX).
However, most graph analysis tasks (e.g., the popular modularity optimization algorithm for community detection, betweenness centralities) or graph algorithms (e.g., matching, partitioning) cannot be written using the vertex-centric programming model while permitting efficient execution. The model limits the compute program’s access to a single vertex’s state and so the overall computation needs to be decomposed into smaller local tasks that can be (largely) independently executed; it is not clear how to do this for most of the computations discussed above, without requiring a large number of iterations. Even local neighborhood-centric analysis tasks (e.g., counting motifs, identifying social circles, computing local clustering coefficients) are inefficient to execute using this model; one could execute such a task by constructing multi-hop neighborhoods in each node’s local state by exchanging neighbor lists, but the memory required to hold that state can quickly make it infeasible [Qua14]. I believe these limitations are the main reason why most of the papers about this model focus on a small set of tasks like PageRank, and also why we don’t see broad adoption of this model for graph analysis tasks, unlike the MapReduce framework, which was very quickly and widely adopted.
Some of the alternative, and more expressive, programming models proposed in recent years include distributed Datalog-based framework used by Socialite [Seo13], data-centric programming models of Ligra [Shu13] and Galois [Ngu13], Green-Marl DSL [Hon12], and NScale framework from our recent work [Qua14]. Unlike the vertex-centric frameworks, however, distributed data-parallel execution is not straightforward for these frameworks, and investigating the trade-offs between expressiveness, ability to parallelize the computations, and ease-of-use remains a crucial challenge. The equivalence between graph analytics and matrix operations [Mat13], and whether that leads to better graph analysis systems, also need to be explored in more depth.
On a related note, the need for distributed execution of graph processing tasks is often taken as a given. However, graphs with 10’s to 100’s of billions of edges can be loaded onto a single powerful machine today, depending on the amount of information per node that needs to be processed (Ligra reports experiments on a graph with 12.9 billion edges with 256GB memory [Shu13]; in our recent work, we were able to execute a large number of streaming aggregate queries over a graph with 300 million edges on a 64GB machine [Mon14a]). Given the difficulty in distributing many graph analysis/querying tasks, it may be better to investigate approaches that eliminate the need to execute any single query or task in a distributed fashion (e.g., through aggressive compression, careful encoding of adjacency lists, or careful staging to disk or SSD (a la GraphChi)), while parallelizing independent queries/tasks across different machines.
Other Open questions
Despite much work, there are many important and hard problems that remain open in graph data management, in addition to the ones discussed above; more challenges are likely to come up as graph querying and analytics are broadly adopted.
Need for a Benchmark: Given the complex tradeoffs, many of the questions discussed above would be hard to answer without some representative workloads and benchmarks, especially because the performance of a system may be quite sensitive to the skew in the degree distribution and the graph diameter. Some of issues, e.g., storage representation, have been studied in depth in the context of RDF triple-stores, but the benchmarks established there appear to focus on scale and do not feature sufficient variety in the queries. A benchmark covering a variety of graph analysis tasks would also help significantly towards evaluating and comparing the expressive power and the performance of different frameworks and systems. Benchmarks would also help reconcile some of the recent conflicting empirical comparisons, and would help shed light on specific design decisions that impact performance significantly.
Temporal and real-time analytics: Most real-world graphs are highly dynamic in nature and often generate large volumes of data at a very rapid rate. Temporal analytics or querying over historical traces can lead to deeper insights into various phenomena, especially those related to evolution or change. One of the key challenges here is how to store the historical trace compactly while still enabling efficient execution of point queries and global or neighborhood-centric analysis tasks [Khu13]. Key differences from temporal databases, a topic that has seen much work, appear to be the scale of data, focus on distributed and in-memory environments, and the need to support global analysis tasks (which usually require loading entire historical snapshots into memory). Similarly real-time querying and analytics, especially anomaly detection, present several unique challenges not encountered in relational data stream processing [Mon14b].
Graph extraction: Another interesting, and practically important, question is how to efficiently extract a graph, or a collection of graphs, from non-graph data stores. Most graph analytics systems assume that the graph is provided explicitly. However, in many cases, the graph may have to be constructed by joining and combining information spread across a set of relations or files or key-value stores. A general framework that allows one to specify what graph or graphs need to be constructed for analysis, and how they map to the data stored in the persistent data stores would significantly simplify the end-to-end process of graph analytics. Even if the data is stored in a graph data store, often we only need to load a set of subgraphs of that graph for further analysis [Qua14], and similar framework would be needed to specify what subgraphs are to be extracted.
The above list is naturally somewhat skewed towards the problems we are working on in our group at Maryland, and towards developing general-purpose graph data management systems. In addition, there is also much work that needs to be done in the application areas like social media, finance, cybersecurity, etc.; in developing graph analytics techniques that lead to meaningful insights in those domains; in understanding what types of query workloads are typical; and in handling those over large volumes of data.
Blogger Profile: Amol Deshpande is an Associate Professor in the Department of Computer Science at the University of Maryland with a joint appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS). He received his Ph.D. from University of California at Berkeley in 2004. His research interests include uncertain data management, graph analytics, adaptive query processing, data streams, and sensor networks.
Copyright @ 2014, Amol Deshpande, All rights reserved.
Comments are closed