June 16, 2017
Graphs are a ubiquitous model to represent objects and their relations. Examples include communication and computer systems, the World Wide Web, online social networks, biological networks, transportation systems, epidemic networks, chemical networks, and hidden terrorist networks. All these systems can be modeled as graphs, in which individual components interact with a specific set of components, resulting in massive, interconnected, and heterogeneous networks. To query these networks, one often needs to identify the matches of a given query graph in a typically large network modeled as a target graph. This problem is referred to as graph pattern matching, which is increasingly becoming relevant in knowledge graph exploration, biological network analysis, cheminformatics, root-cause analysis, malware, intrusion, and fraud detection, website classification, social search, and network traffic and intelligence monitoring, among many others. The most recent example is, perhaps, the 2017 Pulitzer Prize-winning Panama Papers investigation, led by the International Consortium of Investigative Journalists (ICIJ), which exposed highly connected networks of offshore tax structures used by the world’s richest elites. These structures were uncovered from leaked financial documents, and then connected together into a graph in Neo4j, which was made accessible using Linkurious’ visualization application.
Broadly speaking, graph matching queries can be classified into three classes:
1. Subgraph/supergraph containment query,
2. Graph similarity search, and
3. Graph pattern matching.
In the first type of query, the input is a graph database consisting of many graphs—however, each graph in the database is usually smaller in size (e.g., chemical compound structures, call graphs with tens of thousands of nodes and edges)—and the results are all those graphs from the database which are subgraphs/supergraphs of the query graph. In the second category, the input also consists of a graph database with several graphs, and the results are those graphs that are graph-isomorphic (or similar) to the query graph. In the third category, the input is only one large graph, and the query identifies all occurrences of the query graph in that data graph. We shall focus on the third category, i.e., the graph pattern matching query, which is to identify all occurrences of a small query graph in a large target graph. With the prevalence of large-scale networks (having millions of nodes and billions of edges), such as knowledge graphs, social networks, XML and web graphs, protein interaction and brain networks, the problem of effectively and efficiently finding all occurrences of a small query graph in a large target network is of high importance.
Graph pattern matching queries have three primary aspects: (a) problem definition, (b) users’ query formulation, and (c) query processing algorithms and systems. The first aspect deals with formally defining the problem, such as graph pattern matching via a strict notion of subgraph isomorphism. However, due to several practical reasons that we shall introduce shortly, often a relaxed notion of graph pattern matching is preferred, allowing more flexibility and provisions for approximation. The second aspect of a graph pattern matching query requires the user to formulate and input her query, e.g., by traditional means such as writing a declarative SQL/SPARQL/Cypher query. This is where the user-friendliness of graph pattern matching queries comes into the picture, and we shall elaborate on it in the second half of our article. The third aspect, of course, is query processing (both algorithms and systems), including various database, data mining, and machine learning techniques, such as graph indexing and partitioning, query decomposition, enumeration, and optimization, distributed graph processing systems, database joins vs. graph exploration, etc., which are interesting research problems in their own right. In fact, whether we require dedicated graph databases and graph processing engines, as opposed to legacy relational database management systems, has received a lot of attention in recent times, both in academia and in industry [2, 3]. We, however, carefully limit the scope of this article. In the following, we primarily focus on the first two aspects – approximation and user-friendliness of graph pattern matching queries – along with associated challenges, resolution techniques, and open problems.
Graph pattern matching can be defined via a strict notion of subgraph isomorphism, which is an injective function from the nodes in the query graph to the nodes in the target graph, and allows one-to-one matching between the nodes and edges. While the decision version of subgraph isomorphism is NP-complete, in practice the problem can often be solved in polynomial time (in the size of the target graph) due to: (i) the bounded size of the query graph, and (ii) the consideration of labels on nodes and edges. The notion of labeled graphs is practically useful to reduce the size of the search space, because a query node (resp. a query edge) can only be matched with target nodes (resp. target edges) having similar labels. In general, more selective labels and structures in the query graph reduce the number of possible matches. In the case of labeled graphs, if the query graph has k nodes v1, v2, …, vk, and if the number of candidate nodes (in the target graph) based on label matching for each query node vi is |C(vi)|, then the search space has size |C(v1)| × |C(v2)| × … × |C(vk)|. Note that although the problem becomes tractable, the search space can still be quite large for massive target graphs, and also due to the presence of less selective query nodes. Therefore, even for labeled graphs, enumerating all possible answer graphs within the search space can be expensive, leading the way towards more efficient definitions of graph pattern matching (e.g., graph simulation and bisimulation).
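As a toy illustration of this search-space estimate, the following Python sketch (the labels and node IDs are made-up example data) computes each candidate set C(vi) by label matching and multiplies their sizes:

```python
from math import prod

# Toy labeled target graph: node -> label (hypothetical example data)
target_labels = {1: "A", 2: "A", 3: "B", 4: "B", 5: "B", 6: "C"}

# Query nodes v1, v2, v3 with their labels
query_labels = {"v1": "A", "v2": "B", "v3": "C"}

# Candidate set C(v): target nodes whose label matches the query node's label
candidates = {
    v: [u for u, lab in target_labels.items() if lab == ql]
    for v, ql in query_labels.items()
}

# Search-space size |C(v1)| x |C(v2)| x ... x |C(vk)|
search_space = prod(len(c) for c in candidates.values())
print(search_space)  # 2 * 3 * 1 = 6 candidate assignments to check
```

A more selective label (here "C", with a single candidate) shrinks the product accordingly; an unlabeled query node would contribute a factor equal to the entire target node count.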
In addition to complexity, there could be several other reasons for opting for relaxed definitions. Some applications might require matching multiple query nodes to one target node, and vice versa, i.e., one query node to multiple target nodes. Graph homomorphism allows the former, while both graph simulation and bisimulation define a “relation” (as opposed to a “function”, as in the case of subgraph isomorphism and graph homomorphism) between the query and target nodes, thereby permitting one query node to be matched with multiple target nodes. The decision versions of the subgraph isomorphism and graph homomorphism problems are NP-complete. Graph simulation and bisimulation, on the other hand, can be computed in polynomial time.
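To make the contrast concrete, here is a minimal Python sketch of graph simulation computed by fixed-point refinement (a simplified rendering of the standard polynomial-time algorithm; the graph representation and all names are our own):

```python
def graph_simulation(q_nodes, q_edges, t_nodes, t_edges, q_label, t_label):
    """Maximal simulation relation from query to target.

    q_edges / t_edges map a node to the set of its successors.
    Returns {query node -> set of matching target nodes}; an empty
    dict means some query node has no match (no simulation exists).
    """
    # Initialize each sim(v) by label matching
    sim = {v: {u for u in t_nodes if t_label[u] == q_label[v]} for v in q_nodes}
    changed = True
    while changed:  # refine until a fixed point is reached
        changed = False
        for v in q_nodes:
            for v2 in q_edges.get(v, ()):
                # u can match v only if some successor of u matches v2
                keep = {u for u in sim[v] if t_edges.get(u, set()) & sim[v2]}
                if keep != sim[v]:
                    sim[v], changed = keep, True
    return {} if any(not s for s in sim.values()) else sim

# Query: an A-node with an edge to a B-node. Target node 2 is an A-node
# without any B-successor, so refinement prunes it from sim("x").
result = graph_simulation(
    q_nodes=["x", "y"], q_edges={"x": {"y"}}, q_label={"x": "A", "y": "B"},
    t_nodes=[1, 2, 3], t_edges={1: {3}}, t_label={1: "A", 2: "A", 3: "B"},
)
print(result)  # {'x': {1}, 'y': {3}}
```

Note how sim is a relation: had the target contained several A-nodes with B-successors, "x" would match all of them at once, which subgraph isomorphism would enumerate one by one.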
Among the other reasons for opting for a relaxed definition of graph pattern matching could be the presence of noise and ambiguity in the target graph – it might not always conform to a fixed schema, or it may not even have one. As an example, we can think of the RDF data model for the target graph, which is semi-structured in nature (compare it with the relational model, where the entities, their attributes, and relationships to other entities are strictly defined) – the schema can evolve over time, which facilitates data merging even if the underlying schemas differ. Besides, due to the heterogeneous nature of large networks (e.g., the Google knowledge graph), the schema itself can be quite large and complex, and the users may not have the expertise or time to completely understand the schema before issuing a small query. Often, users are not certain about their precise information need, hence they cannot adequately formulate the query, or they could simply be exploring a large network. These scenarios make it unrealistic to find exact matches based on a strict definition of graph pattern matching (there might be no exact matches). Rather, it would be more appealing to find the top-k approximate matches according to some cost function, where the cost function measures the similarity between the query graph and its match found in the target graph. Providing the top-k matches also enables a user to re-formulate her query, and possibly obtain more relevant answers in the case of graph exploration.
The definition of cost functions to find the top-k matches can be application specific. More traditional measures include graph edit distance, maximum common subgraph, number of missing edges, and various graph kernels. Approximate graph matching techniques such as PathBlast, SAGA, NetAlign, and IsoRank target bioinformatics applications. G-Ray aims to maintain the shape of the query. TALE uses edge misses to measure the quality of a match. We developed NESS and NeMa to incorporate more flexibility in graph pattern matching – an edge in the query graph can be matched with a path (up to a certain number of hops) in the target graph. Hence, both these methods identify the top-k matches based on the proximity among entities in the neighborhood, which we find more effective in dealing with higher amounts of noise and mismatch between the query graph and its probable matches in the target graph, specifically in the domain of knowledge graphs and social networks.
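The edge-to-path relaxation can be illustrated as follows (an illustrative simplification of the idea only, not the actual NESS/NeMa algorithms): a candidate node mapping is accepted when every query edge can be stretched into a target path of at most h hops.

```python
def within_h_hops(adj, src, dst, h):
    """BFS check: is dst reachable from src within h hops?"""
    frontier, seen = {src}, {src}
    for _ in range(h):
        nxt = set()
        for u in frontier:
            for w in adj.get(u, ()):
                if w == dst:
                    return True
                if w not in seen:
                    seen.add(w)
                    nxt.add(w)
        frontier = nxt
    return False

def edge_relaxed_match(q_edges, mapping, t_adj, h=2):
    """Every query edge (v, v2) must map to a target path of <= h hops."""
    return all(within_h_hops(t_adj, mapping[v], mapping[v2], h)
               for v, v2 in q_edges)

# Query edge (a, b) mapped onto target nodes 1 and 3, which are
# connected only via the 2-hop path 1 -> 2 -> 3.
t_adj = {1: [2], 2: [3]}
print(edge_relaxed_match([("a", "b")], {"a": 1, "b": 3}, t_adj, h=2))  # True
print(edge_relaxed_match([("a", "b")], {"a": 1, "b": 3}, t_adj, h=1))  # False
```

Setting h = 1 recovers strict edge matching; larger h tolerates intermediate (possibly noisy or missing-label) nodes on the path at the price of a larger match space.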
The above is definitely not a complete list of works on relaxed graph pattern matching. More work is necessary to understand the approximation requirements for different applications and domains (e.g., bioinformatics vs. knowledge graph applications), and how effective and efficient the current methods are across these domains. It would also be interesting to learn user-specific requirements, as opposed to providing a set of approximation definitions (e.g., graph homomorphism vs. graph simulation vs. top-k matching) as black boxes.
One major challenge for graph pattern matching is that non-professional users are not familiar with the complex schema and the varied ways in which information is described. It becomes very hard for users to formulate a query (e.g., a SPARQL or subgraph pattern query) that can be properly processed by the existing systems. We envision a user-friendly graph query engine to have three features:
1. to be able to interpret high-level graph analytical queries (expressed, e.g., by keywords or natural language) into well-optimized low-level operators and algorithms,
2. to leverage learning methods to derive adaptive result ranking models, and
3. to be equipped with an interpretation layer that “translates” the answers back to users.
Such a system fills the gap between end-users and query evaluation. There has been ongoing work on each of these components: schema-less and structure-less search; various easy-to-understand transformations such as synonyms, abbreviations, and ontologies; and learning graphical models to rank answers. Graph summarization provides interpretable patterns to help both result explanation and query evaluation. A promising direction can be the development of user-friendly querying systems that integrate all three aspects. The bulk of the literature on user-friendly graph pattern matching queries has predominantly employed four approaches: (i) graph keyword search, (ii) keyword query reformulation/rewriting, (iii) graph query-by-example, and (iv) visual graph query.
In graph keyword search, the query involves a set of keywords (e.g., node labels) and the result is either a connected sub-tree or a connected sub-graph containing all query keywords. Various ranking criteria also exist to find the top-k answers, e.g., the sum of all edge weights in the resulting tree/graph, the sum of all path weights from the root to each keyword in the tree, the maximum pairwise distance among nodes, etc. Notable methods include BANKS, bidirectional search, and BLINKS. In essence, keyword search identifies the most relevant regions of the target graph containing those keywords. The query reformulation technique, on the contrary, converts the keyword query into a more structured format, e.g., a SPARQL or SQL query, and then evaluates it. Given the set of keywords, the top-k structured queries are identified by considering term similarity, their co-occurrences, and semantic relationships in the target graph. A relatively new paradigm in the area of user-friendly graph querying is graph query-by-example.
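A BANKS-style root ranking can be sketched as below (a toy version under our own assumptions: unit edge weights and a node-to-keyword-set labeling; each candidate root is scored by the sum of its distances to the nearest node carrying each keyword):

```python
from collections import deque

def bfs_dist(adj, src):
    """Unweighted shortest-path distances from src."""
    dist, dq = {src: 0}, deque([src])
    while dq:
        u = dq.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                dq.append(w)
    return dist

def keyword_roots(adj, labels, keywords, k=3):
    """Top-k answer roots: score = sum over keywords of the
    distance from the root to the closest node carrying that keyword."""
    hits = {kw: [u for u, labs in labels.items() if kw in labs]
            for kw in keywords}
    scores = {}
    for r in adj:
        d = bfs_dist(adj, r)
        per_kw = [[d[u] for u in hits[kw] if u in d] for kw in keywords]
        if all(per_kw):  # the root must reach every keyword
            scores[r] = sum(min(reach) for reach in per_kw)
    return sorted(scores.items(), key=lambda x: x[1])[:k]

# Star-shaped toy graph: hub 0 connects nodes labeled with the keywords.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
labels = {0: set(), 1: {"alice"}, 2: {"bob"}, 3: {"carol"}}
print(keyword_roots(adj, labels, ["alice", "bob"]))
```

Each ranked root together with its shortest paths to the keyword nodes induces the answer tree; real systems such as BANKS additionally weight edges and grow these paths backwards from the keyword nodes rather than running a full BFS per root.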
Query-by-example (QBE) has a long history in relational databases, HTML tables, and entity sets. Exemplar query and GQBE adapted similar ideas over knowledge graphs. In particular, the user may find it difficult to precisely write her query, but she might know a few answers to it. A graph query-by-example system will allow her to input the answer tuple as a query, and will return other similar tuples that are present in the target graph. The underlying system follows a two-step approach. Given the input example tuple(s), it first identifies the query graph that captures the user’s query intent. Then, it evaluates the query graph to find other relevant answer tuples. Complementary to these approaches, visual graph querying provides a visual interface to directly input the graph query. In addition, it can also provide guidelines for constructing graph queries visually, interleaving the processing of graph queries with query construction, as well as visual exploration of query results.
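The two-step QBE workflow can be sketched on a toy triple store (our own simplification: the inferred “query graph” degenerates to the edge label connecting the example pair, whereas real systems such as GQBE discover richer multi-edge patterns):

```python
def qbe_pairs(triples, example):
    """Two-step query-by-example sketch over (subject, label, object) triples:
    1) infer the query pattern: the edge label(s) linking the example pair;
    2) evaluate it: return other pairs linked by the same label(s)."""
    a, b = example
    pattern = {lab for s, lab, d in triples if (s, d) == (a, b)}
    return [(s, d) for s, lab, d in triples
            if lab in pattern and (s, d) != (a, b)]

# Hypothetical knowledge-graph facts
triples = [
    ("Jobs", "founded", "Apple"),
    ("Gates", "founded", "Microsoft"),
    ("Jobs", "born_in", "San Francisco"),
]
print(qbe_pairs(triples, ("Jobs", "Apple")))  # [('Gates', 'Microsoft')]
```

The example tuple ("Jobs", "Apple") stands in for the user's unformulated intent "founders and their companies", and evaluation returns the analogous tuples.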
Scalability and usability are two sides of the same coin for big graph analysis. There have been several methods for querying large graphs: 1) indexing, which reduces the search space via auxiliary indices; 2) compression, accessing the graphs via limited decompression to answer queries; 3) data synopses for aggregate queries using histograms, wavelets, clustering, and sampling; 4) distributed and parallel programming models, including the bulk-synchronous model (e.g., Pregel), the asynchronous vertex-centric model (e.g., GraphLab), and the graph-centric model; and 5) approximate evaluation, which rewrites queries to computationally more efficient counterparts with a tolerable loss in answer quality. In general, it takes great programming and tuning effort to optimize queries over graph data.
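As a small example of the first technique (again a sketch with our own data structures, not a specific published index), an inverted label index combined with 1-hop neighbor-label signatures prunes candidate nodes before any subgraph matching runs:

```python
from collections import defaultdict

def build_index(t_label, t_adj):
    """Inverted label index + 1-hop neighbor-label signature per node."""
    by_label = defaultdict(set)
    for u, lab in t_label.items():
        by_label[lab].add(u)
    sig = {u: {t_label[w] for w in t_adj.get(u, ())} for u in t_label}
    return by_label, sig

def prune_candidates(v_label, v_nbr_labels, by_label, sig):
    """Keep target nodes with the right label whose neighborhoods
    contain all neighbor labels the query node requires."""
    return {u for u in by_label.get(v_label, ())
            if v_nbr_labels <= sig[u]}

t_label = {1: "A", 2: "A", 3: "B"}
t_adj = {1: [3], 2: [], 3: [1]}
by_label, sig = build_index(t_label, t_adj)
# Query node: label "A" with a "B"-labeled neighbor -> node 2 is pruned.
print(prune_candidates("A", {"B"}, by_label, sig))  # {1}
```

Both structures are built once offline; at query time, every pruned candidate shrinks the |C(v1)| × … × |C(vk)| search space before the (expensive) matching phase begins.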
Recent research has started to “kill two birds with one stone”, addressing both scalability and usability. (1) Resource-bounded search aims to “make Big Data small”, by (automatically) inducing the most relevant fraction of the graph w.r.t. a query first, and then applying standard querying algorithms that access only the selected (typically small) data. Such selection of “small data” may also be supported by learning techniques, reducing the tuning effort from end users. (2) A parallel system that can support “semi-automatic” parallelization of sequential graph algorithms has been developed. Users simply specify three sequential algorithms that perform local computation at each processor, incrementally update local answers, and assemble answers, respectively. The system “parallelizes” the computation without the effort of recasting the algorithms into their parallel counterparts, which is already hard for end users. Such usability is especially desirable for data centers and cloud applications. Scalable methods and tools that can reduce the programming and performance tuning effort can be an opportunity for Big Data systems in general.
Copyright © 2017, Arijit Khan and Yinghui Wu. All rights reserved.