May 12, 2017
I think we need to take a hard look at approximate query processing.
Don’t get me wrong. The vision of approximate query processing is indeed compelling. In the age of Big Data, the ability to answer analytic queries “approximately”, but at a fraction of the cost of executing the query in the traditional way, is attractive. For batch analytic queries, approximate query processing can increase throughput, and for interactive queries, approximate query processing can reduce latency. Thus, if we can realize the potential of approximate query processing, we may be able to speed up our ability to explore a vast amount of data inexpensively and quickly.
So, what’s the problem with the state-of-the-art in approximate query processing? Well, we have sold the community on the vision, have had many technical results, and even built prototype database engines to support approximate query processing. But, are we close to seeing approximate query processing techniques go mainstream and be offered with data platforms? Here is my litmus test to answer that question – do approximate query processing techniques truly reduce the cost (or latency) of executing queries by an order of magnitude with small error (negligible loss of accuracy) and able to handle most complex decision support queries? In my opinion, as of now, the answer is negative.
How do we quantify the error in answering a query? Even for a single block Select-Project-Join query with Group-By and aggregation, the problem of defining an error metric is hard. In the approximate answer, some groups could be missing altogether, and each of the groups can have their own confidence interval for their aggregate value. Is that the ideal way to capture the accuracy of an approximate answer or should we attempt to define a single error metric for the entire query that captures cumulative errors in aggregation across groups as well as the missed groups? This question only gets more difficult for more complex SQL queries possibly with user-defined functions, that are prevalent in the workloads of today’s Big Data systems. Even after we manage to define a satisfactory notion of error metric, we still need to understand for what data distributions and what types of data and for what class of queries our approximation techniques apply.
Of course, the key promise of approximate query processing is that it should be much less expensive than traditional query execution. Sampling is often used for approximate query processing since it is a well-understood general-purpose data reduction technique. Nonetheless, sampling on the fly without the support of a physical access method will in many cases fall short of reducing latency of queries by an order of magnitude. In contrast, the use of pre-computed samples can give us significant speed-ups. However, it turns out that the pre-computation needed to handle small groups and highly selective queries can be significant. Empirical studies have not convincingly shown us that we can achieve an order of magnitude acceleration in executing complex ad-hoc queries using approximate query processing while ensuring small errors and without paying for excessive pre-computation.
Techniques based on online aggregation take a different approach. In this model, partial result of the query will serve as its approximate answer with statistical guarantees while converging to the accurate and complete answer when the query ends. There is a significant body of research in the context of online aggregation, including development of prototype query engines. To ensure the statistical properties of the partial answers, online aggregation requires specialized join operators and relies on randomized data layouts. Extending the benefits of online aggregation to complex SQL queries has been a challenge, much like approximate query processing techniques based on pre-computation.
Reflecting on the journey so far, I wonder if we need to scale back our ambition. It may just be that we cannot have it all – applicability to a wide surface of query language and data distributions, limited precomputation, an order of magnitude reduction of resource needs (or latency), and small errors.
I believe that for arbitrary complex SQL queries it will be very hard to define a notion of error for approximate answers that is at once general as well as useful for application programmers and data scientists. Instead, for such queries, we should extend data platforms with sampling operators that will act as building blocks for injecting approximation by programmers in the queries using their knowledge of applications and data characteristics. The sampling operators help save computation (but not typically by an order of magnitude) or may be used to create a sample from the database that the data scientists use for their data analytic tasks repeatedly. However, to allow the data scientist to leverage sampling, we need to go beyond the rather basic support for uniform random sampling that exists on all query platforms. How can we allow data scientists or developers to sample effectively at the output of a join operator without paying the full cost of the join computation and subsequent sampling? Similarly, to sample the output of a group by operator, we need a new sampling operator. The introduction of these operators also brings new challenges and opportunities in query optimization. Each of these sampling operators will have precise semantics of its output and error “locally”. But, the system will have no built-in guarantees of global error bound on approximation for the end-to-end query.
The other direction that is promising is to focus on a smaller but the important query surface of OLAP that is widely used in business intelligence scenarios. For this class, the pre-computation based techniques can extend the ability of the OLAP systems to give approximate answers to queries over much larger data sets that may not fit in memory. Unlike the case of complex SQL queries, we can provide useful error guarantees for a popular subset of OLAP queries. The pre-computation based approximate query processing techniques enable us to accelerate OLAP queries that if executed in the traditional way, would require significantly more resources or would take a long time (sometimes by an order of magnitude) because of large volume of data. By focusing exclusively on a subset of OLAP queries, these systems can also limit the extent of pre-computation they require.
At Microsoft Research, we are exploring the above two directions that make different trade-offs in their goals for approximate query processing since it seems that we cannot have it all.
Let’s recognize that with approximate query processing we are targeting a difficult task. Traditionally, approximations are injected explicitly by application programmers who are aware of the semantics and the needs of their application. As an example, in the database context, we have used sampling-based techniques effectively for estimating statistics for database objects that are needed for query optimization. In contrast, the goal of approximate query processing techniques is to simplify the job of application programmers who may otherwise have to code custom approximations. Therefore, it is important to carefully examine how well we serve application programmers or data scientists through our approximate query processing techniques. But, this leads to a chicken-and-egg problem because without exposing the approximate query processing functionality broadly, it is impossible to do such evaluation. I believe that approximate query processing is much more likely to reach the users if we can demonstrate convincingly that it can be integrated with today’s query engines with modest changes, giving us an opportunity to do extensive evaluation that will help drive new research in this area.
As data volumes continue to skyrocket in the age of Big Data, the vision of approximate query processing is more compelling than ever. We need our community to take a fresh look at this problem and experiment with disruptive thinking that range from new user interaction modes to novel query processing techniques. The Program Committee of SIGMOD has put together a “Grand Challenge” session on approximate query processing at ACM SIGMOD 2017 that I will participate in. I hope many of you will be there and bring your exciting ideas to unlock the promise of approximate query processing. Additional reflections on this topic are in the paper “Approximate Query Processing: No Silver Bullet” (SIGMOD 2017).
Surajit Chaudhuri is a Distinguished Scientist at Microsoft Research and leads the Data Management, Exploration and Mining group. He also acts as the liaison between Microsoft Research and the Senior Leadership Team of Microsoft’s Cloud and Enterprise Division. Surajit’s current areas of interest are enterprise data analytics, self-manageability and cloud database services. Working with his colleagues in Microsoft Research, he helped incorporate the Index Tuning Wizard (and subsequently Database Engine Tuning Advisor) and Data Cleaning technology into Microsoft SQL Server. Surajit is an ACM Fellow, a recipient of the ACM SIGMOD Edgar F. Codd Innovations Award, ACM SIGMOD Contributions Award, a VLDB 10-year Best Paper Award, and an IEEE Data Engineering Influential Paper Award. Surajit received his Ph.D. from Stanford University.
Copyright @ 2017, Surajit Chaudhuri, All rights reserved.