September 13, 2023

Databases, Query Processing No Comment

In the last decade, the database community has identified cardinality estimation as the primary stumbling block for modern query optimizers. Cardinality estimates, which estimate the size of sub-plan queries, are the primary basis for choosing between query plans, so poor estimates may result in catastrophic query execution plans. Research on this topic has consistently emphasized a few findings: 1) traditional query optimizers regularly underestimate by orders of magnitude, 2) the bad query plans are often orders of magnitude slower than good ones, and 3) underestimates are worse than overestimates.

To see this, consider a few graphs from the seminal 2015 benchmarking paper “How good are query optimizers really?”. The first shows the performance of a few commercial query optimizers on the cardinality estimation problem. Even a series of battle-tested commercial optimizers consistently underestimate by orders of magnitude on this benchmark. Intuitively, this is because they all rely on some form of independence assumptions to simplify the problem. For instance, many of them assume that WHERE predicates on different columns randomly and independently filter rows. However, this assumption is almost always violated because columns are generally correlated, e.g. city and state columns.

The second figure demonstrates the performance of all possible plans for a few queries under different index configurations. For most queries, performance differs wildly between plans. A small proportion (~5%) of plans are within 10x of the optimal plan’s performance, but the vast majority of plans are 2-4 orders of magnitude worse. In a perfect world, the query optimizer would simply choose the optimal plan, but this would require perfect cardinality estimates, which is impossible. Given this limitation, it makes sense to aim for an optimizer that simply avoids catastrophically bad plans. Most of these plans contain a sub-plan with a large intermediate join result which brings us to the core problem of underestimating cardinalities; A single underestimate of a large intermediate result can push the query optimizer to include that sub-plan, resulting in an overall plan that is much, much slower than the optimizer expects. For example, suppose that cardinality estimator A produces perfect cardinality estimates for 99% of intermediate results but drastically underestimates 1% of them. Despite estimator A’s incredibly high average accuracy, the query optimizer will always select plans that incorporate these underestimated intermediates. On the other hand, suppose that cardinality estimator B produces perfect cardinality estimates for 99% of intermediates and drastically overestimates 1% of them. In this case, the query optimizer will simply ignore plans with bad estimates. If a single good plan is accurately estimated, then the optimizer will still produce a good query plan. In other words, systems that overestimate only need to be accurate for *at least one* good plan, while systems that underestimate need to be accurate for *every* bad query plan. This asymmetric risk is why recent benchmarks have shown much better performance when systems consistently produce overestimates.

Proposed methods for this problem have generally fallen into one of three categories: traditional, learned, and bound-based. Each of these angles has its own benefits and drawbacks to consider. We summarize the methods in all these categories, along with our newly proposed methods (FactorJoin and Safebound), in the following table.

The traditional cardinality estimators are the most commonly used method in DBMSes. The traditional histogram-based methods generally adopt attribute independence and join uniformity assumptions to decompose the join queries as a combination of single table estimates. They are very efficient, easy to train, update and maintain, thus perfect for system deployment. However, their simplifying assumptions can generate erroneous estimates and poor query plans.

Sampling-based methods also have a long history in the database community. There are two fundamental issues facing sampling methods: the exponential decline of sample sizes as the number of joins increases and how to handle estimates when the sample doesn’t include any satisfying tuples. To illustrate the first problem, consider a simple FK-PK join where rows are sampled from R and S independently with probability p. The likelihood that a row in the sample of R joins with a row in the sample of S is 1/p. Immediately, the sampling probability for tuples in R effectively goes from 1/p to 1/p^2. This becomes exponentially worse as more tables are added to the query. For the second, consider a query with a filter predicate that doesn’t match any tuples in our sample of R. At this point, the estimation algorithm has some information to infer the number of tuples in R, but it has no information to estimate how the filtered subset of R would interact with other tables in the query. The most promising sampling methods, such as WanderJoin, tend to avoid this problem by using dynamic sampling rather than computing samples offline, and they achieve very robust accuracy. However, this introduces further concerns about the maintenance of indices, the latency of accessing data during optimization, and increased code complexity.

Τhe learned cardinality estimators use the machine learning (ML) model to relax or eliminate the simplifying assumptions made by traditional methods. Their sophisticated ML models can provide accurate estimation but are subject to expensive training/updating costs and slow inference speed.

Query-driven ML methods learn cardinality estimates by running an example workload of queries and training a traditional ML model from the resulting (query, cardinality) pairs. These methods can achieve high accuracy on the queries which come from the same distribution as the example workload, and they are often very fast to both train and perform inference. However, running these example queries can be very time-consuming, and they often suffer significantly when faced with updates or queries that differ significantly from the example workload.

Data-driven ML methods use unsupervised approaches which attempt to learn the data distribution directly rather than training on example queries. This allows for generalization to different query workloads and can be accurate depending on the complexity of the underlying models. However, the model complexity needed for accurate estimates grows quickly due to the inherent challenge of modeling all the complicated non-linear relationships between columns both within and between tables. Because of this, they generally take a long time to train and are slow to perform inference on, and have problems scaling to complicated schemas.

The last approach is grounded in database theory and provides provable upper bounds on the query size rather than an average-case estimate. When used in place of traditional estimates, this guarantees that there will be no underestimates. Because bounds on query sizes are helpful for complexity analysis, these bounds have been thoroughly studied by the theory community and have motivated areas of algorithm design, such as worst-case-optimal join algorithms. However, the bound formulas, which are interesting to theoreticians, have generally relied on minimal statistics about the underlying dataset, e.g., the size of each table and the maximum frequency of each column. This provides good interpretability and insight, but it generally does not result in tight bounds that could be practically used by a query optimizer. The paper “Pessimistic Cardinality Estimation” was the first work to attempt to adapt these formulas to cardinality estimation, and it managed to produce fairly tight bounds, which significantly improved query execution time. However, it couldn’t handle filter predicates (e.g., “R.A < 10”), and its inference latency increased exponentially in the granularity of its statistics.

To effectively support filter predicates and enable efficient inference, we describe two novel practical bound-based cardinality estimators: FactorJoin and SafeBound. FactorJoin formulates the problem of cardinality estimation of general multi-table join queries as a well-studied inference problem in the field of probabilistic graphical models. FactorJoin can quickly construct a factor graph model from the database schema in the offline phase, which will be used to efficiently deliver a probabilistic cardinality bound for any query. SafeBound draws on recent theoretical advances in cardinality bounding to produce deterministic bounds using degree sequence statistics.

These two approaches inherit the robustness of bound-based cardinality estimators while overcoming their drawbacks to provide flexible and efficient inference. The experimental results demonstrate that FactorJoin and SafeBound can achieve up to 80% lower end-to-end runtimes than PostgreSQL on well-established benchmarks. This performance is comparable to or better than the state-of-the-art (SOTA) machine learning methods. Furthermore, FactorJoin can achieve 40x less estimation latency, a 100x smaller model size, and 100x faster training/updating than the previous SOTA methods. SafeBound can save up to 500x in query planning latency and 6.8x less space when compared to previous SOTA methods.

Before looking at the formulation, let’s go through a simple example of a query joining two tables. The figure above illustrates query 𝑄 joining tables 𝐴 and 𝐵 on the inner join1 condition 𝐴.𝑖𝑑 = 𝐵.𝐴𝑖𝑑 with base table filter predicates 𝑄(𝐴) and 𝑄(𝐵). Table 𝐴 first goes through the filter 𝑄(𝐴), resulting in an intermediate table 𝐴|𝑄(𝐴) (records in 𝐴 that satisfy the filter 𝑄(𝐴)). The same procedure is applied to table 𝐵. Then, the query 𝑄 will match the value of join keys 𝐴.𝑖𝑑 and 𝐵.𝐴𝑖𝑑 from these two intermediate tables. Specifically, the value 𝑎 appears 8 times in table 𝐴|𝑄(𝐴) and 6 times in table 𝐵|𝑄(𝐵), resulting in value 𝑎 appearing 48 times in the join result. Therefore, we can calculate the cardinality of this query 𝑄 as 8 × 6 + 4 × 5 + 3 × 5 = 83.

We can formulate the above procedure for calculating 𝑄 as a statistical equation, where 𝐷(𝐴.𝑖𝑑) denotes the domain of all unique values of 𝐴.𝑖𝑑. We observe that only single-table distributions 𝑃 and 𝑃 are required to accurately compute the 𝐴𝐵 cardinality of this join query. Also in this equation, 𝑃_{𝐴}(𝐴.𝑖𝑑 = 𝑣|𝑄(𝐴)) ∗ |𝑄(𝐴)| equals to 𝑃_{𝐴}(𝐴.𝑖𝑑 = 𝑣 ∧ 𝑄(𝐴)) ∗ |𝐴|, which is exactly what single-table CardEst methods estimate. Thus, we can accurately calculate the cardinalities of two-table join queries using single-table estimators. Note that the summation over the domain of join key D(𝐴.𝑖𝑑) has the same complexity as computing the join. Therefore, we need to approximate this calculation for FactorJoin to be practical.

A general join query can involve a combination of different forms of joins (e.g., chain, star, self, or cyclic joins), so its counterpart equation of Equation 1 can be very difficult to derive and compute. We provide a generalizable formulation that automatically decomposes join queries into single-table estimations using the factor graph model. Let’s take a look at a more complicated example below.

Figure (i) shows a SQL query 𝑄 joining four tables 𝐴,𝐵,𝐶,𝐷. We visualize its join template as a graph in Figure 3-(ii), where each dashed rectangle represents a table, each ellipse (node) represents a join-key in 𝑄, and each solid line (edge) represents an equi-join relation between two join keys connected by it. Note that both sides of an equi-join relation represent the semantically equivalent join keys. We call them equivalent key group variables. In Figure (ii), there are three connected components and thus three equivalent key group variables𝑉_{1},…,𝑉_{3}. 𝑉_{1} represents 𝐴.𝑖𝑑 and 𝐵.𝐴𝑖𝑑 in this group since 𝑄 contains the join condition 𝐴.𝑖𝑑 = 𝐵.𝐴𝑖𝑑.

Next, we explain how to compute the cardinality of 𝑄 using a factor graph. Let’s first explain what a factor graph is. A factor graph is a bipartite graph with two types of nodes: variable nodes, and factor nodes representing an unnormalized probability distribution w.r.t. the variables connected to it. Figure (iii) shows the constructed factor graph *F* for computing the cardinality of 𝑄. Specifically, *F* contains a variable node for each equivalent key group variable 𝑉, and a factor node for each table 𝑖 touched by 𝑄. A factor node is connected to a variable node if the variable represents a key in the table. In this case, the factor node representing table 𝐴 is connected to 𝑉_{1} (equivalent to 𝐴.𝑖𝑑) and 𝑉_{2} (equivalent to 𝐴.𝑖𝑑_{2}). Each factor node maintains an unnormalized probability distribution for the variable nodes connected to it, e.g., node 𝐴 maintains 𝑃_{𝐴}(𝑉_{1},𝑉_{2}|(𝑄(𝐴))) ∗ |𝑄(𝐴)|, which is the same as the distribution 𝑃_{𝐴}(𝐴.𝑖𝑑, 𝐴.𝑖𝑑_{2}|(𝑄(𝐴))) ∗ |𝑄(𝐴)|. The factor graph model utilizes the graph structure to compute the sum using well-studied inference algorithms. We can more formally state this relationship between computing the join cardinalities and factor graphs as the following lemma, whose proof is provided in the original paper.

**Lemma1**: Given a join graph *G* representing a query 𝑄, there exists a factor graph *F* such that the variable nodes in *F* are the equivalent key group variables of *G* and each factor node represents a table 𝐼 touched by 𝑄. A factor node is connected to a variable node if and only if this variable represents a joining key in table 𝐼. The potential function of a factor node is defined as table 𝐼’s probability distribution of the connected variables (join keys) conditioned on the filter predicates 𝑄(𝐼). Then, calculating the cardinality of 𝑄 is equivalent to computing the partition function of* F*.

The computation of such probabilistic inference on this factor graph is very expensive. The complexity is O(N * |D|^{max(|JK|)}), where N is the number of equivalent key groups (3 in the above example), |D| is the largest domain size of all join keys, and 𝑚𝑎𝑥(|𝐽𝐾|) is the maximum number of join keys in a single table (2 in the above example). This complexity of conducting exact inference is not practical for real-world queries, as |D| can be millions, and 𝑚𝑎𝑥(|𝐽𝐾|) can be larger than 4 in real-world DB instances, such as IMDB.

Therefore, in the following, we will show how FactorJoin designs approximate probabilistic inference on factor graph to estimate the cardinality, which reduces the complexity to O(N * k^{2}) where N (less than 10) and k are typically small (hundreds).

The key idea behind our bound-based approximate probabilistic inference on factor graph is to bin the domain D of the join key into k bins where we can compute an upper bound of cardinality within each bin. This helps reduce the complexity |D| to k, and users can arbitrarily set the number of bins k for performance-latency trade-offs. For example, Equation (1) above can be approximated as Equation (2).

Now, we will illustrate the probabilistic bound the within each bin using a simple example query Q joining two tables. Specifically, assuming that value {𝑎, 𝑏, 𝑐, 𝑒, 𝑓 } of 𝐴.𝑖𝑑 and 𝐵.𝐴𝑖𝑑 are binned into 𝑏𝑖𝑛1 as shown in Figure 5. We know the summation of all values in bin_{1} equals to 8×6+4×5+3×5=83. This summation has a dominating term of 8 × 6, because the count of MFV of bin_{1} is 8 for 𝐴.𝑖𝑑 (denoted as 𝑉_{1}^{*}(𝐴.𝑖𝑑)) and 6 for𝐵.𝐴𝑖𝑑 (denoted as 𝑉_{2}^{*}(𝐵.𝐴𝑖𝑑)) so each value can appear at most 8 × 6 times in the denormalized table after the join. Since we know the total count of values in bin_{1} for 𝐴.𝑖𝑑 is 16, there can be at most 16/8 = 2 MFVs. Similarly, there can be at most 4 MFVs in bin_{1} for 𝐵.𝐴𝑖𝑑. Therefore, we have the summation of all values in bin_{1} is upper bounded by 𝑚𝑖𝑛(2, 4) × 8 × 6 = 96.

We formally represent the aforementioned procedure in Equation (3),where 𝑉_{i}^{∗}(𝐴.𝑖𝑑) and 𝑃_{𝐴}(𝐴.𝑖𝑑 ∈𝑏𝑖𝑛_{i}|𝑄(𝐴)) ∗ |𝑄(𝐴)| are the MFV count and estimated total count of 𝑏𝑖𝑛_{i} for𝐴.𝑖𝑑. Our bound is probabilistic because 𝑃_{𝐴}(𝐴.𝑖𝑑 ∈ 𝑏𝑖𝑛_{𝑖}|𝑄(𝐴)) ∗ |𝑄(𝐴)| is estimated with a single table CardEst method, which may have some errors.

We evaluate the performance of our CardEst framework on two well-established benchmarks: STATS-CEB and IMDB- JOB and compare with a wide range of representative baselines, including traditional methods (PostgreSQL, JoinHist, WJSample), learned methods (MSCN, BayesCard, DeepDB, FLAT), bound-based methods (PessEst, U-Block) and the optimal baseline with true cardinality (TrueCard). The overall performance plots and tables are given below.

To summarize, FactorJoin achieves the best performance among all the baselines on both benchmarks. Specifically, FactorJoin is as effective as the previous state-of-the-art learned methods, and simultaneously as efficient and practical as the traditional CardEst methods.

Our goal in SafeBound was to make the theoretical work on cardinality bounding practical without sacrificing the deterministic, provable nature of the bounds. So, to understand SafeBound, we need to start by introducing the Degree Sequence Bound, which SafeBound is based on, and the degree sequence statistics, which that is based on.

The degree sequence of a column in a database is the *sorted list of value frequencies*. As an example, consider the “Name” column below and the accompanying degree sequence on its right. The name “Eseah” is the most frequent value and appears 5 times, so there is a 5 at the start of the degree sequence. Carlos and Vivek both appear 3 times and so on.

This statistic has a few nice characteristics that we take advantage of in our work: 1) it fully captures the skew of join columns, 2) it is monotonic and decreasing, 3) we can upper bound it very efficiently with simple piecewise constant functions. We can see all of these on a degree sequence from the IMDB dataset below. Specifically, this is the ActorID column in the CastInfo table, where each row represents a single actor being cast in a single movie or TV show. The most-cast actor appeared ~10,000 times, while over half of the actors appeared just a single time. Further, the true degree sequence is 4 million data points because there are over 4 million unique actors, but the approximation shown in green is only 7 segments while largely capturing the shape of the curve.

Without going into the mathematical details, the Degree Sequence Bound can be thought of as a black box algorithm. It takes in 1) a simple join query (only joins, no selections) and 2) the degree sequence approximations of every join column, and it outputs a provable, deterministic bound on the size of the query’s output. Further, it does this in log-linear time with respect to the size of the approximations, so it’s very fast when the approximations are small (~20 segments in our experiments).

More details about this bound can be found in our theory paper here.

Given that we have an effective algorithm for bounding join sizes given degree sequences, the main obstacle to bounding real queries is handling selection predicates (e.g., LIKE, =, <,>, NOT NULL). These predicates filter a relation, and the resulting relation often has a drastically different set of degree sequences. For instance, if we looked at the CastInfo table from above, a common predicate might be to filter for a particular movie (“CastInfo.MovieId = 12345”). The filtered degree sequence for the MovieId column would have a single element whose frequency is equal to the number of actors in the film. The naive approach to handling this would be to scan the table every time a predicate appeared and calculate the new degree statistics. However, this would take linear time in the size of the data, which is generally considered unacceptable for query optimization, particularly in the case of distributed systems.

To tackle this problem, SafeBound adapts traditional summary statistics such as histograms, most-common value lists, and n-grams to the cardinality bounding setting. For starters, traditionally, you would store the number of rows in each bucket, value, or n-gram, but SafeBound stores the degree sequences of those rows. However, more subtle changes are required as well because simple operations like addition and multiplication don’t work well with degree sequences and/or result in true upper bounds. For instance, consider a range predicate that spans multiple buckets of a histogram. Without additional information, we need to do a pointwise addition of the degree sequences from each bucket. This assumes that the most (second most, third most, etc.) frequent value is the same across buckets and results in a very loose bound. To avoid this, we use overlapping, dyadic range intervals, as shown below.

Consider a range predicate Instead of simply adding the degree sequences of the [0,1], [1,2], [2,3], [3,4] buckets, we find the smallest bucket which fully encapsulates our range predicate and use the degree sequences stored there, resulting in much more accurate bounds.

We similarly adapt MCV lists and n-grams, as described in the paper.

The base of SafeBound is the degree sequence approximations, and it’s not immediately clear how to build these approximations while balancing speed, accuracy, and space constraints. To handle this, SafeBound uses a novel two-pass algorithm called ValidCompress which takes inspiration from the PGM learned index and, more distantly, convex hull algorithms.

At a high level, this algorithm aims to minimize the bound for the self-join query by limiting the error induced by each segment to a particular threshold. By choosing the self-join as our metric, we rely on the assumption that tables join with similarly skewed tables. In the first pass of the algorithm, it calculates the self-join bound based on the exact degree sequence, which happens to be equivalent to the second moment of the degree sequence. In the second pass, it constructs piecewise segments one at a time, starting a new sequence every time the squared error exceeds some percent of the exact bound.

While the full suite of experimental findings can be found in the paper, the most important findings are on the overall workload runtime. This directly measures the quality of the plans produced when using each cardinality estimation system. Specifically, we injected the estimates from each system into the query optimizer of Postgres and measured the total runtime of the workload relative to the runtime when Postgres is given the exact cardinalities for every sub-plan.

SafeBound achieves a roughly optimal overall runtime across all four benchmarks, while a traditional estimator like Postgres’ built-in estimator is up to 75% slower. Further, compared to ML methods (BayesCard & Neurocard), SafeBound has **3-500x** faster inference, **3-6.8x** smaller memory footprint, and **2-17x** faster construction time. Compared to PessEst, it has **100x** faster inference.

Kyle Deeds is a 4th year PhD student at the University of Washington advised by Dan Suciu and Magda Balazinska. His research interests include cardinality estimation, theoretical bounds on conjunctive queries, and array programming. Broadly, he wants to apply database optimization techniques to new computing paradigms.

Ziniu Wu is 2^{nd} year PhD student at MIT data system group, advised by Samuel Madden and Tim Kraska. His research focuses on cardinality estimation, query optimization, and new data mesh architecture.

^{1}Leis, Viktor, et al. “How good are query optimizers, really?.” *Proceedings of the VLDB Endowment* 9.3 (2015): 204-215.

^{2}Kipf, Andreas, et al. “Learned Cardinalities: Estimating Correlated Joins with Deep Learning.”; Dutt, Anshaman, et al. “Selectivity estimation for range predicates using lightweight models.”; Wu, Peizhi, and Gao Cong. “A unified deep model of learning from both data and queries for cardinality estimation.” *Proceedings of the 2021 International Conference on Management of Data*. 2021.

^{3}Yang, Zongheng, et al. “NeuroCard: one cardinality estimator for all tables.” *Proceedings of the VLDB Endowment* 14.1 (2020): 61-73; Zhu, Rong, et al. “FLAT: fast, lightweight and accurate method for cardinality estimation.” *arXiv preprint arXiv:2011.09022* (2020).