January 24, 2023
Uncertainty arises naturally in many application domains due to measurement errors, human error in data entry or transformation, missing data and bias in data collection, and many other reasons. When uncertainty is ignored during data preprocessing and analysis, this leads to hard-to-trace errors which can have severe real-world implications such as false incarcerations and medical misdiagnoses. The database community has worked extensively on data discovery, error detection, data repair, data wrangling, and integration techniques. These techniques are used extensively by data scientists to identify high-quality datasets that are relevant for a task, and to clean and integrate data in preparation for analysis. In spite of the significant progress made, in practice, uncertainty is often either ignored or dealt with using methods that do not model the uncertainty in their decision making process and, thus, produce results which cannot categorically be trusted.

In this blog post, I argue why data cleaning, integration, and related techniques naturally lead to uncertainty and what can be done to model and expose this uncertainty to data analysts to improve the reliability and trustworthiness of analysis results. In turn, this helps users to improve the quality of their data and insights. Furthermore, I explore some reasons why existing principled techniques for modeling and reasoning under uncertainty, i.e., incomplete and probabilistic databases, are not more widely applied in practice. Finally, I introduce the reader to our recent work on uncertainty-annotated databases which, instead of replacing existing techniques that make heuristic, but potentially meaningful, choices, enhance such techniques by augmenting their results with information about data uncertainty.

As a running example, let us consider a toy Covid infection rate dataset where attribute locale is a key. Suppose the data was extracted from multiple incomplete sources, leading to conflicting information and missing values. Specifically, we do not know the infection rate for Houston or the size of Sacramento, and we have conflicting infection rates for Los Angeles and conflicting size values for Austin.
locale | rate | size |
---|---|---|
Los Angeles | 3% | metro |
Los Angeles | 5% | metro |
Austin | 7% | city |
Austin | 7% | metro |
Houston | | metro |
Sacramento | 6% | |
Unless we address these quality issues, any analysis result based on this data is likely to be flawed. For instance, if we calculate the average infection rate per locale size, the uncertain rates and sizes of some rows will affect the analysis result.
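To make this concrete, below is a minimal Python sketch that evaluates the same aggregate query over two possible states of the table and gets two different results. The concrete values are hypothetical resolutions of the missing and conflicting entries, not part of the example data.

```python
from statistics import mean

# Two of the many possible states ("worlds") of the example table,
# using hypothetical resolutions of the missing and conflicting values.
world_a = [("Los Angeles", 3.0, "metro"), ("Austin", 7.0, "city"),
           ("Houston", 4.0, "metro"), ("Sacramento", 6.0, "city")]
world_b = [("Los Angeles", 5.0, "metro"), ("Austin", 7.0, "metro"),
           ("Houston", 9.0, "metro"), ("Sacramento", 6.0, "metro")]

def avg_rate_per_size(rows):
    """Average infection rate grouped by locale size."""
    sizes = {size for _, _, size in rows}
    return {s: round(mean(rate for _, rate, size in rows if size == s), 2)
            for s in sizes}

print(avg_rate_per_size(world_a))  # {'metro': 3.5, 'city': 6.5}  (key order may vary)
print(avg_rate_per_size(world_b))  # {'metro': 6.75}
```

Committing to any single resolution of the quality issues silently commits the analysis to one of these outcomes.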
As mentioned above, most data we are dealing with is uncertain to some extent. For example, some attribute values may be missing (we are uncertain about the values of some attributes of some rows), collected data may be subject to selection bias (we are missing data about a sub-population and are uncertain about the precise nature of the missing data), data may violate integrity constraints (e.g., we have multiple rows with the same primary key value and are uncertain which one is correct), or when integrating data from multiple sources we may not know the precise relationships between the schemata and instances of these sources (e.g., we may not know whether a title attribute in one source refers to the title of a person or the title of a movie). Most existing data cleaning and integration techniques address such uncertainty by resolving it and returning a single deterministic result. For instance, a common technique for imputing missing values is to train a classifier over the data to predict the values of attributes that are missing from a row using the remaining values of the row as input. Another example is constraint-based data repair where the dataset is replaced with a repaired version that fulfills the constraints, e.g., we may fix primary key violations by selecting a single tuple from each set of tuples with the same key value. Consider the possible “fixed” version of our example dataset shown below. We generated this dataset by picking one Los Angeles and one Austin tuple to fix the key violations, imputing the missing infection rate for Houston with the average rate of 5.33%, and picking the most frequent size for missing size values.
locale | rate | size |
---|---|---|
Los Angeles | 3% | metro |
Austin | 7% | city |
Houston | 5.33% | metro |
Sacramento | 6% | metro |
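For illustration, here is a minimal Python sketch of such a simplistic selected-guess repair; it is not the algorithm of any particular cleaning system, just the heuristic described above: keep one tuple per key, impute the missing rate with the mean of the remaining rates, and impute the missing size with the most frequent size.

```python
from statistics import mean, mode

# The raw data with key violations (duplicate locales) and missing values (None).
dirty = [("Los Angeles", 3.0, "metro"), ("Los Angeles", 5.0, "metro"),
         ("Austin", 7.0, "city"), ("Austin", 7.0, "metro"),
         ("Houston", None, "metro"), ("Sacramento", 6.0, None)]

# Step 1: resolve key violations by keeping the first tuple per locale.
repaired = {}
for locale, rate, size in dirty:
    repaired.setdefault(locale, (rate, size))

# Step 2: impute missing rates with the mean and missing sizes with the mode.
rates = [r for r, _ in repaired.values() if r is not None]
sizes = [s for _, s in repaired.values() if s is not None]
fixed = {loc: (r if r is not None else round(mean(rates), 2),
               s if s is not None else mode(sizes))
         for loc, (r, s) in repaired.items()}

print(fixed)
# {'Los Angeles': (3.0, 'metro'), 'Austin': (7.0, 'city'),
#  'Houston': (5.33, 'metro'), 'Sacramento': (6.0, 'metro')}
```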
Of course, actual constraint-based repair and missing value imputation techniques are not that simplistic. Nonetheless, a fundamental problem with this approach is that in almost all cases there is insufficient information available to decide with certainty which possible repair for a data quality issue is correct. Thus, even if we develop increasingly sophisticated algorithms and throw more computational resources at the problem, we will have to live with the fact that the repaired / integrated datasets produced by our algorithms cannot categorically be trusted to be correct. To clarify, I am not claiming that the techniques our community has developed have no merit and should be avoided, but rather that any algorithm is bound to sometimes make a bad choice, because we have insufficient information available for determining the correct ground truth. The problem is fundamentally not the use of heuristics for resolving such issues, but that the output of such techniques is treated as ground truth. That is, the output lacks any information about “what could have been” and, thus, in the end we do not know how data quality and integration issues may have impacted the result of our analysis.
A natural way to model uncertainty in the input data, as well as in the right choice of how to address a data quality or integration issue, is to replace a deterministic database with an incomplete database [6][1][10]. An incomplete database is a set of deterministic databases called possible worlds, each representing one possible state of the real world. If we are aware of the likelihood that a possible world represents the unknown true state of the real world, then we can express this knowledge as a probability distribution over the possible worlds, i.e., we get a probabilistic database [15][12]. Queries (or other computations) over these models can be evaluated using the so-called possible worlds semantics where a query is evaluated over each possible world using deterministic query evaluation semantics to produce all possible query answer relations (and their probabilities in the case of probabilistic databases). Alternative query evaluation semantics include the certain answer semantics, which returns all rows that are guaranteed to be in the query result (they are returned in every possible world), and the possible answer semantics, which returns all rows that occur in at least one world.
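To illustrate these semantics, the following Python sketch brute-forces them over two hypothetical worlds that resolve the conflicting Los Angeles rate (the other quality issues are omitted for brevity): a query is evaluated in each world, and the certain and possible answers are derived from the per-world results.

```python
# Hypothetical worlds resolving the conflicting Los Angeles rate.
worlds = [
    [("Los Angeles", 3.0), ("Austin", 7.0), ("Sacramento", 6.0)],
    [("Los Angeles", 5.0), ("Austin", 7.0), ("Sacramento", 6.0)],
]

def query(world):
    """Deterministic query: locales with an infection rate above 4%."""
    return {locale for locale, rate in world if rate > 4.0}

answers = [query(w) for w in worlds]   # one deterministic answer per world
certain = set.intersection(*answers)   # rows returned in every world
possible = set.union(*answers)         # rows returned in at least one world

print(certain)   # {'Austin', 'Sacramento'}
print(possible)  # {'Austin', 'Sacramento', 'Los Angeles'}
```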
In general, the possible worlds model is too verbose to be of practical use, e.g., even for just n binary choices, there are 2^n possible worlds. More compact models have been proposed in the literature, including tuple-independent and block-independent databases [15], X-DBs [2], V-tables [11], c-tables [10], M-tables [13], and many more. Viewing both the uncertainty in the data and the uncertainty in the decision processes of cleaning, curation, and integration approaches through the lens of incomplete and probabilistic databases will enable us to understand how these approaches deal with uncertainty.
Uncertainty? What uncertainty?
Data cleaning and integration approaches which make heuristic choices as discussed above effectively select and return one world among all the possible worlds encoding all possible ways of cleaning or integrating the data. That is, the output of such algorithms is a deterministic database. We refer to this approach as selected-guess query processing. Its advantage is that any query over the result it produces can be evaluated using any standard database system. This advantage in terms of performance and expressiveness of operations over the cleaned and integrated dataset, however, comes at the cost of losing all information about uncertainty in the input data and in the choices made by the algorithm.
Sorry, but 99.99% certain is not certain!
An alternative approach that has been explored in the context of repairing integrity constraint violations is consistent query answers (CQA) [3]. Given a set of integrity constraints Σ, a database D, and a query Q, CQA computes the set of certain answers over an incomplete database whose possible worlds are all possible repairs of D. A repair is a database D′ that can be derived from D and that fulfills all constraints in Σ¹. Note that this is also the approach taken in virtual data integration and data exchange, where only certain answers to queries are returned. The advantage of CQA and certain answers in general is that we can fully trust every query result, because all query answers are known to be 100% certain. However, CQA has high computational complexity and is often too conservative, because any possible answer that is even just slightly uncertain will be ignored. While efficient under-approximations exist (e.g., [9]), they are limited to restricted classes of queries and constraints and even further exacerbate the problem of omitting potentially useful, but uncertain, answers. Another disadvantage is that certain answers are not closed. That is, it is not possible to evaluate queries under certain answer semantics over the certain answers to a query.
¹ A repair semantics is assumed that governs which changes to D are considered viable, e.g., all consistent databases that can be derived from D by deleting or inserting a minimal number of tuples.
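For intuition, here is a brute-force Python sketch of CQA under key-repair semantics, restricted to the rate attribute of our example: the two conflicting Los Angeles tuples give rise to two repairs, and because Los Angeles' rate differs between them, CQA drops Los Angeles from the answer entirely.

```python
# The two repairs of the key violation on Los Angeles (rate attribute only).
repairs = [
    {"Los Angeles": 3.0, "Austin": 7.0, "Sacramento": 6.0},  # keep the 3% tuple
    {"Los Angeles": 5.0, "Austin": 7.0, "Sacramento": 6.0},  # keep the 5% tuple
]

def rate_query(db):
    """Query returning every (locale, rate) pair."""
    return {(locale, rate) for locale, rate in db.items()}

# Consistent answers = answers returned by every repair.
consistent = set.intersection(*(rate_query(r) for r in repairs))
print(consistent)
# {('Austin', 7.0), ('Sacramento', 6.0)} -- Los Angeles is dropped entirely,
# even though its rate is certainly between 3% and 5%.
```

This is precisely the conservativeness mentioned above: the user never learns that Los Angeles' rate is at least 3% and at most 5%.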
Here is everything that could be true, deal with it!
At the other extreme, it is possible to use probabilistic query processing to return all possible answer tuples, each with their marginal probability of existing in the answer, or to compute a compact encoding of all possible worlds (with or without probabilities) using, e.g., c-tables. For example, [4] represents the possible outputs of entity resolution as a probabilistic database. The advantage of this approach is that all information about the uncertainty in the input data and the uncertainty in the choices made by the cleaning or integration algorithm is retained. However, this comes at the cost of even higher computational complexity (probabilistic query processing is #P-hard, even for conjunctive queries). Furthermore, the set of possible answers can be huge or even infinite, overwhelming the user with too much information. While some models such as M-tables [13] can compactly encode a large number of options, this typically comes at the cost of interpretability.
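Continuing the brute-force sketch from above (Python; the probabilities of the two worlds are made up for illustration), the marginal probability of an answer tuple is simply the total probability of the worlds that return it. Real probabilistic databases avoid this exponential enumeration, which is exactly where the #P-hardness comes from.

```python
from collections import defaultdict

# Hypothetical probabilities for the two Los Angeles resolutions.
worlds = [
    (0.6, [("Los Angeles", 3.0), ("Austin", 7.0), ("Sacramento", 6.0)]),
    (0.4, [("Los Angeles", 5.0), ("Austin", 7.0), ("Sacramento", 6.0)]),
]

def query(world):
    """Locales with an infection rate above 4%."""
    return {locale for locale, rate in world if rate > 4.0}

# Marginal probability of an answer = total probability of worlds returning it.
marginals = defaultdict(float)
for prob, world in worlds:
    for answer in query(world):
        marginals[answer] += prob

print(dict(marginals))
# {'Austin': 1.0, 'Sacramento': 1.0, 'Los Angeles': 0.4}
```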
Given the disadvantages of all three approaches discussed above, a pressing question is whether it is possible to imagine an approach whose efficiency can compete with the selected-guess approach, but which compactly encodes useful information about certain and possible answers. In recent work [7][8], a collaboration between IIT’s database group and the Odin Lab at the University at Buffalo, we have answered this question affirmatively. Specifically, we have developed a lightweight uncertain data model called Uncertainty-annotated databases (or UA-DBs for short). A UA-DB consists of one distinguished world (called the selected-guess world) which is annotated with information about what facts are known with certainty and what facts could be true (are possible). The selected-guess world can be the most likely world of a probabilistic database (if we have probabilities and this world can be computed efficiently) or can represent the heuristic choice made by a data cleaning or integration algorithm. That is, UA-DBs are backward compatible with existing data curation algorithms. Additionally, however, they encode an under-approximation of all certain facts (facts that are true in all worlds) and a compact over-approximation of all possible facts (facts that are true in some world). The reason for using approximations is that this allows us to circumvent the inherent computational complexity of computing certain and possible answers.

These approximations are expressed using two types of uncertainty: (i) attribute-level uncertainty is expressed using triples [ c↓, c_sg, c↑ ] as the value domain for tables, where c↓ is a lower bound on the possible values of a tuple’s attribute, c_sg is the value the attribute has in the selected-guess world, and c↑ is an upper bound on the attribute’s values; (ii) each row is associated with a triple of multiplicities [ n↓, n_sg, n↑ ] encoding lower and upper bounds on the number of times a tuple with values within the tuple’s attribute bounds appears in any world, as well as the number of times the tuple appears in the selected-guess world.

Given an incomplete database such as the one of our running example, we can construct a UA-DB that bounds this database, i.e., it under-approximates the certain facts in the database, over-approximates the possible facts, and encodes the selected-guess world. For instance, the UA-DB shown below bounds our running example uncertain database (assuming that infection rates are between 1% and 10%), using the output of the data repair shown above as the selected-guess world. All four tuples certainly exist (their multiplicity lower and upper bounds are 1). Consider the tuple for Austin: its infection rate is certainly 7%, while the size of Austin could be either city or metro (and is city in the selected-guess world).
locale | rate | size | ℕ³ |
---|---|---|---|
Los Angeles | [3%,3%,5%] | metro | (1,1,1) |
Austin | 7% | [city,city,metro] | (1,1,1) |
Houston | [1%,5.33%,10%] | metro | (1,1,1) |
Sacramento | 6% | [town,metro,metro] | (1,1,1) |
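To make the encoding concrete, here is a minimal Python sketch of a range-annotated tuple as shown in the table above. The class and field names are illustrative, not taken from our implementation, and the bounds on the size attribute assume the domain order town < city < metro.

```python
from dataclasses import dataclass
from typing import Any, Tuple

@dataclass
class RangeValue:
    """An attribute value as a triple: lower bound, selected-guess value, upper bound."""
    lb: Any   # c↓  : lower bound across all possible worlds
    sg: Any   # c_sg: value in the selected-guess world
    ub: Any   # c↑  : upper bound across all possible worlds

    def is_certain(self) -> bool:
        return self.lb == self.ub

@dataclass
class UARow:
    values: Tuple[RangeValue, ...]
    mult: Tuple[int, int, int]   # (n↓, n_sg, n↑) row multiplicities

# The Austin tuple from the table above.
austin = UARow(
    values=(RangeValue("Austin", "Austin", "Austin"),
            RangeValue(0.07, 0.07, 0.07),            # rate is certain
            RangeValue("city", "city", "metro")),    # size is uncertain
    mult=(1, 1, 1),                                  # the row certainly exists once
)

print([v.is_certain() for v in austin.values])  # [True, True, False]
```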
In addition to enriching a selected-guess world, a major advantage of UA-DBs is their PTIME data complexity and that they are closed under a large class of queries including full relational algebra with aggregation, windowed aggregation, and sorting and top-k queries. Queries over UA-DBs are bound-preserving, i.e., given an input that bounds an incomplete database, the result of any such query bounds all possible query results. We have implemented an earlier version of the UA-DB model, where attribute values are marked as certain or uncertain (instead of the more precise attribute ranges), in our Vizier notebook system [5][14]. Vizier implements so-called Lenses [16], which are standard data cleaning and integration algorithms that have been enriched to process and return UA-DBs. For instance, a missing value imputation lens applies a standard imputation algorithm to produce its selected-guess result and marks imputed attribute values as uncertain. Queries in Vizier are evaluated using UA-DB semantics. Query results are shown using visual representations of uncertainty that are natively supported by the system. For example, a plot created over an uncertain query result will contain markers to indicate which results are uncertain, while cells in a tabular result that are uncertain are highlighted to distinguish them from certain results.
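As a rough illustration of the lens idea (a sketch, not Vizier's actual implementation), a missing value imputation lens could look as follows: it computes a selected-guess value with a standard heuristic (here, mean imputation) and flags imputed cells as uncertain so that this annotation can be propagated through subsequent queries.

```python
from statistics import mean

def imputation_lens(rows):
    """rows: list of (locale, rate) with None for missing rates.
    Returns (locale, selected-guess rate, is_uncertain) triples."""
    known = [rate for _, rate in rows if rate is not None]
    guess = round(mean(known), 2)        # the standard imputation heuristic
    return [(locale, rate if rate is not None else guess, rate is None)
            for locale, rate in rows]

result = imputation_lens([("Austin", 7.0), ("Houston", None), ("Sacramento", 6.0)])
print(result)
# [('Austin', 7.0, False), ('Houston', 6.5, True), ('Sacramento', 6.0, False)]
```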
In summary, in this blog post I have explored how uncertainty unavoidably shows up in most data cleaning and integration tasks, discussed why uncertainty needs to be modeled to assess the trustworthiness of analysis results, and presented UA-DBs as a possible “best of both worlds” solution that enriches existing heuristic cleaning and integration operations with an approximate representation of uncertainty that can be computed efficiently for expressive queries. We are actively working on further extending UA-DBs to support even more expressive queries (e.g., recursive or iterative computation), to further improve the performance of queries and computations over UA-DBs, and to further extend the expressiveness and approximation guarantees of the model.
I am grateful to my awesome Vizier collaborators from the University at Buffalo, NYU, and, last but not least, our team at Breadcrumb Analytics. Special thanks go to my Ph.D. student Su, and to Oliver and his student Aaron, my main UA-DB partners in crime.
Boris Glavic is an Associate Professor in the Department of Computer Science at Illinois Institute of Technology leading the IIT DBGroup. His research spans several areas of database systems and data science including data provenance, data integration, query execution and optimization, uncertain data, and data curation. Boris strives to build systems that are based on solid theoretical foundations.
[1] Abiteboul, S., Kanellakis, P. and Grahne, G. 1991. On the representation and querying of sets of possible worlds. Theoretical computer science. 78, 1 (1991), 159–187.
[2] Benjelloun, O., Sarma, A.D., Halevy, A.Y. and Widom, J. 2006. ULDBs: Databases with Uncertainty and Lineage. Proceedings of the 32nd international conference on very large data bases (vldb) (2006), 953–964.
[3] Bertossi, L. 2011. Database repairing and consistent query answering. Synthesis lectures on data management. 3, 5 (2011), 1–121.
[4] Beskales, G., Soliman, M.A., Ilyas, I.F., Ben-David, S. and Kim, Y. 2010. Probclean: A probabilistic duplicate detection system. Proceedings of the 26th international conference on data engineering, ICDE 2010, march 1-6, 2010, long beach, california, USA (2010), 1193–1196.
[5] Brachmann, M., Bautista, C., Castelo, S., Feng, S., Freire, J., Glavic, B., Kennedy, O., Müller, H., Rampin, R., Spoth, W. and Yang, Y. 2019. Data debugging and exploration with vizier. Proceedings of the 44th international conference on management of data (demonstration track) (2019).
[6] Console, M., Guagliardo, P., Libkin, L. and Toussaint, E. 2020. Coping with incomplete data: Recent advances. Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI symposium on principles of database systems, PODS 2020, portland, or, usa, june 14-19, 2020 (2020), 33–47.
[7] Feng, S., Huber, A., Glavic, B. and Kennedy, O. 2021. Efficient uncertainty tracking for complex queries with attribute-level bounds. Proceedings of the 46th international conference on management of data (2021), 528–540.
[8] Feng, S., Huber, A., Glavic, B. and Kennedy, O. 2019. Uncertainty annotated databases – a lightweight approach for approximating certain answers. Proceedings of the 44th international conference on management of data (2019), 1313–1330.
[9] Geerts, F., Pijcke, F. and Wijsen, J. 2017. First-order under-approximations of consistent query answers. International journal of approximate reasoning. 83, (2017), 337–355.
[10] Imieliński, T. and Lipski Jr, W. 1984. Incomplete Information in Relational Databases. Journal of the ACM (JACM). 31, 4 (1984), 761–791.
[11] Lipski Jr, W. 1984. On relational algebra with marked nulls. Proceedings of the 3rd acm sigact-sigmod symposium on principles of database systems (1984), 201–203.
[12] Suciu, D., Olteanu, D., Ré, C. and Koch, C. 2011. Probabilistic databases. Synthesis lectures on data management. 3, 2 (2011), 1–180.
[13] Sundarmurthy, B., Koutris, P., Lang, W., Naughton, J. and Tannen, V. 2017. M-tables: Representing missing data. Lipics-leibniz international proceedings in informatics (2017).
[14] Team, V. 2022. Vizier Webpage. https://vizierdb.info/.
[15] Van den Broeck, G. and Suciu, D. 2017. Query processing on probabilistic data: A survey. (2017).
[16] Yang, Y., Meneghetti, N., Fehling, R., Liu, Z.H. and Kennedy, O. 2015. Lenses: an on-demand approach to etl. Proceedings of the vldb endowment. 8, 12 (2015), 1578–1589.