We built a tool, EpiPolicy, to help policy-makers better plan interventions to combat epidemics [13]. It was an eye-opening experience: through collaborations and interviews with teams of epidemiologists, public health officials, and economists, we came to understand some of the complexities of decision-making on a momentous scale. Decisions and policies made by these teams can seriously impact the lives of millions of individuals physically, socially, and economically. What we found is that technological inconveniences can frustrate the important work of these teams. We summarize many of these issues in our 2021 paper [13], such as poor support for collaboration, the lack of data and model management systems, and the difficulty of maintaining and modifying custom-coded disease models. As database systems researchers, we can bring a lot of our expertise to bear on real-world decision- and policy-making processes. Moreover, we need to expand our research horizon to support the emerging prescriptive analytical demands of decision-makers. Prescriptive analytics tells us what actions to take given the data. It contrasts with descriptive and predictive analytics, areas our community has focused on extensively, which describe, summarize, cluster, or classify data, or extrapolate current data into the future.
Any college student who takes a database systems course learns two main ideas that are simple, yet extremely powerful. First, they learn about the power of data independence: by unburdening programmers of the need to specify low-level data access path details, we separate the concerns of how to store and structure data from how to design the applications that access or update this data. This allows logical structure and physical storage updates to occur without the need to rewrite application code that sits on top of a database management system. Second, they learn about the power of a declarative query language: by letting application programmers express what data they wish to retrieve rather than how to retrieve it, we empower a query optimizer to find an optimal query execution plan, one that works regardless of the underlying relational database system, hardware configuration, or data scale. These principles do not relate only to the design of relational database systems; as Edgar F. Codd points out in his 1982 Turing Award lecture, “Relational Database: A Practical Foundation for Productivity”, they relate more generally to productivity. I will explain how these two principles directly influenced the design of EpiPolicy and the productivity of its end-users.
Combating epidemics entails constructing a disease model, modeling the mobility patterns of a population, understanding the costs and benefits of each intervention, and designing a strategy for interventions. The disease model describes how individuals in a population move through different states of a disease, such as being susceptible to a certain pathogen, exposed to it, infected, recovered, or immune. In general, there are many ways to implement disease models that simulate how a population is affected by a disease. In EpiPolicy, we focus on deterministic, compartmental models that are described by a system of ordinary differential equations. The precise details of how these models simulate disease spread are not immediately relevant to our discussion here, but for the interested reader, [11] provides a simple primer. The mobility patterns describe how different groups within a population interact in disease-spreading ways within facilities and across geographic regions. Finally, the disease-mitigating interventions modify parameters of the disease model to control how a population transfers from one disease compartment to another, or modify mobility patterns. For example, a mask-wearing mandate can reduce the infection rate, hence controlling how much of the population moves from a susceptible state to an infected one. Vaccinations can move susceptible individuals to ‘immune’ compartments, and border closures can limit the movement of infected individuals across geographic borders, reducing disease exposure in certain locales.
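To make the compartmental-model idea concrete, here is a minimal sketch of a deterministic SIR (susceptible-infected-recovered) model, with a mask mandate modeled as a simple scaling of the transmission rate. This is not EpiPolicy's implementation; the parameter values and the assumed 40% mask effect are invented for illustration.

# A minimal SIR compartmental model: a sketch, not EpiPolicy's implementation.
# beta (transmission rate), gamma (recovery rate), and mask_effect are
# illustrative values, not numbers from the paper.
import numpy as np
from scipy.integrate import odeint

def sir(y, t, beta, gamma):
    S, I, R = y
    N = S + I + R
    dS = -beta * S * I / N              # susceptible -> infected
    dI = beta * S * I / N - gamma * I   # infected -> recovered
    dR = gamma * I
    return [dS, dI, dR]

N = 1_000_000
y0 = [N - 10, 10, 0]                    # start with 10 infections
t = np.linspace(0, 180, 181)            # simulate 180 days

beta, gamma = 0.3, 0.1
mask_effect = 0.4                       # assumed: a mask mandate cuts transmission by 40%

baseline = odeint(sir, y0, t, args=(beta, gamma))
with_masks = odeint(sir, y0, t, args=(beta * (1 - mask_effect), gamma))

print("peak infections, baseline:", int(baseline[:, 1].max()))
print("peak infections, masks   :", int(with_masks[:, 1].max()))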
In our discussions with public health teams, we found that to analyze the effects of a certain set of interventions, they often engaged in a laborious and time-consuming process of writing custom code, or rewriting existing code, to include novel interventions or modify existing ones. In many cases, a change in the underlying disease model was also required to support certain interventions. Worse, once a set of interventions was implemented, it was difficult to consider a wide range of alternative schedules — when and where certain interventions are applied — as the code often hard-coded a specific schedule. Putting on our database systems engineering hats, we saw an opportunity here for improvement.
First, we separated the implementations of disease models and mobility patterns, of interventions, and of intervention application schedules. Creating independence between these components enhanced team collaboration: different experts can work on specifying several disease models and surfacing their core parameters; mobility patterns can be independently coded or selected from a library of known patterns; and with basic programming, sophisticated interventions can be specified using a clean API that only requires each intervention to be described with two functions: effect and cost. Now, changes in one component do not require significant code rewrites, boosting the efficiency with which a team can explore multiple different disease scenarios and policies and can engage in what-if analysis: What if the disease were more transmissible? What if we vaccinate certain population groups first? What if we extend mask-wearing mandates? Moreover, as more data becomes available on a certain disease, the independence of the disease model not only from the other components but also from its own parameters allows end-users to immediately update those parameters without digging through thousands of lines of code to find the correct one to update.
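As a concrete illustration of this two-function interface, a hypothetical mask-mandate intervention might be specified as follows. The sim handle, the method names scale_parameter and population_size, and all parameter values are assumptions for illustration, not EpiPolicy's actual API.

# A hypothetical intervention written against an effect/cost style of API.
# The `sim` handle, its methods, and the parameter names are assumptions
# for illustration; they are not EpiPolicy's actual interface.
def mask_mandate_effect(sim, compliance=0.7, reduction=0.5):
    """Scale the transmission rate down while the mandate is active."""
    sim.scale_parameter("transmission_rate", 1 - compliance * reduction)

def mask_mandate_cost(sim, compliance=0.7, cost_per_person=0.05):
    """Daily cost of supplying masks to the compliant population."""
    return compliance * cost_per_person * sim.population_size()

Because the rest of the system only ever calls effect and cost, the same intervention can be rescheduled, re-parameterized, or dropped without touching the disease model or the mobility code.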
Second, we created a declarative interface to EpiPolicy, where each intervention is described declaratively in terms of its effects and costs but not in terms of when, where, or to what degree it is applied (e.g., the vaccination rate of a vaccination campaign or the extent of workplace closures). This empowered EpiPolicy to utilize a reinforcement learning optimizer to search for a schedule of interventions that minimizes the overall disease burden (e.g., infections, hospitalizations, deaths) and economic costs.
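EpiPolicy's schedule search is driven by a reinforcement learning optimizer; the toy random search below is only a conceptual stand-in that shows what "searching over schedules" means. The invented toy_simulate() replaces the real simulator, and its arithmetic is purely illustrative.

# A toy search over intervention schedules. EpiPolicy uses a reinforcement
# learning optimizer; this random search is only a conceptual stand-in and
# toy_simulate() is an invented placeholder for the real simulator.
import random

INTERVENTIONS = ["masks", "school_closure", "vaccination"]
WEEKS = 26

def toy_simulate(schedule):
    # Purely illustrative arithmetic: more weeks of intervention means
    # less burden but more cost.
    active = {i: sum(schedule[i]) for i in INTERVENTIONS}
    burden = max(0.0, 1000 * (1 - 0.02 * active["masks"]
                                - 0.01 * active["school_closure"]
                                - 0.03 * active["vaccination"]))
    cost = (5 * active["masks"] + 40 * active["school_closure"]
            + 20 * active["vaccination"])
    return burden, cost

def objective(schedule):
    burden, cost = toy_simulate(schedule)
    return burden + cost

def random_schedule():
    # Toggle each intervention on or off independently for every week.
    return {i: [random.random() < 0.5 for _ in range(WEEKS)] for i in INTERVENTIONS}

best = min((random_schedule() for _ in range(2000)), key=objective)
print(objective(best), {i: sum(best[i]) for i in INTERVENTIONS})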
The coronavirus pandemic put a spotlight on the challenges of data-driven decision-making and policy-making worldwide. As data on infections, hospitalizations, deaths, mobility, etc. poured in, and predictive models painted a multitude of possible future worlds, some bleak and some unrealistically cheery, decision-makers were left with pressing “what now?” questions that were at times difficult to connect to the available data: Which facilities should we re-purpose for field hospitals or vaccine distribution centers? Should we build new ones? How to allocate limited vaccines within cities, elderly homes and care centers, or prisons? When to enforce or ease up different interventions such as shelter-in-place, school closures, mask mandates, etc.? And even now, as the pandemic is starting to wane, the decision-making woes remain: How to seat fans in a stadium to maintain social distancing? How to schedule employees in hybrid work settings? Which stocks to invest in, given global inflation, food shortages, …, war?
For many of these problems, the biggest gap in terms of technological support is not with understanding the data, deriving insight, or building predictive models; rather, it is with using the data to meaningfully inform decisions and policies. There are many reasons why we, as a community, are best positioned to take on the work of building systems for prescriptive analytics, beyond it being a natural next step after the descriptive and predictive analytics support we are tirelessly working to finesse [10]. First, as I illustrated above, we know a few design principles that can improve how decision-making teams work. Second, traditional solutions to prescriptive analytics are typically application-specific, complex, and do not generalize: the usual workflow requires slow, cumbersome, and error-prone data movement between a database and predictive-modeling and optimization software. Integrating decision-making support into the database system, where the data is already managed, just makes sense. Finally, we understand how to design systems that can scale, and as we collect and produce more and more data, scalability is key.
In our work, we began chipping away at the goal of supporting in-database prescriptive analytics¹ by focusing on constrained optimization problems. Many decision-making problems naturally take this form. Consider an investor with access to a database of current stock prices and predictions of their future prices (assumed, for the moment, to be known with perfect accuracy), with gain estimated as the difference between these two prices. Which stocks should she buy given her principal budget — a constraint — and her desire to make as much profit as possible — an optimization objective? Starting in 2014, we have been building scalable systems to support these forms of constrained optimization problems [5, 6, 8, 4, 3, 2, 7, 9]. Using an intuitive SQL-like language, the Package Query Language (PaQL), the investor can simply express her problem as the package query in Listing 1. We created a divide-and-conquer algorithm, SketchRefine, that approximately solves such integer linear programs (ILPs) over millions of variables in seconds — a feat that state-of-the-art solvers fail at.
¹ We are not alone in this endeavor; other researchers have also examined how to directly integrate solvers into database management systems [12].
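For readers who want to see the optimization problem behind Listing 1 (which is not reproduced here), the following sketch writes the investor's package query directly as a 0/1 integer program using the PuLP library. The stock table and budget are made up; PaQL's point is precisely that it hides this formulation work from the user.

# The investor's package query as a plain 0/1 ILP, solved with PuLP.
# This is a sketch of the underlying optimization problem, not PaQL itself;
# the stock data and the budget are invented for illustration.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, value

stocks = [  # (ticker, current price, predicted future price)
    ("A", 10.0, 14.0),
    ("B", 25.0, 27.0),
    ("C", 40.0, 55.0),
    ("D", 15.0, 13.0),
]
budget = 60.0

prob = LpProblem("package_query", LpMaximize)
buy = {t: LpVariable(f"buy_{t}", cat="Binary") for t, _, _ in stocks}

# Objective: maximize total gain (future price minus current price).
prob += lpSum(buy[t] * (future - price) for t, price, future in stocks)
# Constraint: total spend stays within the principal budget.
prob += lpSum(buy[t] * price for t, price, _ in stocks) <= budget

prob.solve()
print([t for t in buy if value(buy[t]) == 1])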
Briefly, to suggest which stocks our investor should buy, SketchRefine first partitions the stocks into groups, where the stocks in each group have similar features, and then it computes a representative stock for each group. A small ILP using only the representatives can then be solved easily — the “sketch”. The sketch is then iteratively “refined” by carefully replacing each representative with the actual stocks that it represents. The process maintains the feasibility of the current solution until the final package is obtained. Limiting each group to a small number of stock tuples ensures that each refine phase can be executed efficiently and allows for approximation guarantees. What is worth pointing out here is that ideas that are innate to database systems researchers, like pre-partitioning and sketching, can lead to orders-of-magnitude performance improvements to the decades-old ILP problem when applied to massive data that yield a proportional number of decision variables.
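The following is a drastically simplified sketch of the two SketchRefine phases on synthetic data, again using PuLP. The partitioning rule, the group representatives, and the refinement order are all simplified, and the real algorithm's failure handling and approximation guarantees are omitted.

# A drastically simplified sketch of the SketchRefine idea, using PuLP.
# The data, the bucketing rule, and the group sizes are invented; the real
# algorithm handles refine failures, ordering, and guarantees that this omits.
import random
from collections import defaultdict
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, value, LpInteger

random.seed(0)
stocks = [(f"s{i}", random.uniform(5, 50), random.uniform(-5, 20))
          for i in range(1000)]                 # (id, price, gain)
budget = 200.0

# Partition: group stocks with similar prices; one representative per group.
groups = defaultdict(list)
for s in stocks:
    groups[int(s[1] // 10)].append(s)           # crude bucketing on price
reps = {g: (sum(s[1] for s in ss) / len(ss),    # average price
            sum(s[2] for s in ss) / len(ss))    # average gain
        for g, ss in groups.items()}

# Sketch: a tiny ILP over representatives only.
sketch = LpProblem("sketch", LpMaximize)
count = {g: LpVariable(f"n_{g}", lowBound=0, upBound=len(groups[g]), cat=LpInteger)
         for g in groups}
sketch += lpSum(count[g] * reps[g][1] for g in groups)
sketch += lpSum(count[g] * reps[g][0] for g in groups) <= budget
sketch.solve()

# Refine: replace each group's representatives with actual stocks, one group
# at a time, keeping the rest of the current (feasible) solution fixed.
chosen = []                                          # refined tuples picked so far
pending = {g: int(value(count[g])) for g in groups}  # counts still represented
for g in groups:
    spent = sum(p for _, p, _ in chosen)
    rep_spend = sum(pending[h] * reps[h][0] for h in pending if h != g)
    sub = LpProblem(f"refine_{g}", LpMaximize)
    pick = {s[0]: LpVariable(f"x_{s[0]}", cat="Binary") for s in groups[g]}
    sub += lpSum(pick[s[0]] * s[2] for s in groups[g])
    sub += lpSum(pick[s[0]] * s[1] for s in groups[g]) <= budget - spent - rep_spend
    sub.solve()
    chosen += [s for s in groups[g] if value(pick[s[0]]) == 1]
    pending[g] = 0                                   # group g is now fully refined

print(len(chosen), sum(s[2] for s in chosen))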
Of course, future prices are uncertain in real life, so we extended PaQL to stochastic PaQL (sPaQL). sPaQL supports the expression of stochastic constrained optimization problems. Our investor can now describe probabilistic constraints and expectations over the optimization objective that better reflect the uncertain nature of the predicted stock prices, as seen in Listing 2. Again, we created a scalable mechanism, SummarySearch, to execute this query, which outperforms state-of-the-art solvers.
The gist of SummarySearch is to first create multiple Monte Carlo samples, or scenarios, from a probabilistic database of stocks. This is done by sampling from the predictive models that describe the future stock prices. We can then replace the expectation objective to maximize gain with an empirical average over the set of randomly generated scenarios, and we can replace probabilistic constraints — e.g., a 90% chance that loss does not exceed a certain amount — with a requirement that the inequality holds over 90% of the generated scenarios. Given this reformulation of a stochastic linear program into a deterministic ILP, SummarySearch employs a clever trick to reduce the size of the ILP it needs to solve while still ensuring a feasible solution: it constructs a very small summary of the scenarios, solves the ILP over this summary rather than the full set of scenarios, and validates the solution over a larger set of generated scenarios. For the exact details of how and why this strategy works, we refer you to our 2020 SIGMOD paper [8]. Yet again, ideas like sampling and summarization that are natural to database systems researchers can lead to orders-of-magnitude performance improvements to the challenging stochastic programming problem.
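To make the scenario-based reformulation concrete, the sketch below samples future prices for a few invented stocks and then checks a candidate package's expected gain and a 90% loss constraint empirically. It illustrates only the reformulation step, not the summary construction or the validation loop that make SummarySearch scale; the price distributions, loss cap, and candidate package are all made up.

# Scenario-based reformulation behind SummarySearch: sample future prices,
# then evaluate an expectation and a probabilistic constraint empirically.
# The distributions, loss cap, and candidate package are invented.
import random
random.seed(1)

stocks = {"A": (10.0, 14.0, 3.0),   # (current price, mean future price, stddev)
          "B": (25.0, 24.0, 6.0),
          "C": (40.0, 55.0, 15.0)}

def sample_scenario():
    """One Monte Carlo draw of future prices from the predictive model."""
    return {t: random.gauss(mu, sd) for t, (_, mu, sd) in stocks.items()}

scenarios = [sample_scenario() for _ in range(500)]

def expected_gain(package):
    """Empirical average of gain over all scenarios (stands in for E[gain])."""
    return sum(sum(s[t] - stocks[t][0] for t in package)
               for s in scenarios) / len(scenarios)

def loss_ok(package, max_loss=5.0, prob=0.9):
    """Does 'loss <= max_loss' hold in at least `prob` of the scenarios?"""
    ok = sum(1 for s in scenarios
             if sum(stocks[t][0] - s[t] for t in package) <= max_loss)
    return ok / len(scenarios) >= prob

candidate = {"A", "C"}
print(expected_gain(candidate), loss_ok(candidate))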
Our work on package query problems scratches the surface of the kinds of support database systems can provide for the larger class of constrained optimization problems and prescriptive analytics in general. In an upcoming bulletin, we describe many of the open research challenges in this space and our vision for an integrated predictive and prescriptive framework within database systems [1]. Reflecting on our work with EpiPolicy, which did not integrate any of its functionality within a database system, we know that there are still many missing pieces to achieving full in-database prescriptive analytics. The main missing piece is perhaps the right set of abstractions. We need to define abstractions that allow a system to automatically decide how to transform any decision-making problem into an appropriate formulation, whether it is a package query, a linear programming problem, a stochastic one, a constrained optimization problem, or even a reinforcement learning problem that involves a series of decisions over a time horizon. We also need abstractions that allow us to generalize across all forms of data: whether it is a set of tuples in a table, a set of simulated or sampled experiences, or data represented by a (predictive) model.
The emerging research area of in-database prescriptive analytics aims to provide seamless, domain-independent, declarative, and scalable approaches powered by the system where the data typically resides: the database. As database systems researchers, we are capable of tackling many of the challenges in enacting this vision of full in-database decision support, such as creating better interfaces and query languages to support the specification of decision-making problems, designing scalable algorithms and data structures for prescriptive analytics, developing probabilistic and approximate techniques to handle data and environment uncertainty, and designing data and model management systems that support tightly connected predictive and prescriptive frameworks. This is an open area with lots of exciting research problems, and you can help!
This work would not have been possible without my amazing collaborators, students, and friends: Dennis Shasha, Anh Mai, Zain Tariq, Miro Mannino (the EpiPolicy team), Alexandra Meliou, Peter J. Haas, Matteo Brucato, Anh (again), Riddho Haque (the constrained optimization team), and many other students and interns, as well as the public health officials who gave us some of their valuable time and feedback and generously shared their experiences. Many thanks to our funders: the NYUAD COVID-Facilitator fund, the ASPIRE Award for Research Excellence (AARE-2020) grant AARE20-307, the NYUAD Center for Interacting Urban Networks (CITIES), and Tamkeen under the NYUAD Research Institute Award CG001.
Azza Abouzied’s research work focuses on designing intuitive data querying tools. Today’s technologies are helping people collect and produce data at phenomenal rates. Despite the abundance of data, it remains largely inaccessible due to the skill required to explore, query and analyze it in a non-trivial fashion. While many users know exactly what they are looking for, they have trouble expressing sophisticated queries in interfaces that require knowledge of a programming language or a query language. Azza designs novel interfaces, such as example-driven query tools, that simplify data querying and analysis. Her research work combines techniques from various research fields such as UI-design, machine learning, and databases. Azza Abouzied received her doctoral degree from Yale in 2013. She spent a year as a visiting scholar at UC Berkeley. She is also one of the co-founders of Hadapt – a Big Data analytics platform.
[1] Azza Abouzied, Peter J. Haas, and Alexandra Meliou. “In-Database Decision Support: Opportunities and Challenges”. In: IEEE Data Engineering Bulletin (Sept. 2022).
[2] Matteo Brucato, Azza Abouzied, and Alexandra Meliou. “A Scalable Execution Engine for Package Queries”. In: SIGMOD Rec. 46.1 (May 2017), pp. 24–31. issn: 0163-5808. doi: 10.1145/3093754.3093761. url: https://doi.org/10.1145/3093754.3093761.
[3] Matteo Brucato, Azza Abouzied, and Alexandra Meliou. “Improving Package Recommendations through Query Relaxation”. In: Proceedings of the First International Workshop on Bringing the Value of ”Big Data” to Users (Data4U 2014). Data4U’14. Hangzhou, China: Association for Computing Machinery, 2014, pp. 13–18. isbn: 9781450331869. doi: 10.1145/2658840.2658843. url: https://doi.org/10.1145/2658840.2658843.
[4] Matteo Brucato, Azza Abouzied, and Alexandra Meliou. “Package Queries: Efficient and Scalable Computation of High-Order Constraints”. In: The VLDB Journal 27.5 (Oct. 2018), pp. 693–718. issn: 1066-8888. doi: 10.1007/s00778-017-0483-4. url: https://doi.org/10.1007/s00778-017-0483-4.
[5] Matteo Brucato, Azza Abouzied, and Alexandra Meliou. “Scalable Computation of High-Order Optimization Queries”. In: Commun. ACM 62.2 (Jan. 2019), pp. 108–116. issn: 0001-0782. doi: 10.1145/3299881. url: https://doi.org/10.1145/3299881.
[6] Matteo Brucato et al. “Scalable Package Queries in Relational Database Systems”. In: Proc. VLDB Endow. 9.7 (Mar. 2016), pp. 576–587. issn: 2150-8097. doi: 10.14778/2904483.2904489. url: https://doi.org/10.14778/2904483.2904489.
[7] Matteo Brucato et al. “SPaQLTooLs: A Stochastic Package Query Interface for Scalable Constrained Optimization”. In: Proc. VLDB Endow. 13.12 (Aug. 2020), pp. 2881–2884. issn: 2150-8097. doi: 10.14778/3415478.3415499. url: https://doi.org/10.14778/3415478.3415499.
[8] Matteo Brucato et al. “Stochastic Package Queries in Probabilistic Databases”. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. SIGMOD ’20. Portland, OR, USA: Association for Computing Machinery, 2020, pp. 269–283. isbn: 9781450367356. doi: 10.1145/3318464.3389765. url: https://doi.org/10.1145/3318464.3389765.
[9] Kevin Fernandes et al. “PackageBuilder: Querying for Packages of Tuples”. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. SIGMOD ’14. Snowbird, Utah, USA: Association for Computing Machinery, 2014, pp. 1613–1614. isbn: 9781450323765. doi: 10.1145/2588555.2612667. url: https://doi.org/10.1145/2588555.2612667.
[10] Peter J. Haas et al. “Data is Dead… without What-If Models”. In: Proc. VLDB Endow. 4.12 (Aug. 2011), pp. 1486–1489. issn: 2150-8097. doi: 10.14778/3402755.3402802. url: https://doi.org/10.14778/3402755.3402802.
[11] Anh Le Xuan Mai et al. “EpiPolicy: A Tool for Combating Epidemics”. In: XRDS 28.2 (Jan. 2022), pp. 24–29. issn: 1528-4972. doi: 10.1145/3495257. url: https://doi.org/10.1145/3495257.
[12] Laurynas Siksnys and Torben Bach Pedersen. “Demonstrating SolveDB: An SQL-Based DBMS for Optimization Applications”. In: 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 19-22, 2017. 2017, pp. 1367–1368. doi: 10.1109/ICDE.2017.180. url: https://doi.org/10.1109/ICDE.2017.180.
[13] Zain Tariq et al. “Planning Epidemic Interventions with EpiPolicy”. In: The 34th Annual ACM Symposium on User Interface Software and Technology. UIST ’21. Virtual Event, USA: Association for Computing Machinery, 2021, pp. 894–909. isbn: 9781450386357. doi: 10.1145/3472749.3474794. url: https://doi.org/10.1145/3472749.3474794.
Copyright © 2022, Azza Abouzied. All rights reserved.