October 10, 2024
The past few years of generative AI have upended research agendas across academia. Having just spent my sabbatical in the Bay Area, where the San Francisco fog is mixed with a tinge of forest fire and LLMs, I wanted to reflect on the role of the academic database research community within this sea change from the perspective of competitive advantage. The outlook is hazy, but there is room for hope.
Academic database research started around RDBMSes. While we have worked on an amazing array of data-related topics, our most visible, successful, and influential core has remained the RDBMS. Today, existing RDBMSes are ubiquitous and good enough. Continuing to improve RDBMS technology is helpful, but not necessarily competitive with industry, and the gains are increasingly marginal: we are polishing a round ball. A promising direction is AI, but while AI is “an application of data,” it doesn’t seem like AI needs us.
We, as a community, have to choose along a continuum with two extremes. One extreme is to seek critical problems that are both sufficient to sustain a research community and that we are uniquely suited to solve. The other is to become a diffuse community centered around “data-thinking”. Wherever we choose to land along this continuum warrants serious discussion.
I use the terms “we,” “community,” or “database community” to refer to the academic database community, distinct from industry and open source.
A first-order approximation of databases as a field is that our influence and success have centered around realizing Codd’s vision [1]: the relational database management system (RDBMS) as the system for managing structured data, and getting it into the hands of as many users as possible. Unlike other academic fields, this goal centers on building an artifact: the RDBMS. Because of that, the goal is concrete, actionable, and achievable. It ties experts from across computer science (theory, systems, PL, ML, HCI, architecture, and more) into a cohesive database community.
RDBMSes embody three principles: the one true relational model, data independence through a declarative abstraction (e.g., SQL), and transactional semantics. These principles are so foundational that no modern organization can operate without an RDBMS. There was so much demand for RDBMSes that, over the past six decades, we could simply focus on answering the major questions needed to make them ubiquitous.
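For readers outside the area, here is a minimal sketch of all three principles at once, using Python’s built-in sqlite3 module (the schema and the transfer are purely illustrative):

```python
# A minimal sketch of the three RDBMS principles: a relational schema,
# a declarative SQL query, and a transaction that commits or rolls back
# as a unit. Uses only Python's standard library.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])

# Data independence: we declare *what* we want, not how to compute it.
total = conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]

# Transactional semantics: the transfer is atomic -- both updates or neither.
try:
    with conn:  # commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 25 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 25 WHERE id = 2")
except sqlite3.Error:
    pass  # the rollback already restored a consistent state
```

Six decades of research hide behind those few lines: the same program runs unchanged whether the table holds ten rows or ten billion, on one machine or a cluster.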
The insatiable craving for RDBMSes was such that nothing else mattered as long as it was satisfied. The world needed RDBMSes that anyone could set up, manage, and scale. It ultimately didn’t matter that we were side-tracked by non-relational data models and systems, like object-oriented databases and XML. Or that we ceded web research to the information retrieval community. Or that we arrived late to web-scale distributed systems and had to play catch-up. It didn’t matter because the beast was hungry, and only we could keep it fed.
Today, RDBMSes are used by everyone, run everywhere, and scale up to thousand-machine clusters and down to lightweight sensors and browsers. SQL consistently ranks among the most popular programming languages in developer surveys. We have trained generations of database researchers who now hold top positions in universities and technology companies, and capable engineers who are building incredible database services. There are hundreds of data systems today!
That is a lot to be proud of. For six decades, the world needed databases, and databases needed us to grow. We strove to make databases ubiquitous, and we delivered. We won.
What now?
Well, as an individual researcher or research group within the database community, nothing changes. RDBMSes can be made easier to manage, more usable, cheaper, faster, and easier to construct. New hardware still requires system redesign. More applications can benefit from declarative thinking and abstractions. And there are so many ways to integrate machine learning into data systems.
Unfortunately, as an academic community, I worry that these offer increasingly marginal benefits to the world. While they are excellent database problems, internal validation doesn’t shield us from external competition for resources (funding, students, and attention) against the capitalist marketplace and the industrial-academic complex. If the world doesn’t look at a problem and think, “crap, we need a database academic,” then the resources will shift to a competitor (increasingly AI).
This means that “Contributing” isn’t good enough. “Novel” isn’t good enough. Patting ourselves on the back isn’t good enough. We need to own a problem space completely.
We must keep asking: “Does this problem matter to the world? And compared to all of industry and academia, is academic database research necessary to solve it?”
Let me make this argument concrete by analyzing our competitive advantage. I hear the world has an insatiable appetite for AI. So much so that Larry Ellison and Elon Musk spent a dinner begging Jensen Huang of NVIDIA to take more of their money. $2 trillion of cravings for GPUs.
Let’s use a potential research direction in AI: what if we could query anything and everything in the world using LLMs?
This checks the first criterion, because analyzing documents, reviews, legalese, contracts, and more is hard. There is also a lot to learn: what do query execution and optimization over LLMs look like? How can LLMs aid optimizers? What if LLMs were access methods? I like it, you like it, and many of us are working on related projects!
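To make the shape of the idea concrete, here is a minimal sketch of an LLM-backed relational operator; the llm() call is a hypothetical stand-in for whatever provider API one might use, not any specific library’s interface:

```python
# A minimal sketch of a "semantic filter": an LLM-evaluated WHERE clause
# over a corpus of documents. Everything here is illustrative.
def llm(prompt: str) -> str:
    return "yes"  # hypothetical stand-in; replace with any real LLM call

def sem_filter(rows, predicate: str):
    """Keep the rows that the model judges to satisfy the predicate."""
    for row in rows:
        answer = llm(f"Answer yes or no. {predicate}\n\nRecord: {row}")
        if answer.strip().lower().startswith("yes"):
            yield row

contracts = [{"id": 1, "text": "...indemnification..."}]
risky = list(sem_filter(
    contracts, "Does this contract contain an indemnification clause?"))
```

A query optimizer’s job would then be to decide how to batch, cache, reorder, and approximate such calls, rather than leaving each developer to hand-tune them.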
But, are we uniquely positioned to dominate this problem?
On the one hand, of course! Writing programs using LLMs is hard, optimization is painful and manual, and LLMs are unreliable in a way that makes me want to cry. Declarative is tried and true, and we are the experts in cost-based optimization and approximation guarantees, distributed data processing, and heterogeneous computing environments. They can do RAG but we can do it better.
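To illustrate the kind of leverage cost-based thinking could give here, consider a toy model cascade; the models, per-call costs, and routing rule are all hypothetical assumptions, not any system’s actual design:

```python
# Toy cost-based routing in the spirit of classic query optimization:
# send each predicate call to a cheap model first, escalating to a strong
# model only when the cheap one is unsure and budget remains.
COST = {"cheap": 0.0001, "strong": 0.01}  # assumed dollars per call

def cheap_llm(prompt: str) -> str:
    return "unsure"  # hypothetical stand-in for a small, fast model

def strong_llm(prompt: str) -> str:
    return "yes"     # hypothetical stand-in for a large, accurate model

def cascade_filter(rows, predicate: str, budget: float):
    """An LLM-backed filter that trades cost for quality under a budget."""
    spent, kept = 0.0, []
    for row in rows:
        prompt = f"Answer yes, no, or unsure. {predicate}\n\nRecord: {row}"
        answer = cheap_llm(prompt).strip().lower()
        spent += COST["cheap"]
        if answer.startswith("unsure") and spent + COST["strong"] <= budget:
            answer = strong_llm(prompt).strip().lower()
            spent += COST["strong"]
        if answer.startswith("yes"):
            kept.append(row)
    return kept, spent

kept, spent = cascade_filter([{"id": 1}], "Is this record an anomaly?", 1.0)
```

Whether such rules survive contact with weekly model releases is exactly the worry below.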
On the other hand, LLMs are similar to people, and crowdsourced databases weren’t exactly a hit. RDBMSes are designed for predictable hardware that co-evolves over decades, yet LLMs as a compute substrate evolve weekly: optimization rules based on yesterday’s trade-offs quickly become slower, more expensive, or lower quality. LLMs are also hard to reason about, because responses change for no apparent reason, their tunable parameter is “any text,” and there are no semantics from which to build correctness benchmarks, only vibes.
To reiterate, of course academic researchers should work on this because it’s hard, fun, exciting, and might lead to new insights. However, cloud database companies, open-source projects, LLM companies, and well-funded startups are all well-positioned to work on this too. So is the rest of academia: the NLP community for text, and the CV community for video.
We can help. We can optimize. We can create frameworks. But which part of this problem is solely an academic data management problem? And is that part the linchpin for AI?
Similar reasoning should be applied to data preprocessing for AI, video querying, provenance for AI, NL2SQL, ML training and serving, prompt engineering for data tasks, and everything else.
This argument regularly repeats itself in our community (see DeWitt’s 1995 keynote, Stonebraker’s 2018 Ten Fears talk, or my own 2019 CIDR gong show talk). But concerns should be paired with action, so let me offer three directions along a continuum of decreasing ambition, all better than continuing to work on yesterday’s problems.
The first direction is to find our next north star. To do so, let’s keep asking: “Does this problem matter to the world? And compared to all of industry and academia, is academic database research necessary to solve it?”
We cannot be content with problems that do not transform how the world works. We cannot be content with problems where our expertise is nice to have or even competitive—we must be necessary.
I suggest that we make these questions a criterion for vision papers, dedicate a PC Chair with strong views and excellent taste to identify these papers, and explicitly reward such projects. Let’s host a series of panels across our major conferences to debate this issue.
And industry must actively participate because, while we provide high-quality training, a community cannot subsist on training alone, nor on near-term problems. It survives, thrives, and attracts the best students because there is a problem so large, so important, and so unique to data management that hundreds of future database researchers are drawn together for another half-century to tackle it.
A world-changing problem is a tall order. The second direction, then, is to articulate Essential Capabilities that are missing today. An Essential Capability enables applications that simply do not and cannot otherwise exist. For instance, artificial general intelligence is one such capability in AI. We also need venues and events to debate and reason through such capabilities.
But let me throw out a few random examples. If two companies can combine their databases and their applications magically still work, what is made possible? If it is possible to cheaply monitor and manage all data flows throughout an organization, does it change the nature of security? Systems need to load their program state into the DBMS to enjoy its many features; can those features be compiled and run directly on the program state without running a DBMS?
The third direction is to accept that data management is a state of mind, akin to design thinking: practitioners approach a new data problem by identifying a structured data model and a domain-specific language that describes the desired computation, and then instantiating them using data-processing primitives. The Human-Computer Interaction community embodies such a field, and conferences like CHI (the Conference on Human Factors in Computing Systems) are a cornucopia of diverse problems and methods drawn together by a shared belief.
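To make that recipe concrete, here is a toy, entirely hypothetical instance applied to server logs (every name is illustrative):

```python
# A toy instance of the data-thinking recipe, outside any DBMS:
# (1) a structured data model, (2) a tiny declarative description of the
# computation, and (3) execution via generic data-processing primitives.
from collections import Counter

# (1) Data model: each log line becomes a (timestamp, level, message) tuple.
logs = [
    ("2024-10-10T09:00", "ERROR", "disk full"),
    ("2024-10-10T09:01", "INFO",  "request ok"),
    ("2024-10-10T09:02", "ERROR", "disk full"),
]

# (2) Declarative spec: what we want, not how to compute it.
query = {"filter": lambda t: t[1] == "ERROR", "group_by": lambda t: t[2]}

# (3) Execution with generic primitives (filter + group/count).
counts = Counter(map(query["group_by"], filter(query["filter"], logs)))
print(counts)  # Counter({'disk full': 2})
```

The point is not the ten lines of Python; it is that the model/language/primitives decomposition transfers to domains far from a DBMS.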
If this is the case, then our community is tasked with becoming domain experts who evangelize declarative thinking to cultures throughout the world, of which AI is just one. We must clarify what data- (or declarative-? scalable-?) thinking means. We must broaden our conference programs to embrace any and all data problems, so that biologists, sociologists, industrialists, and other domain experts want to publish with us.
It is inevitable that the community will evolve, but how we do so should be a choice and not an accident. Let’s do so while keeping in mind that academic research is an investment by society to study impractical problems today that might become new industries in the future.
I look forward to discussion and feedback. Please reply in the comments, on Twitter, in person, or over email.
Should we organize a SIGMOD 2025 panel? Share your opinion.
Eugene Wu is broadly interested in technologies that help users play with their data. His goal is for users at all technical levels to effectively and quickly make sense of their information. He is interested in solutions that ultimately improve the interface between users and data, and he uses techniques borrowed from fields such as data management, systems, crowdsourcing, visualization, and HCI. Eugene Wu received his Ph.D. from MIT and his B.S. from Cal, and was a postdoc in the AMPLab.
Eugene Wu has received the VLDB 2018 10-year test of time award, best-of-conference citations at ICDE and VLDB, the SIGMOD 2016 best demo award, the NSF CAREER, and the Google, Adobe, and Amazon faculty awards.