February 23, 2022
We expect virtually every database to be on Cloud shortly. All database vendors strive to survive in this competing market. In the era of cloud computing, databases will become easily accessible, and thus even non-technical users want to use such databases like their TVs. In this situation, we need to interact with databases through accessible interfaces rather than using formal query languages such as SQL. The natural language (NL) based interface is the most intuitive and easiest way to compose queries, which has recently attracted attention. The natural language (NL) based interface is the most intuitive and easiest way to compose queries, which has recently attracted attention.
This blog post explains what NL-based exploration in a chatbot is. I also review the state-of-the-art methods related to this problem. I then explain what kind of challenges we need to address in a few years. Note that this post is based on my mini-keynote speech in SIGMOD 2021.
Chatbots are the most popular ways to interact with people. A chatbot is essentially a robot we can chat or interact with online via text or text-to-speech. There are diverse chatbots in many domains. Examples include Google Assistant (a more general-purpose chatbot), HEEK, a website creation chatbot, and Amtrack, a rail passage service chatbot.
Data exploration with ordinary people could be possible with special chatbots. In our group, we have been developing SQLBot , a special chatbot for enabling exploration with relational databases. Figure 1 shows the screenshot of SQLBot. The left pane shows a database schema, while users can type an NL question in the chatbox of the right pane. SQLBot shows query results in various visual formats (table, pie chart, histogram, etc.). The SQLBot computes a confidence score for the translated SQL query for a given NL question.
I think the data exploration chatbots need to support the following three functionalities. The first and foremost part is the accurate translation of NL to SQL (NL2SQL). The second part is to visualize query results effectively. This part is rule-based since the output schema of an SQL query and its query results can determine visual formats. Alternatively, one can give NL instructions on visualizing the query results (NL2VIS) . The last part is how to simulate human conversation in the context of data exploration with relational databases. Here, we should consider a dialogue history. Looking at the dialogue in Figure 1, SQLBot automatically recognizes the dialogue action type  of the current NL question. Here, “inform_sql” displayed below an NL question means an SQL query can answer the user question. When user questions are ambiguous or unrelated, they are classified as “ambiguous” or “cannot_answer,” respectively.
NL2SQL finds an SQL statement to answer an NL question on a given relational database. More formally, given a relational database D and an NL question qnl to D, NL2SQL executes a translated SQL statement qsql to answer qnl. . The translated query qsql must be semantically equivalent to a gold query qgold. Given two SQL queries qsql and qgold, they are semantically equivalent if and only if they always return the same results on any input database instance.
Chat-80 is one of the first relational chatbots developed by Pereira and Warren in 1982  (Please see the history of NL2SQL (Figure 2)). This bot can answer questions on a specific domain, geography, and they use a manually-built lexicon and grammar.
Modern NL2SQL systems encode two sequences using a transformer-based encoder: an NL query and a schema (mainly in a linearized form of schema information). The question encoder encodes a series of NL tokens into their deep representations (a sequence of vectors), while the schema encoder encodes a graph-structured schema into the corresponding deep representations. Unlike sentences, we must consider complex relationships between nodes in the schema graph. One could use a single transformer to encode both sequences. Alternatively, we can use separate encoders for encoding individual sequences.
When encoding the two sequences, modern NL2SQL systems exploit schema/value linking, enabling the model to align column/table references to the corresponding columns/tables. This information captures latent linking between question and table or columns. Typically, three types of linkings are considered: a question token to a column name; a question token to a table name; and a value token to a column name.
Figure 3 shows three different linkings between the given natural language question and the schema.
The word stadium in the query is linked to the table stadium (question-to-table linking) and the column stadium_ID (question-to-column linking) using exact and partial matching, respectively. The token Super bootcamp is linked to the column concert_Name (question-to-value linking), since the value “Super bootcamp” is contained in the concert_Name column.
Once schema/value linking is complete, modern NL2SQL systems perform relation-aware self-attention  to effectively bias the transformer toward pre-defined relations such as QUESTION-COLUMN and QUESTION-TABLE. For each input token, self-attention computes an output vector as a contextualized representation for all input tokens (input vectors) using weighted sums of input vectors. Here, a weight learned during training represents the relative similarity between two input vectors.
Figure 4 describes the attention weights of the token stadium using relation-aware self-attention. The token stadium in the NL question has a strong attention weight to the column name, stadium_ID, since the schema linking module captures their relation. The token, stadium, is also strongly linked to the table name, stadium, while a significantly lower attention weight is allocated to the pair of the tokens, (stadium, concert).
In conversational NL2SQL, capturing context during dialogue is the most crucial issue to address. For example, consider the first two questions, “How many dorms have a TV Lounge” and “what is the total capacity of these dorms. Here, we need to understand what these dorms refer to, which is essential in a translation.
Given an NL2SQL system, one can extend it to support dialogue-based NL2SQL by considering dialogue history. For example, a state-of-the-art system, SCoRe , extends RatSQL  in two aspects. 1) SCoRe pretrains a language model for NL2SQL using additional loss functions, injecting context-related inductive bias into the language model. 2) The extended RatSQL additionally takes a dialog history to understand the context of the dialogue. To consider the multi-turn dynamics of conversation, SCoRe extensively uses synthetically generated data in the pretraining. SCoRe first extracts templates of question-SQL query pairs from the dialogues in the training data of existing benchmarks such as SParC. It then generates synthetic data by applying those templates to web tables.
The state-of-the-art NL2SQL systems archive about 75% accuracy in the Spider leader board . Despite their high accuracy, one may wonder whether the state-of-the-art NL2SQL system can accurately translate queries in Chat-80. We tested six queries from the VLDB 81 paper . For experiments, we chose RatSQL, which is one of the state-of-the-arts.
Q1) Which employees aged over 40 on the first floor earn more than their managers?
Q2) Which countries bordering the Mediterranean border Asian countries?
Q3) Which is the ocean that borders African countries and (that borders) Asian countries?
Q4) What are the countries bordering the Soviet Union whose population exceeds the population of the United Kingdom?
Q5) Which American countries do not border the Pacific?
Q6) Which countries border exactly two seas?
Although some queries (e.g., Q1, Q5, and Q6) seem easy to translate, surprisingly, the state-of-the-art NL2SQL system does not correctly predict any query used in VLDB 1981. The modern NL2SQL system does not accurately handle complex join conditions and (correlated) nested queries. Translating NL questions involving group by/having clauses is also challenging due to the limited scope of the NL2SQL system.
Consider Q1. The following table shows the gold query and the translated query for Q1. As we see here, RatSQL fails to isolate independent parts of a complex NL question. Here, employee tables are referenced twice within a nested query.
NL-based data exploration has a high potential due to the advance of deep learning techniques. I explained three functionalities in the NL-based exploration chatbot. I then explained how the state-of-the-art NL2SQL system (i.e., the most core part in NL-based exploration chatbot) translates NL questions into SQL queries. Based on a simple experiment using a set of queries used in Chat-80, I listed some critical issues that hinder the deployment of data exploration chatbots in real-world relational databases. I believe that improving NL2SQL systems requires real-world databases/workloads which are much more complex than existing benchmark queries based on toy databases. I encourage more people in our database community as domain experts to work in this area.
Wook-Shin Han leads the database group at POSTECH. He is currently the Vice President for the Office of Academic Information Affairs, a team leader for the Brain Korea 21 Program for Leading Universities & Students (BK 21) project for AI, and a Full Professor in the Department of Computer Science and Engineering at POSTECH. He has obtained his Ph.D degree from KAIST in 2001. His primary research efforts have been devoted to developing new techniques in DBMS “engine research.” His group recently developed three systems called TurboGraph++ (SIGMOD2018), iTurboGraph (SIGMOD2021) and TurboFlux (SIGMOD2018) for trillion-scale, incremental graph analytics. He regularly serves as a PC member for SIGMOD, VLDB, and ICDE. He also serves/served as an associate editor of several international journals including the VLDB Journal, IEEE TKDE, and SIGMOD Record.
2. Luo, Y., et al. Synthesizing natural language to visualization (NL2VIS) benchmarks from NL2SQL benchmarks. in SIGMOD. 2021.
3. Yu, T., et al. CoSQL: A conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases. in EMNLP. 2019.
4. Kim, H., et al., Natural language to SQL: where are we today? Proceedings of the VLDB Endowment, 2020. 13(10): p. 1737-1750.
5. Warren, D.H. Efficient processing of interactive relational database queries expressed in logic. in VLDB. 1981.
6. Shaw, P., J. Uszkoreit, and A. Vaswani. Self-attention with relative position representations. in ACL. 2018.
7. Yu, T., et al. SCoRe: Pre-training for context representation in conversational semantic parsing. in ICLR. 2020.
8. Wang, B., et al. Rat-sql: Relation-aware schema encoding and linking for text-to-sql parsers. in ACL. 2020.
Copyright @ 2022, Wook-Shin Han, All rights reserved.