February 9, 2017
The big data era (now more than a decade old) ushered in a dramatic shift in mentality, whereby enterprises increased their appetite for making “data-driven” decisions. To a large extent, most enterprises have gotten very good at collecting data. Now, the next big challenge is to unlock the non-trivial insights that are hidden in that data. There are large inefficiencies in how data is analyzed today, and that needs to change.
Let me illustrate with an example. Consider a subscription-based consumer company, like an insurance provider or a cell phone provider. Now let’s walk through what happens when a high-level executive/CXO in such an organization wants to ask a “simple” business question: Why do my customers churn?
Today, this question gets passed down the chain of command from the CXO to a Vice President to a Director and then to an engineer. The engineer, who is likely called a data scientist, now proceeds to put together a workflow to answer this question. Assembling this workflow often amounts to gluing scripts to select the appropriate data, select features, create a few (machine learning) models, test the models, and repeat this exercise many times over. All along, the data scientist uses code fragments obtained from previous exercises and/or searches the web for appropriate code fragments. Finally, a program is put together to answer the question at hand. Once there is an answer, the data scientist invokes the chain of command, this time in the reverse direction, to send the answer back to the CXO.
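To make the routine nature of this workflow concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the customer records are synthetic, the field names (charge, calls, churned) are hypothetical, and the “model” is a one-feature threshold rule standing in for a real machine learning library.

```python
import random


def make_customers(n, seed=0):
    """Select the data. Here it is synthetic, purely for illustration."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        charge = rng.uniform(20, 120)   # hypothetical monthly charge
        calls = rng.randint(0, 8)       # hypothetical number of support calls
        churned = calls >= 5            # synthetic rule baked into the fake data
        rows.append({"charge": charge, "calls": calls, "churned": churned})
    return rows


def split(rows, test_fraction=0.25):
    """Hold out the tail of the data for testing."""
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def accuracy(rows, feature, threshold):
    """Fraction of rows where 'feature >= threshold' matches the churn label."""
    hits = sum((r[feature] >= threshold) == r["churned"] for r in rows)
    return hits / len(rows)


def fit_stump(rows, feature):
    """Create a model: the single threshold that best separates churners."""
    best_threshold, best_accuracy = None, -1.0
    for threshold in sorted({r[feature] for r in rows}):
        score = accuracy(rows, feature, threshold)
        if score > best_accuracy:
            best_threshold, best_accuracy = threshold, score
    return best_threshold


def run_workflow(features=("charge", "calls")):
    """Select features, fit one model per feature, test, and keep the best."""
    train, test = split(make_customers(400))
    results = {}
    for feature in features:
        threshold = fit_stump(train, feature)
        results[feature] = accuracy(test, feature, threshold)
    best = max(results, key=results.get)
    return best, results


if __name__ == "__main__":
    best, results = run_workflow()
    print("most predictive feature:", best)
    print("held-out accuracy per feature:", results)
```

Every function here is boilerplate that gets rewritten, with small variations, for each new question. A real pipeline would swap in real data and a real modeling library, but the shape of the loop stays much the same.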
As one can imagine, this process is inefficient in terms of both the time and the resources that are spent in finding the answer. This inefficiency is particularly stark because in many (note, I do not say “all”) cases the actual work involved in creating the data science code is fairly routine. The existing process is also error-prone, as subtle programming bugs may go unnoticed for a long time. And it is often not easily reproducible, as the assembled code is often best understood and run only by the engineer who put it together.
What if there were a method to take the initial question that the CXO asked and to answer that question using a computing artifact that knew how to put together the required code quickly, correctly, and in a way that was perfectly reproducible? In this scenario, many (again, not all) business questions could be answered immediately and efficiently. The interface for this artifact could be a chatbot that converts conversation to code.
This overall vision is quite challenging, as one of the problems is to unambiguously understand the natural language input. This problem is closely related to the “Imitation game” challenge proposed by Turing in 1950. We have taken an initial step in this direction by skirting the Imitation game challenge. Our approach is to use a chatbot mechanism to carry out a conversation with the user, and to synthesize the required code from that conversation. This initial system is called Ava, and it was just presented at CIDR. Ava is limited in scope today, but it is still quite powerful as its “programming interface” is a conversation. Many steps in the data science process are automated, and a number of safeguards are built in. For example, when loading a file at the request of a user, Ava can automatically detect that the file has missing values. It then prompts the user for how s/he wants to deal with the missing values, and makes suggestions about the alternatives that can be used. Similarly, Ava makes suggestions about the machine learning models that are likely appropriate for the task at hand. All conversations with Ava are in a constrained natural language, and the conversations are bounded to pre-defined “conversation paths”. Admittedly, there are many limitations with this initial approach (and hence lots of opportunities for future research), but it is a step towards constructing data analysis pipelines using chatbots.
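To give a flavor of what a pre-defined conversation path might look like, here is a small sketch in Python. It is not Ava's implementation: the state names, the prompts, the accepted replies, and the emitted code fragments are all hypothetical, chosen only to show how a bounded dialog can be turned into a script. The emitted fragments assume that pandas and scikit-learn would be available wherever the generated script is eventually run.

```python
# Hypothetical conversation path. Each state holds the question the bot asks,
# the replies it accepts, the code emitted for each reply, and the next state.
CONVERSATION_PATH = {
    "missing_values": {
        "prompt": "The file has missing values. Reply 'drop' or 'mean'.",
        "choices": {
            "drop": "df = df.dropna()",
            "mean": "df = df.fillna(df.mean(numeric_only=True))",
        },
        "next": "model",
    },
    "model": {
        "prompt": "Which model? Reply 'logistic regression' or 'decision tree'.",
        "choices": {
            "logistic regression": (
                "from sklearn.linear_model import LogisticRegression\n"
                "model = LogisticRegression()"
            ),
            "decision tree": (
                "from sklearn.tree import DecisionTreeClassifier\n"
                "model = DecisionTreeClassifier()"
            ),
        },
        "next": None,
    },
}


def has_missing_values(rows):
    """Safeguard: detect missing values (None) in the loaded rows."""
    return any(value is None for row in rows for value in row.values())


def synthesize(rows, file_name, replies):
    """Walk the conversation path and emit a code fragment for each reply."""
    code = ["import pandas as pd", f"df = pd.read_csv({file_name!r})"]
    # Only ask about missing values when the data actually has some.
    state = "missing_values" if has_missing_values(rows) else "model"
    replies = iter(replies)
    while state is not None:
        step = CONVERSATION_PATH[state]
        reply = next(replies).strip().lower()
        if reply not in step["choices"]:
            # The language is constrained: anything off the path is rejected.
            raise ValueError(f"{step['prompt']} (got {reply!r})")
        code.append(step["choices"][reply])
        state = step["next"]
    return "\n".join(code)


if __name__ == "__main__":
    rows = [{"age": 34, "plan": "basic"}, {"age": None, "plan": "premium"}]
    print(synthesize(rows, "customers.csv", ["drop", "decision tree"]))
```

Because every reply maps to a known fragment, the same conversation always yields the same script, which is where the reproducibility comes from. The price is that anything outside the path is simply rejected.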
In a limited test of this system we asked 16 data scientists to create a model for a Kaggle task using either Ava or Python. We found that the productivity in creating a model was nearly an order of magnitude higher with Ava!
Many of you may have noted a subtle but important difference in the use of natural language in Ava compared to previous work on natural language querying (such as CHILL, Microsoft English, Nalix, and PRECISE, which have inspired our work). A key difference is that Ava, and chatbots in general, lean heavily on natural language translation (NLT) rather than natural language understanding (NLU). NLT is largely a solved problem today, whereas there is still a lot of work to do in perfecting NLU. Thus, chatbots can be deployed successfully today, and they can continually leverage improvements in NLU to increase the sophistication of the conversation. An even more important aspect of chatbots is that they force the bot creator to decompose the overall task into a sequence of well-defined and composable sub-tasks that have natural boundaries from the perspective of the human who is driving the bot. A layer of dialog is then used as “glue” to help put the required sub-tasks together in the right sequence. This focus on boundaries that are natural to the human is crucial for bots, and distinguishes them from workflow-based approaches that don’t always have this laser focus on the end-user perspective. (On a side note, various efforts in our community in related areas, such as provenance, are crucial for bot-based systems. However, that is a longer discussion, and likely another blog article.)
All this work is set against a larger sea change in which bot-based automation in consumer and enterprise applications is about to relieve people of the boring and mundane tasks that they carry out today. Bots will soon do many such tasks far more effectively. There is of course a real and related question about the larger societal impact of these productivity gains. Economists have a term, “technological unemployment”, to describe the displacement of jobs by technology. Its long-term impact on employment is hotly debated. We in Computer Science are frantically creating this automation technology, and there is no stopping the technology from being created (so let’s not argue about that). But we need to think about what we can do to better prepare our students for this automated future. This is a tough question, and I’m now delving into a topic well beyond the original scope of this blog. So, I end by noting that it will be crucial to teach our students a few things in depth (so that they can build the core technologies that power bots), and to teach them how to learn and adapt quickly to changing technologies.
Jignesh Patel is a Professor in Computer Sciences at the University of Wisconsin-Madison. His papers have been selected as the best papers in the conference at VLDB (2012), SIGMOD (2011) and ICDE (2010, 2011). He has a strong interest in seeing research ideas transition to actual products. His Ph.D. thesis work was acquired by NCR/Teradata in 1997. In 2007 he founded Locomatix, which became part of Twitter in 2013, and seeded the technology that became Heron. Heron now powers all real-time services at Twitter. His last company, Quickstep Tech. was acquired by Pivotal in 2015. He also enjoys teaching and is the recipient of the Wisconsin “COW” Teaching Award, and the U. Michigan College of Engineering Education Excellence Award. He is an ACM Fellow, and serves on the board of Lands’ End and a number of technology startups.
Copyright © 2017, Jignesh Patel, All rights reserved.