Courting ML: Witnessing the Marriage of Relational & Web Data Systems to Machine Learning

Big Data, Databases, Interview, Machine Learning
The web is an ever-evolving source of information, with data and knowledge derived from it powering a great range of modern applications. Accompanying the huge wealth of information, web data also introduces numerous challenges due to its size, diversity, volatility, inaccuracy, and contradictions. The WebDB 2018 theme emphasizes the challenges and opportunities that arise at the intersection of web data and machine learning research. On one hand, a large portion of web data fuels ML, with novel applications such as predictive analytics, Q&A chat bots, and content generation. On the other hand, the new wave of ML technology has found its way into traditional web data challenges, with contributions such as web data extraction with deep learning, and using ML to optimize data processing pipelines.

To kick-start the conversation on research at the crosshairs of ML and data, we interviewed Luna Dong (Amazon Research), Alkis Polyzotis (Google), Jens Dittrich (Saarland University), Arun Kumar (University of California, San Diego) and Peter Bailis (Stanford University). Below you will find their bios. We selected this diverse set of academic and industrial, systems and theoretical researchers to better understand the quickly evolving research field of Machine Learning and Database Systems. We asked them about their motivation for working in this field, their current work and their view on the future. We summarize our interviews along the following four questions.

Q1. What are the key challenges in ML from a data systems perspective? Can you provide a concrete example of something that is missing or needs new solutions?

Luna: Despite their effectiveness and promise, the success of ML solutions comes with a price: high-quality results require an enormous number of training examples. For example, Google showed that even after providing 300 million images to train a deep learning image recognition model, no flattening of the learning curve was observed. Collecting so much training data presents a big challenge for some data problems, especially data integration and data cleaning.

Data integration has been aiming to seamlessly integrate data from many data sources, such as millions or even billions of web sources. These sources observe more or less different schemas to describe a domain, represent the same entity using more or less different attribute values, format the data in different ways, and provide data with various characteristics. Even more, the data are evolving over time! To integrate data without compromising the quality, we need to supply a large volume of training data for each data source, possibly for each domain and each data type, and we need to update the training examples over time. This is why supervised training, in spite of its huge success in many other domains, appears to be infeasible for data integration, and for a long time researchers have been focusing on optimization or unsupervised learning to solve the integration problem. Similar problems are present for data cleaning, since there are so many different ways to make mistakes.

In my mind, the real breakthrough for data integration and cleaning will happen when we are able to find an effective way to apply supervised learning for millions of data sources and data domains. To achieve this goal, we shall resort to all related techniques, such as applying active learning to generate the most useful training labels, applying transfer learning to transfer the model we have trained on one data source to a new data source, and applying reinforcement learning to balance exploration and exploitation. All of these require seamlessly combining human and machine, also called human in the loop. However, it should not be a system that simply makes it easy for humans, such as data scientists, to ask for labels and apply training; instead, it should be a system that treats machines, labelers, and data scientists all as resources, and allocates these resources to achieve the best long-term effectiveness.
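To make the active-learning ingredient concrete, here is a minimal sketch of uncertainty sampling, one common active-learning strategy (chosen here for illustration; the interview does not name a specific strategy). It assumes a binary matcher that outputs, for each candidate record pair, the probability that the two records refer to the same entity. The function name, the example probabilities, and the labeling budget are all hypothetical.

```python
def select_for_labeling(probabilities, budget):
    """Pick the examples the current model is least sure about.

    probabilities: predicted probability of the positive class per example
                   (hypothetical output of some binary matcher).
    budget: how many examples we can afford to send to human labelers.
    Returns the indices of the `budget` examples closest to 0.5.
    """
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),  # 0.0 means maximally uncertain
    )
    return ranked[:budget]


# Hypothetical match probabilities for five candidate record pairs.
match_probabilities = [0.97, 0.52, 0.10, 0.45, 0.80]
to_label = select_for_labeling(match_probabilities, budget=2)
```

Labels for the selected pairs would then be added to the training set and the model retrained, so the limited labeling budget goes to the examples expected to be most informative.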

Alkis: At a high level, one can look at an ML system as a dataflow: training data comes in to the trainer to generate a model, and the latter is fed serving data to generate predictions online. Hence, it is interesting to consider the bottlenecks of this dataflow and ask how it can be optimized using techniques from data management and query processing. At the same time, it is equally important to consider the end points of this dataflow, namely the training and serving data, because, at the end of the day, fast training and serving are immaterial if the data is wrong. In this space, there are many interesting problems in data modeling, tracking, analysis, validation, cleaning, and I think that data-management research can offer interesting solutions.

As a concrete problem, consider the task of data validation where we want to answer a very basic question for ML: Are there any errors in the data? This is a fundamental question in productionizing ML and also in debugging the quality of an ML model. It is also surprisingly challenging to address, for several reasons: the common ML input formats provide limited semantics in order to judge “correctness”; not all data errors are important, since ML can be resilient to noise and input features have a different effect on model quality; errors are eventually surfaced to an on-call engineer to correct them, and so false-positive alarms are bad (too many such alarms make it hard to deal with actual errors); and, it is important to explain and localize these errors, so that the engineer can determine and fix the root cause for the error.
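A minimal sketch of what such a validation step might look like follows. It assumes a hand-written schema that adds the semantics the raw input format lacks (expected type, allowed range or domain, tolerated fraction of missing values) and reports each anomaly with the feature name and an example value, so that an engineer can localize it. The schema layout and the `validate` function are hypothetical illustrations, not the API of any particular system.

```python
def validate(rows, schema):
    """Check a batch of feature dicts against a schema; return readable anomalies."""
    anomalies = []
    total = len(rows)
    for name, spec in schema.items():
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        missing = total - len(present)
        # Tolerate some missing values to avoid false-positive alarms.
        if total and missing / total > spec.get("max_missing", 0.0):
            anomalies.append(f"{name}: {missing} of {total} values missing")
        wrong_type = [v for v in present if not isinstance(v, spec["type"])]
        if wrong_type:
            anomalies.append(
                f"{name}: expected {spec['type'].__name__}, got e.g. {wrong_type[0]!r}"
            )
            continue  # range and domain checks make no sense on mistyped values
        if "domain" in spec:
            unexpected = sorted(set(present) - set(spec["domain"]))
            if unexpected:
                anomalies.append(f"{name}: unexpected values {unexpected}")
        low, high = spec.get("min"), spec.get("max")
        out_of_range = [
            v for v in present
            if (low is not None and v < low) or (high is not None and v > high)
        ]
        if out_of_range:
            anomalies.append(
                f"{name}: {len(out_of_range)} value(s) outside [{low}, {high}], "
                f"e.g. {out_of_range[0]!r}"
            )
    return anomalies


# Hypothetical schema and serving batch.
schema = {
    "age": {"type": int, "min": 0, "max": 120},
    "country": {"type": str, "domain": {"US", "DE", "AE"}, "max_missing": 0.5},
}
batch = [
    {"age": 34, "country": "US"},
    {"age": -1, "country": "DE"},
    {"age": 51, "country": "XX"},
    {"age": 40},
]
anomalies = validate(batch, schema)
```

On this batch the sketch flags the negative age and the unknown country code, but not the single missing country, because the schema tolerates up to half of the values being absent. Deciding which errors actually matter for model quality is the harder part and is not addressed here.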

Jens: One of my students phrased our possible contributions to machine learning research nicely: “It is unlikely that we will invent the new back propagation algorithm, it is not our job; we are database experts.” So our research contribution at the intersection of both fields will still be 80% databases and 20% ML.

I started looking into this intersection three years ago when I gave a seminar on deep learning and was fascinated by the advancements in that field. My research interests shifted over the past years. So rather than running performance evaluations on large artificially generated datasets, I am now more and more playing with real datasets. That is very rewarding! We are working with massive data centers of recorded weather observations and trying to forecast the weather using machine learning and deep learning [1]. We realized that our research on scaling database technology only comes into play after everything is set up properly. Also, most data scientists, when they run into scaling issues, decide to use Apache Spark, Flink or similar DB technologies. So scalability is generally not the key challenge. What is missing is the tooling support to help data scientists create valid workflows and data pipelines. The ETL process alone can take 80% of the analysis time and effort. We can bring in our expertise here: to clean the data, to automatically map schemas, to build robust workflows, to handle updates and triggers, to handle streaming data, etc.

Arun: I can highlight three major challenges from my own research in the ADALab at UCSD. First, how do you get nicely clean training datasets with well-defined features in the first place? This involves integrating, cleaning, and organizing disparate data sources, structuring them, labeling them (if needed), and extracting/transforming features relevant for the ML models. While such issues have been studied for SQL workloads, ML workloads present novel twists and opportunities.

Second, model building, or once you have your dataset, how do you decide what prediction function you will use for your application? At the heart of this is the overarching process of ML model selection, which is almost always data scientist-in-the-loop and involves feature engineering, algorithm selection, and hyper-parameter tuning. The current paradigm of doing this at scale is ridden with both system and human inefficiency. I wrote more about this in a vision article for SIGMOD Record [2] and we are looking at tackling such issues in the top-level Triptych project [3].

Third, model deployment or after you get your prediction function, how best to integrate it with the application and oversee the model and data as the application evolves? This involves issues such as model serving at high throughput and low-latency for Web services. It also involves making it easier to deploy complex models such as deep neural networks for large-scale data analytics, e.g., as we are doing with deep CNNs in the Vista project [4].

Peter: Today, there are fundamental gaps in our understanding of some of the most critical data product and ML development tasks, from training data collection to model deployment and serving. Despite the fact that these tasks are key requirements for actually using and deploying ML in a production environment, the study of how to build systems and tools to facilitate these tasks – especially at scale – is just in its infancy. As a result, users performing many of today’s most valuable ML analytics – e.g., for prediction, recommendation, and root cause analyses – must repeatedly reinvent the wheel via expensive, bespoke, and ad-hoc data engineering and ML efforts that are often restricted to the best-funded, best-trained teams. The success of the relational database offers an existence proof of an alternative approach: building reusable, modular, and high performance data-driven tools that make analytics accessible to and cost-effective for a broad spectrum of users. As a concrete task, try building a reusable system for training data collection, model serving, or hardware-efficient inference; I believe each of these problems contains tens of PhDs worth of systems research. The broader opportunity is for the data systems community to build data-intensive tools for making each stage of the ML development and deployment life cycle more usable and more efficient. This is our goal in the Stanford DAWN project [5], a new five-year project centered around infrastructure for usable ML.

Q2. Are there data/systems challenges that could be solved by looking into the ML research? Any success stories?

Luna: To answer this question, we need to first see what is missing from non-ML solutions. I will focus on data integration and data cleaning to illustrate. As we start tackling a problem, we naturally encode our intuitions or heuristics into a set of rules, which gives us rule-based solutions. This is seldom optimal, because the rules may not be complete, may have side effects, and applying the rules in different orders may lead to different results. That is why we always call rule-based solutions ad-hoc solutions.

We thus evolve to the optimization approach: based on our understanding of a problem, we use an objective function to encode our intuition on what the optimal solution should look like, sometimes under some conditions or constraints. For example, we model the entity linkage problem as a clustering problem such that each cluster corresponds to a real-world entity, and solve the problem by minimizing some clustering metrics, such as DB-index, or the penalty used in correlation clustering. As another example, we solve the data cleaning problem by minimizing the number of changes to make the data satisfy a set of given constraints. Often solving the optimization problems requires exponential time, so we use polynomial-time approximation algorithms. The approximations are much more principled than the rule-based solutions. However, this approach is not perfect: one may have a hard time proving that the optimization goal is well aligned with the problem; for example, even after a decade of research, industry still holds suspicions about whether data cleaning with a minimal set of changes finds the correct fixes!
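As a small illustration of cleaning by minimal change, the sketch below repairs violations of a single functional dependency (for example, zip determines city). It assumes that only the right-hand-side attribute may be modified; under that assumption, setting every group to its most frequent value changes the fewest cells. The function name and the sample records are hypothetical, and general constraint-based repair is considerably harder than this special case.

```python
from collections import Counter, defaultdict


def minimal_fd_repair(rows, lhs, rhs):
    """Repair violations of the functional dependency lhs -> rhs.

    Assumption: only the rhs attribute may be changed. For each lhs value,
    every row is set to the most frequent rhs value in its group, which
    minimizes the number of changed cells under that assumption.
    Returns the repaired rows and the number of cells changed.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[lhs]].append(row[rhs])
    target = {key: Counter(vals).most_common(1)[0][0] for key, vals in groups.items()}
    repaired = [dict(row, **{rhs: target[row[lhs]]}) for row in rows]
    changes = sum(old[rhs] != new[rhs] for old, new in zip(rows, repaired))
    return repaired, changes


# Hypothetical records violating zip -> city.
records = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "94305", "city": "Stanford"},
]
repaired, changes = minimal_fd_repair(records, lhs="zip", rhs="city")
```

Here a single cell changes ("NYC" becomes "New York"). The sketch also shows why the objective can be questioned: if the majority value in a group were itself wrong, the minimal repair would confidently propagate the error.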

ML saves human effort by finding appropriate objective functions for optimization. We give a set of examples and the expected results, and the ML models find the objective functions that generate the desired solution. Recently, deep learning has even saved the effort of feature engineering. As such, problems can be solved by providing an enormous number of examples to teach the system to find the ideal solution. Arguably, this is an even more principled approach and avoids more human whim.

Alkis: In principle, any heuristic-based system component can be substituted with a machine-learned model, provided that there is sufficient input/output data for training. Take for example the problem of plan-cost estimation. This is notoriously hard to get right and systems typically employ complicated heuristics around selectivity estimation and cost factors. It is intriguing to think whether we could simplify this component through the use of ML. One may argue that this problem is an obvious candidate for ML given that imprecise outputs are acceptable: it is ok to misestimate plan cost as long as the error is not huge and that’s why it’s ok to use a machine-learned model. However, ML can also enable novel and surprising approaches in areas that have stricter semantics. Our recent work on Learned Indexes [6] demonstrates precisely this point. Specifically, we argue that B-trees, hash indexes, and Bloom filters can be viewed as models that map an input key to some output property, e.g., the position of the key in the sorted data in the case of a B-tree. In turn, this opens up the possibility of using ML to build these models and reap several important benefits: tailor the index to the specific data distribution, drastically reduce the size of the index, leverage ongoing work in ML infrastructure and hardware, and many others. Our work barely scratched the surface in this area, and I expect lots of interesting advances in the near future.
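The idea of an index as a model that maps a key to its position can be sketched in a few lines. The toy below is a simplified illustration, not the design from the Learned Indexes work [6]: it assumes numeric keys, fits a single least-squares line from key to position in the sorted array, records the worst prediction error, and then restricts a binary search to the window that this error bound guarantees. The class and method names are hypothetical.

```python
import bisect


class ToyLearnedIndex:
    """Toy learned index: a linear model predicts a key's position in a
    sorted array, and a bounded binary search corrects the prediction."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        if n == 0:
            raise ValueError("need at least one key")
        # Least-squares fit: position ~ slope * key + intercept.
        mean_key = sum(self.keys) / n
        mean_pos = (n - 1) / 2
        var = sum((k - mean_key) ** 2 for k in self.keys)
        cov = sum((k - mean_key) * (i - mean_pos) for i, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_pos - self.slope * mean_key
        # Worst-case prediction error over all indexed keys.
        self.max_err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return a position of key in the sorted array, or None if absent."""
        n = len(self.keys)
        guess = self._predict(key)
        # Any indexed key lies within max_err positions of its prediction.
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < n and self.keys[i] == key:
            return i
        return None


index = ToyLearnedIndex([3, 8, 15, 16, 23, 42, 108])
position = index.lookup(23)
```

The error bound is what preserves the strict semantics: the model may be imprecise, but a lookup for a stored key is never wrong, only slower when the window is wide. The closer the key distribution is to what the model can fit, the smaller the window becomes.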

Jens: First, we can use machine learning to replace the “humans” in the loop. If our goal is to decrease the end-to-end time to deploy a system, we need to focus on ETL. My vision is a new database system, where you can download any data and it will automatically infer the schema, primary-foreign key relationships, integrity constraints, etc., and build the database. It should work just as seamlessly as a file system. Two years ago, this vision seemed like an impossible one, but in the light of deep learning we can revisit these challenging ideas. We can also replace the database administrator in database tuning with the help of deep learning. We recently started the NoDBA project [7] and we are creating a startup, DeepTune, out of this effort [8]. Deep learning does better than the design advisory tool as it does not rely only on simple statistics, which may be off, but it can also rely on the richer raw data. So far we are only scratching the surface of what may be possible when using Reinforcement Learning to tune a database.

Second, we can use machine learning to improve components of a DB system. Query optimization is a difficult problem. We can train a neural network to compute physical plans not only by looking at cost estimates but also directly at the real data. We can also use a similar approach for index selection. However, index selection and query optimization are not the main bottlenecks anymore. Index structures are lightning fast; we can already perform 20 million operations per second on a standard hash table [9] and with advances in hardware, even the most naïve query optimization techniques work well.

Arun: Two data system areas that I think will really benefit from applying ML are query interfaces and unstructured data analytics. NLP and speech recognition are reborn in deep learning’s image. We should leverage such models to build new query interfaces using natural language or speech and enable data systems to support deeper text analytics.

Peter: I think there are tons of problems that can be solved by applying ML to systems – for example, you could probably rewrite an entire database (and write several PhD dissertations along the way) using ML-powered components. However, I wonder whether we’ll look back on the next five to ten years just as we look back today on Codd inventing the relational database, or, more recently, Jeff Dean building MapReduce and Matei Zaharia designing Spark. There is such tremendous demand for more useful tools and abstractions for ML. Moreover, designing and building these kinds of ML tools that people actually want to use is likely to dramatically expand our conception and understanding of data-intensive systems. For this reason, I think our systems research’s biggest impact on computing may come from looking outwards, not inwards: building tools for ML, as opposed to applying ML to systems.

Q3. How can we ensure that research from non-ML fields such as databases, HCI, etc., is appreciated and integrated within the ML community and vice versa?

Luna: The ML community has a joke: data scientists spend 90% of their time cleaning their data, and 10% of their time complaining about it. The database community has the mission to help them out. Indeed, because of the importance of data quality and data integration, other communities, in particular the ML communities, are starting to tackle these problems as well. To ensure our research efforts are appreciated and integrated in these communities, we should make sure that we collaborate with them and adopt the most advanced techniques to solve problems. In addition to sharpening the saws we have been using, we need to constantly enrich our tool sets.

Alkis: My experience is that members of the ML community are receptive to contributions from other communities, and in particular they recognize the expertise and experience that the database community can bring to their data-management problems. However, we need to solve relevant problems in their domain. It is tempting to take an ML problem, bring it over to the database domain and solve it in the context, assumptions, and ecosystem of a data-management system, but this can lead to a solution that cannot be transferred back to the ML domain, or even worse to a problem definition that is irrelevant for ML. We can avoid these outcomes by sanity-checking our problem definitions and assumptions with ML researchers and practitioners, so higher communication bandwidth between the two communities is key. In the other direction, it is up to us to understand the data-management problems that ML can address, and potentially engage with ML researchers to come up with meaningful solutions. My hunch is that ML researchers will be eager to work on interesting applications of ML!

Jens: A very important first step for the database community is making people understand what the different buzzwords mean. What is the difference between Big Data and Data Science? Machine Learning vs Artificial Intelligence? Every person would give a different answer. My understanding is that Data Science is this new science that tries to integrate all the different fields working with data. It is key for us to clarify that the database community represents “one third” of data science; the other thirds are machine learning and data mining. The data management aspects of data science do not end in a Jupyter notebook! Our contributions are core to the platforms that many researchers within the ML community are already using such as Apache Spark or Flink. We can think of Spark as an encyclopedia of database technologies with implementations of fundamental concepts like relational operators, grouping and co-grouping, query optimization, etc. Our contributions get lost in the buzzwords.

A second important step is for our community to understand that we are in competition with other systems that are much easier to use, but have comparable performance. The usability of database systems is a big issue that we need to tackle.

Arun: This is an important issue that I have been grappling with since my PhD days. I do not think any one big thing will make this happen in either direction for any pair of communities. It has to be a combination of sustained efforts, including creating new workshops and conference-related events, such as DEEM, which I’m co-chairing this year and XLDB, whose theme this year is “data meets ML/AI”, tutorials, events/competitions, and most importantly, interesting research papers that truly connect such areas. It is also important for people working in the relevant intersections to present and explain interesting open research challenges to the rest of the community.

Peter: We can build useful tools that users who work with ML want to use, and we can leverage these use cases to drive our research. Users and potential users of ML are clamoring for better tools, and they’re both on campus and all around us. Moreover, open source makes engagement easier than ever. However, achieving this potential will require us to engage with real users, and to get far outside our comfort zone and to consider a role beyond (just) relational databases.

Q4. What will the next big ML/Systems/Data research work be about?

Luna: Data quality and data integration are extremely important in industry but the problems are not solved yet. Part of the difficulty is that many tasks in this field cannot be easily modeled as a classification problem; instead of predicting a class, the output is about how to process data. For example, error detection, which decides if there is an error or not, is a classification problem, but error fixing, which corrects an error, can hardly be solved by classification. Recent deep learning models, in particular recurrent neural networks and program induction, allow us to learn how to “code” to fulfill our tasks, such as merging data or fixing mistakes. They provide new tools to solve data problems and may even give us a breakthrough for data cleaning.

Alkis: At this point, ML can provide interesting solutions to hard problems but it also requires heavy experimentation in order to tune several non-trivial knobs. To make ML more widely available and accessible we will have to eliminate these knobs and essentially automate their tuning. The AutoML effort at Google is already looking into this direction with very promising results. And guess what: one big knob to tune for ML is the input data, so my hunch is that data-management techniques will play a big role in these efforts!

Jens: My feeling is that we should get away from performance-oriented research and try to tackle real, challenging problems. Problems like fully automatic ETL may sound impossible to solve now but the possible gain is much higher, even if it means taking more risks.

Arun: I do not have a crystal ball. But from speaking with a diverse spectrum of practitioners, I think the next big thing in the ML, data, systems intersection is just the democratization of ML/AI-powered data analytics, i.e., making it dramatically easier and cheaper for people with different levels of expertise and different operating constraints, on say accuracy, runtime, cost, usability, etc., to use ML/AI techniques for predictive analytics tasks. From the research standpoint, we are only beginning to understand the fundamentals of this fast-changing landscape. It is an exciting frontier for new research problems and ideas. This is why I am working on this topic!

Peter: There is a huge amount of research to be done on systems and tools for facilitating the entire ML development life cycle. In DAWN, our research currently includes systems for training data generation, model serving and monitoring, compilation for parallel and heterogeneous hardware, efficient training and inference, and video analytics. Perhaps surprisingly, we have found that working on tools for ML does not mean we have to start from scratch, with an empty intellectual toolbox. Many of our favorite systems and database optimization techniques apply to ML workloads, and can even be more powerful when applied in a statistical context. For example, we have found that predicate pushdown, cost-based optimization, and eventually consistent execution shine when applied to many ML workloads. Just as relational workloads stimulated decades of research into end-to-end query optimization, systems design, and hardware-efficient execution, I believe this next wave of ML workloads holds similar – and perhaps even greater – promise for the data-intensive systems community.

Interviewee Bios

Xin Luna Dong is a Principal Scientist at Amazon, leading the efforts of constructing the Amazon Product Knowledge Graph. She was one of the major contributors to the Google Knowledge Vault project, and has led the Knowledge-based Trust project, which was called the “Google Truth Machine” by The Washington Post. She has co-authored the book “Big Data Integration”, published 70+ papers in top conferences and journals, and given 30+ keynotes/invited-talks/tutorials. She received the VLDB Early Career Research Contribution Award for advancing the state of the art of knowledge fusion, and the Best Demo award at SIGMOD 2005. She is the PC co-chair for SIGMOD 2018 and WAIM 2015, and served as an area chair for SIGMOD 2017, CIKM 2017, SIGMOD 2015, ICDE 2013, and CIKM 2011.

Alkis Polyzotis is a research scientist at Google Research, where he is currently leading the data-management projects in Google’s TensorFlow Extended (TFX) platform for production-grade machine learning. His interests include data management for machine learning, enterprise data search, and interactive data exploration. Before joining Google, he was a professor at UC Santa Cruz. He received a PhD in Computer Sciences from the University of Wisconsin at Madison and a diploma in engineering from the National Technical University of Athens, Greece.

Jens Dittrich is a Full Professor of Computer Science in the area of Databases, Data Management, and Big Data at Saarland University, Germany. Previous affiliations include U Marburg, SAP AG, and ETH Zurich. He received an Outrageous Ideas and Vision Paper Award at CIDR 2011, a BMBF VIP Grant in 2011, a best paper award at VLDB 2014, two CS teaching awards in 2011 and 2013, as well as several presentation awards including a qualification for the interdisciplinary German science slam finals in 2012 and three presentation awards at CIDR (2011, 2013, and 2015). He has been a PC member and area chair/group leader of prestigious international database conferences and journals such as PVLDB/VLDB, SIGMOD, ICDE, and VLDB Journal. He is on the scientific advisory board of Software AG. He was a keynote speaker at VLDB 2017: “Deep Learning (m)eats Databases”. At Saarland University he co-organizes the Data Science Summer School. His research focuses on fast access to big data including in particular: data analytics on large datasets, scalability, main-memory databases, database indexing, reproducibility, and deep learning. He enjoys coding data science problems in Python, in particular using the keras and tensorflow libraries for Deep Learning. Since 2016 he has been working on a start-up at the intersection of deep learning and databases.

Arun Kumar is an Assistant Professor in the Department of Computer Science and Engineering at the University of California, San Diego. He is a member of the Database Lab and CNS and an affiliate member of the AI Group. His primary research interests are in data management and data systems and their intersection with machine learning/artificial intelligence. Systems and ideas based on his research have been released as part of the MADlib open-source library, shipped as part of products from EMC, Oracle, Cloudera, and IBM, and used internally by Facebook, LogicBlox, and Microsoft. He is a recipient of the Best Paper Award at ACM SIGMOD 2014 and the 2016 Graduate Student Research Award for the best dissertation research in UW-Madison CS.

Peter Bailis is an Assistant Professor of Computer Science at Stanford University. Peter’s research in the Future Data Systems group and DAWN project focuses on the design and implementation of post-database data-intensive systems. He is the recipient of the ACM SIGMOD Jim Gray Doctoral Dissertation Award, an NSF Graduate Research Fellowship, a Berkeley Fellowship for Graduate Study, best-of-conference citations for research appearing in both SIGMOD and VLDB, and the CRA Outstanding Undergraduate Researcher Award. He received a Ph.D. from UC Berkeley in 2015 and an A.B. from Harvard College in 2011, both in Computer Science.



Blogger Profiles

Azza Abouzied is an Assistant Professor of Computer Science at New York University, Abu Dhabi. Azza’s research work focuses on designing novel and intuitive data analytics tools and on supporting complex analytics natively within databases, such as specifying and solving objective optimization problems. Her work combines techniques from various fields such as UI-design, active learning and databases. She received her doctoral degree from Yale in 2013 and BSc (CS) from Dalhousie. She spent a year as a visiting scholar at UC Berkeley. She is the recipient of an NSERC Canada Graduate Scholarships-Doctoral Fellowship, and multiple research paper awards including a SIGMOD Research Highlight Award, a best of VLDB citation and a best CHI paper award. She is also one of the co-founders of Hadapt – a Big Data analytics platform.

Paolo Papotti got his Ph.D. degree from the University of Roma Tre (Italy) in 2007 and has been an assistant professor (MdC) in the Data Science department at EURECOM (France) since 2017. Before joining EURECOM, he was a senior scientist in the data analytics group at QCRI (Qatar) and an assistant professor at Arizona State University (USA). His research focuses on data integration and cleaning and it has been recognized with awards in SIGMOD and VLDB. His work has also been patented and successfully adopted in commercial products.

Copyright @ 2018, Azza Abouzied and Paolo Papotti, All rights reserved.