, which predicts data updates by treating each cell in a table as a random variable and running inference over millions of these random variables, an approach that would break most off-the-shelf inference engines. Applying an extensive set of technologies to scale out probabilistic inference is a hard challenge. Figure 4 shows a simplified HoloClean architecture, where grounding the model, which involves (partially) materializing the possible values each random variable can take, is both an ML and a data management problem. Here, various algorithms that the database community understands well can come to the rescue. These include: (1) pruning the domain of each database cell and ranking the most probable values it can take (see the sketch below); (2) relaxing hard constraints, such as denial constraints and functional dependencies, to allow reasoning about each random variable independently; and (3) implementing the various modules using efficient distributed algorithms for joins, large matrix multiplication, caching, and many other data systems tricks.
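To make the domain-pruning idea concrete, here is a minimal, hypothetical Python sketch (not HoloClean's actual implementation): the candidate repairs for each cell of a target attribute are restricted to values that co-occur with the rest of the tuple, ranked by co-occurrence frequency and truncated to the top-k. The function name prune_domain and the toy zip/city/state table are illustrative assumptions.

```python
# A minimal sketch of domain pruning (illustrative, not HoloClean's code):
# each cell's candidate values are limited to those co-occurring with the
# tuple's other attribute values, ranked by frequency, top-k kept.
from collections import Counter, defaultdict

def prune_domain(rows, target_attr, k=3):
    # co_occurrence[(attr, value)] counts target_attr values seen with (attr, value)
    co_occurrence = defaultdict(Counter)
    for row in rows:
        for attr, value in row.items():
            if attr != target_attr:
                co_occurrence[(attr, value)][row[target_attr]] += 1

    pruned = []
    for row in rows:
        votes = Counter()
        for attr, value in row.items():
            if attr != target_attr:
                votes.update(co_occurrence[(attr, value)])
        # keep only the k most probable values as this cell's pruned domain
        pruned.append([v for v, _ in votes.most_common(k)])
    return pruned

# Example: the pruned domain for each 'city' cell, instead of all distinct cities
rows = [
    {"zip": "60608", "city": "Chicago", "state": "IL"},
    {"zip": "60608", "city": "Chicago", "state": "IL"},
    {"zip": "60608", "city": "Cicago",  "state": "IL"},  # typo to be repaired
]
print(prune_domain(rows, "city", k=2))
```

In a real system the same idea would be pushed down into the database engine (joins and group-bys over co-occurrence statistics) rather than computed in memory, which is exactly where the data management techniques mentioned above come in.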
3) Bringing humans in the loop: In supervised models, humans are typically asked to manually label training data or to verify the model's predictions for accuracy. In data cleaning, however, there isn't one human role that fits all.
– Domain experts have to communicate their knowledge through high-level languages or rules, or provide training examples for hard cases (such as rare classes or non-obvious duplicates).
– ML experts have to manage the ML models and tune the increasingly complex ML pipelines.
– Data systems administrators have to manage the large-scale source data sets, which are often scattered across servers and stored in various modalities (JSON, structured data in different engines, text, etc.).
– Data cleaning experts have to help translate these rules and reason about the various signals that will contribute to finding and repairing errors.
– Data scientists have to control things like the required sample size for their predictions, the right data set to use, and the amount of cleaning they require based on the sensitivity of their analytics task to dirty data.
For any deployable cleaning system, all these human roles need to work together via well-specified interaction points with the cleaning tools.
Figure 4: Grounding machine learning models for cleaning is a scalability bottleneck
Luckily, the data community has been increasingly investing in proper and principled ways to bring humans in the loop for data management tasks, creating bridges between the DB and HCI communities. Examples include the Falcon project for scaling crowdsourcing for entity matching and deduplication; Tamr's heavy use of multiple user roles to guide schema mapping and record linkage; and visualization systems such as Zenvisage for visualizing and interacting with large data sets. Integrating these intuitive user interactions for exploration holds great potential to better empower and guide data cleaning systems.
In summary, human-guided scalable ML pipelines for data cleaning are not only needed but might be the only way to provide complete and trustworthy data sets for effective analytics. However, several challenges remain. I discussed a few in this post, but the list is by no means complete. While I argued that we should stop treating data cleaning as a set of isolated database problems, I am also advocating that database techniques and research are essential to provision functional ML pipelines for cleaning. Of course, this post is not the first to make this general observation. Recent tutorials, such as the one by Google at SIGMOD 2017 on “Data Management Challenges in Production Machine Learning” and the one at SIGMOD 2018 on “Data Integration and Machine Learning,” highlight the growing awareness of the various ways data management is used in ML pipelines and how ML can help with some hard data management problems. I think data cleaning is the poster child for ML/DB synergy.
Ihab Ilyas is a professor in the David R. Cheriton School of Computer Science at the University of Waterloo, where his main research is in the area of data management, with special interest in big data, data quality and integration, managing uncertain data, and information extraction. Ihab is also a co-founder of Tamr, a startup focusing on large-scale data integration and cleaning. He is a recipient of an Ontario Early Researcher Award (2009), a Cheriton Faculty Fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014). He is an ACM Distinguished Scientist. Ihab is also an elected member of the VLDB Endowment board of trustees, the elected vice chair of ACM SIGMOD, and an associate editor of the ACM Transactions on Database Systems (TODS). He is also the recipient of the Thomson Reuters Research Chair in Data Quality at the University of Waterloo. He received his PhD in computer science from Purdue University, West Lafayette.
Copyright © 2018, Ihab Ilyas. All rights reserved.