October 9, 2018
After being largely neglected in the rush to capitalize on the promise and potential of Big Data, data privacy and data stewardship issues have resurfaced in industry with a vengeance over the last year. This has been driven in part by increased scrutiny from regulatory bodies around the world and the legislation that has followed, and in part by high-profile data breaches and instances of unethical data usage. Addressing these problems requires new tools, techniques, and systems, spanning a broad range of topics including statistical privacy, secure data systems, metadata management, provenance, model lifecycle management, versioning, and ethics. At the same time, this raises an interesting possibility of fundamentally rethinking the end-to-end data lifecycle. However, despite early pioneering work and the central role of “data management”, we see insufficient work in our community addressing these issues. We summarize some of the key driving forces behind this movement and potential research directions for the database community.
Responsible data stewardship has largely been an afterthought over the last decade, as new techniques and tools were rapidly developed to make it easier to manage and extract insights from large volumes of data. Increased awareness of the extent of personal data being collected by often unknown entities like data brokers (cf. the FTC report “Data Brokers: A Call for Transparency and Accountability”, 2014), along with the widespread use of data science to make life-altering decisions, has slowly brought these issues to the forefront over the last few years.
Two prominent recent high-profile events are the Equifax data breach and the Facebook Cambridge Analytica scandal. The former was a run-of-the-mill data breach, but at a very large scale and with fine-granularity sensitive information disclosed. The latter was not really a data breach per se. To briefly recap, a company called Cambridge Analytica purchased a dataset from a researcher, Dr. Aleksandr Kogan, who had built a “Personality Test App” on the Facebook platform. That app was used by some 270,000 people, and it downloaded not only the data of those users but the data of their friends as well. This data included detailed profile information, including the “likes” on social media posts, for millions of users. The last bit of information is especially personal and valuable, as it is highly predictive of a user’s preferences and personality traits. The data was subsequently used for political micro-targeting during the 2016 US elections, elevating this issue to public attention.
The push towards responsible and accountable use of data has been slower in gaining momentum. However, the rapid growth in the use of AI techniques (broadly construed here to mean everything from simple decision-making algorithms to deep learning) has brought increasingly greater scrutiny, especially when they are used to make important, life-altering decisions about hiring, admissions, prison sentencing, and the like. Many studies have shown that the decisions made by such techniques reflect systematic bias and often adversely affect protected and minority groups (disparate impact). The decisions are also frequently opaque; sometimes this is because the algorithm designers are unwilling to provide sufficient access, but there are also serious technical challenges arising from the use of complex algorithms like deep neural networks. (“Weapons of Math Destruction” by Cathy O’Neil is an excellent introduction to these topics.) This has led ACM to put out a Statement on Algorithmic Accountability and Transparency, which lays out a set of principles that should be addressed during system development and deployment to minimize potential harms. A new ACM conference (ACM FAT*) is now being held regularly to bring together researchers working on these topics.
Regulatory bodies and governments around the world are scrambling to come up with new regulations to address these data privacy issues. The European Union General Data Protection Regulation (GDPR) is perhaps the most comprehensive to date; it came into effect on May 25, 2018, with very large potential fines for non-compliance. Several countries around the world have adopted, or are in the process of adopting, regulations similar to GDPR. California enacted its own version, the California Consumer Privacy Act (CCPA), a few months ago; it will take effect in 2020. There is also a push for a federal law in the US to avoid a patchwork of regulations across the country, with the Commerce Department recently issuing a Request for Comments and several Senate hearings scheduled.
Most of these regulations are based on fundamentally similar principles:
1. Companies collecting, storing and processing personal data must be transparent about what data they are collecting, how they are using it, and who they are sharing it with; and they must be held accountable in case something goes wrong. This is irrespective of whether they received the data directly from the individuals or through another company.
2. Individuals should have control over their own data being held by companies. In particular, they should be able to access, correct, download, or delete their data if they wish (so-called “data subject access requests” under GDPR), and they should be able to easily withdraw consent for any processing activity at any point.
3. Companies should practice “privacy by design”, with one key tenet being data minimization; in other words, they should collect only the minimal personal data sufficient for their purposes, and provide access to that data only to the entities that need it. Companies should also practice anonymization and pseudonymization to the extent possible.
4. Companies should undertake adequate security measures for protecting the personal data and properly document those measures.
Different regulations differ somewhat in the details. For example, GDPR has strong requirements around the use and explainability of automated, AI-driven decision-making algorithms (although there is still some debate about their practical implications). In particular, GDPR gives individuals a right to meaningful human review of significant decisions made solely by automated means. CCPA, on the other hand, requires detailed documentation and tracking of onward transfers of data to third parties or partners, especially for data that is sold. Different regulations also differ in how much they emphasize “consent”, and whether that consent is “opt-in” or “opt-out”. In general, any multinational company is likely to have to comply with the union of all these regulations.
A key recent development, with far-reaching consequences, is the redefinition of “personal data”. In the US, the focus for a long time has been on “personally identifiable information” (PII), typically construed to include name, social security number, date of birth, address, and so on. However, it has been well known, especially in our community, that stripping those attributes from the data (or masking them) does not render it anonymous. Both GDPR and CCPA take a much broader view of personal information, to include any data that can conceivably be linked to an individual. Given the impossibility of anticipating what de-identification attacks may be possible in the future and what side information may be available to aid those attacks, it is considered safe to assume that any data collected about an individual is personal, unless it is heavily aggregated.
The crucial operational need that arises out of these regulations is the ability to understand: (a) how and where personal data is stored (data inventory), (b) for what purposes it is used (context), (c) how it flows across different units and services within an organization (dataflow maps), and (d) how it flows across different organizations (onward transfer/provenance). Another key requirement is the ability to classify data into different categories of information, which often carry different levels of “sensitivity” (data classification). All of these are fundamental to being able to “show” compliance through documentation (GDPR alone has about 40 different articles that require some evidence and documentation). Further, this information increasingly needs to be captured at a fine, individual-level granularity to answer different types of data access, erasure, or correction requests, and for auditing purposes.
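As a minimal, purely illustrative sketch of the kind of fine-grained metadata such an inventory might capture, consider the hypothetical record below (the field names are our own, not drawn from any regulation or product):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DataInventoryEntry:
    """One hypothetical entry in a fine-grained data inventory."""
    dataset: str                    # where the data lives, e.g., a table or object-store path
    data_categories: List[str]      # data classification, e.g., contact info, purchase history
    sensitivity: str                # e.g., "personal", "sensitive", "non-personal"
    purposes: List[str]             # why the data is processed (context)
    internal_consumers: List[str]   # services or teams that read it (dataflow map)
    external_recipients: List[str]  # third parties it is transferred to (onward transfer)
    retention_days: int             # how long it is kept before deletion

entry = DataInventoryEntry(
    dataset="orders_db.customers",
    data_categories=["contact_info", "purchase_history"],
    sensitivity="personal",
    purposes=["billing", "marketing_analytics"],
    internal_consumers=["recommendation_service"],
    external_recipients=["email_marketing_vendor"],
    retention_days=365,
)
```

Even a simple schema like this is hard to populate and keep current at enterprise scale, which is exactly the difficulty discussed next.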
However, even the most basic information here is very difficult to obtain in any medium-to-large enterprise today, much to the surprise of most privacy professionals and regulators (but not, we suspect, of the readers here). Fast-paced software development practices and a lack of sufficient central coordination have resulted in a situation where constructing a reasonably complete data or compute inventory is itself a big and lengthy project. Data transfers across organizations, through publicly available APIs or ad hoc mechanisms like secure file transfers, are especially difficult to track and reason about. In many cases today, personal data may be transferred to a third-party partner network (for marketing analytics, A/B testing, etc.) without the company ever seeing the data. Another challenge is correlating an individual’s data across a multitude of operational systems, possibly across organizational boundaries (entity resolution).
In our opinion, this is a new, wide-open, and fertile ground for data management research. The key questions here revolve around how different data storage, processing, and middleware systems are used in conjunction to achieve specific business goals end to end. Solving these in a principled manner requires rethinking the end-to-end data lifecycle, and potentially developing new data-centric systems: can we design data management systems that enable collaboration and sharing while retaining full control and auditability, without a significant performance hit? A promising research direction here is to explore “taint tracking”, a popular security mechanism for tracking data-flow dependencies, both in high-level languages and at the machine-code level. We also need to revisit the trend towards using a different special-purpose system for each use case; the resulting duplication of data not only makes it harder to guarantee compliance, but also makes it a challenge to handle data deletion or correction requests, and increases the risk of data breaches. We also need to take a careful look at the increasingly prevalent use of APIs (which can be thought of as “views”) and at how to properly manage a large number of APIs talking to one another.
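As a deliberately simplified illustration of the taint-tracking idea in this setting (real systems propagate labels inside the language runtime or at the machine-code level; the wrapper class and policy check below are our own sketch):

```python
class Tainted:
    """A value carrying a set of taint labels (e.g., data categories or source systems)."""
    def __init__(self, value, taints):
        self.value = value
        self.taints = set(taints)

    def __add__(self, other):
        # Taint propagates through computation: the result carries the labels
        # of every input that influenced it.
        if isinstance(other, Tainted):
            return Tainted(self.value + other.value, self.taints | other.taints)
        return Tainted(self.value + other, self.taints)

def export_to_partner(record):
    # A hypothetical policy check at an organizational boundary.
    if isinstance(record, Tainted) and "personal" in record.taints:
        raise PermissionError("personal data cannot be exported without a documented basis")
    return record.value if isinstance(record, Tainted) else record

salary = Tainted(90000, {"personal", "hr_db"})
bonus = Tainted(5000, {"personal", "payroll_db"})
total = salary + bonus            # carries {"personal", "hr_db", "payroll_db"}
# export_to_partner(total)        # would raise PermissionError
```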
Capturing and querying provenance and context are also closely related research areas since several of the questions above are about one or the other. Systems that can automatically track provenance and/or context without significant user involvement, especially across organizational boundaries, will not only reduce the burden on data scientists or engineers but also make it much easier to provide transparency into how automated decisions are made. Ground (Berkeley) and ProvDB (UMD) are among a handful of recent research projects that are focused on this problem. Managing large volumes of such provenance and context information, understanding how to present it to a user, and analyzing it for privacy-specific insights, remain largely open questions.
Several types of data that are routinely mined for user behavior, such as location data, user profiles, and consumer purchase histories, are deemed personal data under the new regulations. Therefore, techniques like de-identification can be powerful tools for minimizing the impact of breaches and for allowing companies to safely monetize their data. However, there is a rich literature showing that traditional techniques for de-identifying data (e.g., the data swapping and suppression used by federal agencies; the pseudonymization used in the Netflix and AOL data releases; or the stripping of PII used in the medical context to render data HIPAA compliant) provide little or no guarantee of privacy.
The database community has been a leader in the field of private data analysis and dissemination, starting with the early work on k-anonymity and l-diversity and leading to recent work on algorithms with strong, provable guarantees of differential privacy. This is an especially exciting time for research in differential privacy as:
(a) it is increasingly being accepted in academic circles as a gold standard for privacy, and
(b) several organizations like the US Census Bureau, Apple, Google, and Uber have recently deployed data products that use differential privacy, either to generate synthetic data derived from their sensitive databases or to perturb each individual’s data before analyzing it (a minimal illustration of the basic mechanism follows).
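To make the guarantee concrete, here is a minimal sketch of the classic Laplace mechanism for a counting query, one basic building block of such deployments (the real systems above use considerably more sophisticated algorithms):

```python
import numpy as np

def dp_count(records, predicate, epsilon):
    """Differentially private count. A counting query has sensitivity 1
    (adding or removing one individual changes the count by at most 1),
    so adding Laplace noise with scale 1/epsilon satisfies epsilon-DP."""
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: a noisy answer to "how many users are over 40?" with budget epsilon = 0.1.
users = [{"age": 25}, {"age": 47}, {"age": 62}, {"age": 33}]
print(dp_count(users, lambda u: u["age"] > 40, epsilon=0.1))
```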
Nevertheless, widespread adoption of differential privacy will require several open questions to be solved. The foremost among them is the lack of system-level tools for differential privacy. Each of the deployments mentioned above required a team of experts in differential privacy to design the privacy algorithms, for a number of reasons.
First, differential privacy was originally defined for single flat tables. Extending it to relational databases requires specifying and reasoning about the privacy requirements of multiple entities and their relationships, allowing a mix of public and private tables, and understanding the impact of database constraints (foreign keys, inclusion dependencies) on the privacy guarantee.
Second, for any given task there are several known differentially private algorithms with good accuracy, but there are no reference implementations or benchmarks. In fact, the first benchmark study, DPBench, showed that for the task of answering counting queries in one or two dimensions, there is no single best algorithm in terms of accuracy. To complicate matters, accuracy depends on the characteristics of the data, which are themselves private. Moreover, algorithms that provide state-of-the-art accuracy for a data analysis workflow often modify the workflow in complex ways to achieve that accuracy. Proving that these algorithms satisfy differential privacy is non-trivial, and there are several known examples of hand-written proofs turning out to be wrong.
Thus, an important research challenge is designing systems that permit exploration of data through a declarative query interface but, under the hood, automatically generate differentially private algorithms (together with a proof of privacy). Recent work in the database community on frameworks for authoring privacy policies, like Blowfish, and on usable systems for authoring provably differentially private algorithms, like Ektelo and Pythia, is a step in the right direction, but more work is needed.
A key aspect of the new regulations is to describe and enforce limits on data access so that it is adequate, relevant, and limited to what is necessary. Data categorization (as sensitive or non-sensitive), access control, and encryption are important tools for solving this problem. While in traditional settings it is fairly easy to determine whether a table or an attribute in a relational schema contributes personal information, this is not so easy in newer data applications. For instance, in a smart building, one might encounter a table that logs when smart light sensors turn on or off. On its own, this does not track any sensitive information. However, when joined with information about the locations of the sensors in the building and the assignments of employees to cubicles, the data can be used to infer when an employee is in the office, when they took a bathroom break, and so on, thus becoming personal (and often sensitive). Thinking about sensitivity and access control in the context of data fusion is an important research challenge.
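A toy version of the smart-building scenario (with hypothetical table layouts) shows how innocuous-looking sensor logs become personal after fusion:

```python
# Each table is innocuous on its own; the join links sensor events to a person.
sensor_log = [   # when each light sensor fired (no identifiers at all)
    {"sensor_id": "S17", "event": "on",  "time": "2018-10-09 07:58"},
    {"sensor_id": "S17", "event": "off", "time": "2018-10-09 18:12"},
]
sensor_locations = {"S17": "cubicle-4B"}        # facilities data
cubicle_assignments = {"cubicle-4B": "alice"}   # HR data

# After fusion, the log reveals when a specific employee arrived and left.
for row in sensor_log:
    cubicle = sensor_locations[row["sensor_id"]]
    employee = cubicle_assignments[cubicle]
    print(employee, row["event"], row["time"])
```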
With growing data demands, many small- and medium-sized organizations outsource their data processing to third-party servers. These organizations are liable for the use of personal data even if they never see the data (as they control and orchestrate the data collection and processing). Thus, an important way to limit their liability is to outsource the data in encrypted form, such that third parties can store and process it without ever seeing it in the clear. With rapid advances in fully homomorphic encryption, this is becoming more and more achievable. Several systems have arisen in the space of encrypted databases, starting with the early work by Hacigumus et al., continuing with research prototypes like CryptDB, and leading to deployed products like Microsoft’s Always Encrypted database engine (based on the Cipherbase system). However, there are well-known fundamental limits on what can be done efficiently on encrypted data. In particular, there are negative results showing that joins on encrypted data cannot be done efficiently unless some information about the underlying data is leaked to the third party processing it. So all of the aforementioned systems either use weaker forms of encryption (like order-preserving or order-revealing encryption) and/or leak the frequencies of attribute values. There is a growing literature on inference attacks that reconstruct the database from the frequency and ordering information leaked by these systems. To counter such attacks, a rising research area composes differential privacy with encryption and secure computation. Differential privacy provides a strong theoretical guarantee that the leaked information does not allow adversaries to reconstruct individual records in the database, thus enabling efficient joins as well as security. While some initial ideas, like Shrinkwrap, have been proposed in this direction, they need to be tested in real systems.
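The kind of leakage at issue here is easy to see in a toy example. Deterministic encryption, sketched below with a keyed hash purely for illustration (this is not a secure construction, and real systems use proper deterministic or searchable encryption schemes), lets the server evaluate equality predicates and joins, but it also exposes the exact frequency of each value, which inference attacks can exploit:

```python
import hashlib
from collections import Counter

KEY = b"illustration-only-key"   # a real system would use a proper encryption scheme

def det_encrypt(value):
    # Equal plaintexts map to equal ciphertexts, so the server can join and filter...
    return hashlib.sha256(KEY + value.encode()).hexdigest()[:16]

diagnosis_column = ["flu", "flu", "flu", "cancer", "flu", "hiv"]
encrypted_column = [det_encrypt(v) for v in diagnosis_column]

# ...but it also sees the exact frequency histogram of the column, which,
# combined with background knowledge (e.g., disease prevalence), can be
# enough to re-identify the underlying values.
print(Counter(encrypted_column))
```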
A novel aspect of GDPR (and several other regulations) is an individual’s right to erasure (often called the “right to be forgotten”). Scouring a database for mentions of an individual among billions of records and files (e.g., tags of that individual in photos) and deleting them poses performance challenges for databases, if it is not an entirely new piece of functionality altogether. Third-party data transfers add another wrinkle, and can require detailed tracking of how individual data items were transferred through third-party networks. While the law was articulated with search engines like Google in mind, there are important definitional questions around the erasure of derived data. For instance, if a system creates a record by joining two different records owned by two different parties, would an erasure request from one party require the derived record to also be deleted? Also, there is much recent work on membership attacks showing that models built by ML algorithms (like deep neural networks) contain information about individual records that appear in the training set; such attacks aim to determine whether a given record was part of the training data. Thus, would an individual’s contribution to statistics and machine learning models also need to be “erased”? This is a new and challenging problem, and again (not surprisingly) differential privacy could be a defense, as it hides the presence or absence of individual records while still allowing models to be learned from the data.
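To illustrate just one possible policy for the derived-data question (by no means the one the regulation mandates; the record names and structure below are hypothetical), a provenance graph could be used to cascade erasure from a source record to everything computed from it:

```python
# Hypothetical store and provenance graph: each derived record points to the
# source records it was computed from.
store = {
    "companyA_rec_1": {"name": "alice", "zip": "20742"},
    "companyB_rec_7": {"purchases": 12},
    "joined_rec_9":   {"name": "alice", "purchases": 12},
}
provenance = {"joined_rec_9": {"companyA_rec_1", "companyB_rec_7"}}

def erase(record_id):
    """One possible policy: delete the record and, transitively, every record derived from it."""
    store.pop(record_id, None)
    for derived, sources in list(provenance.items()):
        if record_id in sources:
            provenance.pop(derived, None)
            erase(derived)   # cascade to records computed from the erased one

erase("companyA_rec_1")
print(store)   # "joined_rec_9" is gone as well; "companyB_rec_7" remains
```

Whether such cascading is required, and how far it should extend into aggregates and models, is precisely the open definitional question.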
Auditing, access control, logging, and versioning are other important mechanisms to ensure that controllers and processors of data are held accountable. To briefly elaborate:
1. There has been much prior work in the database community on auditing computation, either to verify that access-control policies are not violated or to ensure that individual records are not revealed. One new piece of functionality that databases may need to support is purpose-specific auditing, where computation on a set of records is governed by a human-readable or machine-readable purpose (see the sketch after this list).
2. Similarly, although databases typically feature rich access-control mechanisms, those mechanisms generally cease to apply once the data leaves the database. New access-control mechanisms are needed that span the entire lifecycle of a data item, especially as it moves across organizational boundaries (e.g., when data is shared with partners or researchers).
3. Detailed logging and data versioning are both essential to be able to properly conduct forensic audits. Although there is recent work on data versioning (e.g., DataHub), much work still needs to be done to make it a first-class construct in data management systems.
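As a very rough sketch of the purpose-specific auditing idea from item 1 (all names, purposes, and policies here are hypothetical), every read could be required to declare a purpose, which is checked against the purposes the individual has consented to and recorded in an append-only audit log:

```python
import datetime

# Hypothetical per-user consent: the purposes each user has agreed to.
consented_purposes = {"user42": {"billing", "fraud_detection"}}
audit_log = []   # append-only log of every access attempt, for later forensic audits

def fetch_from_storage(user_id):
    # Stand-in for the actual storage layer.
    return {"user": user_id, "plan": "premium"}

def read_record(user_id, requester, purpose):
    allowed = purpose in consented_purposes.get(user_id, set())
    audit_log.append({
        "time": datetime.datetime.utcnow().isoformat(),
        "user": user_id,
        "requester": requester,
        "purpose": purpose,
        "allowed": allowed,
    })
    if not allowed:
        raise PermissionError(f"{requester} may not read {user_id}'s data for {purpose}")
    return fetch_from_storage(user_id)

read_record("user42", "billing_service", "billing")    # allowed, and logged
# read_record("user42", "ads_service", "marketing")    # would raise, and still be logged
```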
Our goal with this article was to summarize what we see as some of the most pressing questions faced by industry today around transparent and accountable use of data. We welcome any comments or thoughts you have.
Amol Deshpande is a Professor in the Department of Computer Science at Maryland, and Co-Founder and Chief Scientist at WireWheel, Inc., which is building a comprehensive platform to help companies comply with regulations for data privacy including the EU GDPR, CCPA, and others. His current research interests include graph data management, data privacy, provenance, versioning, and collaborative data platforms.
Ashwin Machanavajjhala is an Associate Professor in the Department of Computer Science, Duke University. Previously, he was a Senior Research Scientist in the Knowledge Management group at Yahoo! Research. His primary research interests lie in algorithms for ensuring privacy in statistical databases and augmented reality applications. In collaboration with the US Census Bureau, he is credited with developing the first real-world data product, OnTheMap, with provable guarantees of differential privacy.
Copyright © 2018, Amol Deshpande, Ashwin Machanavajjhala. All rights reserved.