January 20, 2021
The third decade of the third millennium had a dramatic start, with a pandemic that impacted lives all over the globe. It has been a time of extraordinary events and rapid change, ranging from politics and climate change to our social lives and work habits. Like it or not, we are getting used to new routines such as working from home, ordering food and clothes online, and even visiting our loved ones online. Recent technological advancements, coupled with Big data and advanced algorithms, have amplified this shift even further.
As data science and its technologies put down more roots in our lives, their drawbacks and potential harms become more evident. We all have undoubtedly faced, discussed, or can at least expect to hear many of these concerns in our daily lives and during our research. A major challenge is that these issues are interdisciplinary, not the kind of traditional technical problems we have been trained for.
On the other hand, due to the lack of clarity and the confusion associated with social concepts, many of us may prefer to stay in our cozy corner of building algorithms and systems, rather than getting involved with such interdisciplinary challenges. One way or the other, somebody should be responsible for the unprecedented impacts of the “creatures” we build for joy, for making money, or even with the intention of making the world a better place, and as a wise man once said, “if not us, then who?”
Fortunately, acknowledging the importance of responsible data science, different computer science research communities have taken the issue seriously: (a) a fast-growing number of publications in major CS conferences, including several best papers [13,18,20], are on responsible data science, and (b) a new conference, ACM FAccT, has been dedicated to this topic. In database venues alone, there have been publications, keynotes, tutorials, and panels on responsible data science in SIGMOD, VLDB, and ICDE. In addition, multiple companies have introduced open-source toolkits for integrating and standardizing these algorithms; one example is IBM’s AI Fairness 360 Open Source Toolkit (AIF360).
Despite the extensive effort of the past few years, at least in my opinion, the push for responsible data science has not been triumphant: it has not significantly impacted the practice of data science in the real world. In this article, I would like to investigate the reasons for this and propose a resolution, while underscoring the critical role of data management for responsible data science.
The major components that shape data-driven decision systems are data, algorithms, and human experts (data scientists). A model or data-driven algorithm is developed by data scientists, who often use code and packages that require minimal knowledge and effort and come with default and, sometimes, (semi-)automatic parameter tuning. Data is undoubtedly the cornerstone of any data-driven algorithm. It is known that “an algorithm is only as good as the data it works with” [7]. Yet data scientists often use “found data”, over whose collection they have limited or no control.
In my view, enabling responsible data science in practice requires a pipeline of user-friendly means for (1) data preparation, (2) exploratory algorithm design, and (3) generating fitness-for-use signals. Data management is necessary for enabling this pipeline. This critical role has been recognized for more established aspects of responsible data science such as privacy (read this blog post for more details). However, relatively recent and less settled topics such as algorithmic fairness are still at a preliminary stage. While acknowledging the different aspects of responsible data science, including stability/robustness, transparency, equity, trust, and accountability, in the rest of the article I take fairness as the key topic; similar arguments can be made for the other aspects. I skip reviewing the definitions and basics, and refer readers to this video, this and this blog post, this book, and CACM’s article for more on algorithmic fairness. Still, I shall use the following example to intuitively explain some of the concepts and challenges.
Probably the best example to consider is the well-known and controversial COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) model, developed by Northpointe, Inc. (now Equivant). Used at different stages of the judicial system, such as setting bail, COMPAS assigns each defendant a recidivism score that estimates the likelihood that they will re-offend. The recidivism scores were heavily criticized by ProPublica for being unfair, as it turned out that Black defendants were more likely to be classified as high-risk (I suggest taking a look at ProPublica’s article, if you haven’t before).
In response, Northpointe argued that COMPAS is fair as it provides similar accuracy and predictive parity across the two races. The US court also defended the scores by pointing to the parity of false positive/negative rates between the two demographic groups. In the end, the Wisconsin Supreme Court ruled in favor of continuing to use COMPAS.
Data is arguably the most important component of data-driven systems. To highlight this in the context of responsible data science, let us begin the section by considering the impossibility theorems [9] and trade-offs [16] between different fairness measures. These theorems are well known and widely used to illustrate that not all fairness measures can be satisfied at the same time. For instance, even though the COMPAS scores are fair based on accuracy, predictive parity, and false-positive/negative rates, they do not satisfy demographic parity. For a better understanding of these trade-offs (and fairness in ML in general), I suggest watching this video by Chouldechova and this video by Kleinberg, or taking a look at this blog post (Section 4.7).
Looking at the proofs of the impossibility theorems, one can notice that they assume that the data/environment is “biased”. That is, for example, the target variable (class label) is not independent of the sensitive attributes (e.g., race), or the data has unequal base rates (i.e., the proportion of positive labels differs across demographic groups). The theorems do not hold if these assumptions are not valid; in other words, it is not impossible to satisfy all fairness measures when there is no “bias” in the data. Let me use a simple toy example to show this: consider a dataset with two demographic groups, blue and red. Suppose every red (resp. blue) point has a blue (resp. red) counterpart with the same values on all other attributes except the color (to be more precise, it is not possible to predict the color of a point with probability higher than 0.5). In this toy example, any model satisfies all measures of group fairness, because for every red (resp. blue) point there is a blue (resp. red) equivalent with the same values and the same model outcome.
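To make the toy example concrete, here is a minimal sketch in Python that builds such a paired dataset and verifies, for an arbitrary model that ignores color, that the selection rate, false-positive rate, and false-negative rate are identical for the two groups. The column names and the threshold rule are hypothetical, used only for illustration.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)

    # Build a "paired" dataset: every red point has a blue twin that is
    # identical on every other attribute (x1, x2) and on the true label y.
    n = 1000
    base = pd.DataFrame({
        "x1": rng.normal(size=n),
        "x2": rng.normal(size=n),
        "y":  rng.integers(0, 2, size=n),
    })
    data = pd.concat([base.assign(color="red"), base.assign(color="blue")],
                     ignore_index=True)

    # An arbitrary model that does not look at color gives paired points
    # the same outcome.
    data["y_hat"] = (data["x1"] + 0.5 * data["x2"] > 0).astype(int)

    for grp, d in data.groupby("color"):
        selection_rate = d["y_hat"].mean()
        fpr = d.loc[d["y"] == 0, "y_hat"].mean()       # false-positive rate
        fnr = 1 - d.loc[d["y"] == 1, "y_hat"].mean()   # false-negative rate
        print(grp, round(selection_rate, 3), round(fpr, 3), round(fnr, 3))

The printed rates match exactly across the red and blue groups, so demographic parity, equal false-positive/negative rates, and the other group-fairness measures hold simultaneously for this unbiased data.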
If data is not biased, data-driven decision making becomes even more valuable, as it removes human bias and looks at the data dispassionately. However, data, especially social data, is almost always biased, as it inherently reflects historical biases and stereotypes [19]. Besides, data collection and representation methods often introduce additional bias. A popular case is selection bias, where the sources from which data is collected do not provide an unbiased and representative view of society. For example, HP used images of its (mostly white) engineers to train face-detection software that failed to detect Black faces; Nikon failed to include enough East Asians in the data used to train its cameras’ “blink detection” software, which classified many (naturally narrow) Asian eyes as closed even when they were open; and an early image-recognition algorithm released by Google had not been trained on enough dark-skinned faces and labeled an image of two African Americans as “gorillas”. One can imagine similar issues when using publicly available data from sources such as Twitter.
Using biased data without paying attention to societal impacts can create a feedback loop and even amplify discrimination in society. The fact that data-driven algorithms and ML models are “stochastic parrots” (as Timnit Gebru and her co-authors put it), reflecting the biases in their data, accentuates the need for curating data, a direct role of data management in responsible data science.
Existing work on data preparation and preprocessing techniques for algorithmic fairness includes [8,10,15]. One of my favorites among these is [10], which builds on the assumption that if the demographic groups cannot be predicted from the training data, models trained on it will not have disparate impact. A major work by the data management community is the SIGMOD’19 best paper [20], which formulates the problem as a causal database repair problem, proving sufficient conditions for fair classifiers in terms of admissible variables. [6,17] propose the notion of coverage to ensure that there are enough samples in the dataset for demographic subgroups (e.g., Hispanic females). I should also mention the interesting topic of debiasing object representations (aka embeddings), about which you can read further here.
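As a rough illustration of the intuition behind [10] (not their actual repair procedure), the following sketch checks how well the protected attribute can be predicted from the remaining features: a balanced accuracy close to 0.5 suggests the training data is unlikely to induce disparate impact, while a high score flags the need for repair. The file name, column names, and threshold are assumptions for this sketch.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Hypothetical training table with numeric features, a binary protected
    # attribute "race", and a class label "label".
    df = pd.read_csv("train.csv")
    X = df.drop(columns=["race", "label"])   # non-sensitive features only
    s = df["race"]                           # protected attribute

    # If "race" cannot be predicted from X much better than chance, models
    # trained on X are unlikely to exhibit disparate impact (the intuition of [10]).
    clf = LogisticRegression(max_iter=1000)
    score = cross_val_score(clf, X, s, cv=5, scoring="balanced_accuracy").mean()

    print(f"predictability of the protected attribute: {score:.2f}")
    if score > 0.6:   # arbitrary threshold for this sketch
        print("warning: the features encode group membership; consider repairing the data")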
Existing work on data preparation for responsible data science is at an early stage, and strong efforts are required to revisit every step of data management and the life cycle of data [14], developing proper tools, strategies, and metrics. The following is a sample of the long list of data management research required for responsibly preparing data, taken from our recent tutorial [3]:
– Correcting bias in data. Mitigating bias through the pipeline of data preparation is a necessary step towards algorithmic fairness. The database community has a lot to offer here, given its expertise in data extraction, annotation, discovery, cleaning, integration, crowd-sourcing, aggregation, outlier detection, etc. Removing bias from the input data can be viewed as a special case of data cleaning where the goal is to replace, modify, or delete problematic tuples or values that cause bias (a minimal sketch of this view appears after this list).
– Data representation. Representation choices are critical design decisions, traditionally approached with performance as the central objective. These decisions can also impact fairness. For example, bucketization choices can lead to very different analysis results.
– Integration into databases. Despite the extensive efforts within the database community, there is still a need to integrate concepts such as fairness into database query processing and data management systems, and to add declarative functions for them to SQL.
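As one concrete instance of treating bias mitigation as a data-preparation step (the first item above), here is a minimal sketch of reweighing in the spirit of [15]: each tuple receives a weight so that group membership and the class label become statistically independent in the weighted data. The column names in the usage note are hypothetical.

    import pandas as pd

    def reweigh(df: pd.DataFrame, group_col: str, label_col: str) -> pd.Series:
        """Return per-tuple weights that make group_col and label_col
        statistically independent in the weighted data (in the spirit of [15])."""
        n = len(df)
        p_group = df[group_col].value_counts(normalize=True)
        p_label = df[label_col].value_counts(normalize=True)
        p_joint = df.groupby([group_col, label_col]).size() / n

        # weight = expected (independent) probability / observed probability
        def weight(row):
            g, y = row[group_col], row[label_col]
            return (p_group[g] * p_label[y]) / p_joint[(g, y)]

        return df.apply(weight, axis=1)

    # Hypothetical usage: the weights can be passed to most learners' sample_weight.
    # df["w"] = reweigh(df, group_col="race", label_col="label")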
A basic assumption in data-driven systems is that the data is an iid sample collected according to the environment’s underlying data distribution. A major reason for bias in data is the underlying biases due to historical discrimination, false stereotypes, and racist policies such as redlining. Any intervention to remove these biases deviates from the iid-sample assumption and hence affects the performance of the final algorithms. This trade-off makes it hard to acquire completely unbiased data. Even then, an algorithm that looks fair according to the collected data may not be fair in practice. As a result, a major portion of research on algorithmic fairness has been either on algorithm modification (in-process) or on post-processing algorithm outputs to make them fair [11].
In-process and post-process techniques have their own challenges: (i) the performance of the algorithms after post-processing is questionable, and (ii) in-process techniques often have efficiency and scalability issues since their optimization problems are non-convex or NP-complete. Still, the main challenge of responsible data science is the lack of clarity of its concepts, which leads to trade-offs between different measures (e.g., between model performance and fairness, between individual fairness and group fairness, and between different notions of group fairness). One cannot expect (ordinary) data scientists to understand and handle such challenges without handy, user-friendly toolboxes.
On top of these complexities and confusion, there is a lack of guidelines from legal authorities. While the EU’s General Data Protection Regulation and California’s Consumer Privacy Act require compliance within their footprints, legal and policy requirements aimed at mitigating technologically driven risks continue to lag behind what is technologically possible. Increasingly, both governmental agencies and private companies seek to engage in the “ethical” use of data-driven algorithms. To protect consumers and their own reputations, such organizations recognize that they need to do more than simply comply with the law: they need to handle data ethically and responsibly, and they would like to demonstrate the responsible design of their algorithms through convincing reasoning.
Due to the above challenges, confusions, and practical complications, it remains unclear to data scientists, companies, and governmental agencies how to conduct data science ethically. The trade-offs and impossibility theorems show that the fairness definitions cannot all be satisfied together, and that achieving fairness may diminish the performance of the developed algorithms. Fortunately, this does not mean that different measures cannot be partially satisfied at the same time without significantly impacting the performance of the model. Still, which fairness requirements to consider, to what degree, and how to satisfy them remain open questions for a data scientist.
In my opinion, exploratory algorithm design is the way to go for responsible algorithm design. As shown in Figure 5, exploratory algorithm design provides a human-in-the-loop system that, in a cycle, identifies potential performance and unfairness issues, resolves them, guided by the user, to generate a new model, and continues this cycle until a satisfactory model with reasonable performance and fairness (from different aspects) is achieved. Such a system arms the data scientist with a good understanding of the trade-offs, which enables reasoning about the algorithm design, the choice of parameters, and why certain unfairness could not be completely resolved.
Inevitably, any attempt to enable exploratory algorithm design first requires an audit unit and, more importantly, a resolve unit that can fix the problems flagged by the data scientist. Fortunately, as pointed out above, there have been extensive attempts in the past few years to enable these units. There is, however, a major challenge in adapting them here: efficiency. An exploratory design with a human in the loop should be interactive, and, to the best of my knowledge, existing work is far from interactive.
We proposed exploratory design systems for fair and stable rankings that are based on expert-designed scores of the form f_θ(x) = θᵀx [4,5]. For fair ranking, given an initial user-defined weight vector θ, the system returns the vector most similar to θ whose output (ranking) satisfies the fairness requirements. The expert can then accept the system’s suggestion, or use it to explore different options before finalizing their scoring function. Efficiency was the major challenge we faced, as it turned out the problem suffers from the curse of dimensionality. We needed to devise approximation techniques ranging from indexing, space partitioning, early termination, and threshold-based pruning to sampling and Monte Carlo methods. Still, our current system [12] does not fully deliver on its promise of being interactive in adversarial cases. We observed similar efficiency challenges across different tasks, including classification [1] and assignment [2].
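To make the setting concrete, here is a small sketch of the core check, not our systems’ actual algorithms: items are ranked by f_θ(x) = θᵀx and we test whether a protected group receives at least a required share of the top-k positions. The naive random perturbation below stands in for the similarity-preserving search in [4,5], and all names and thresholds are assumptions.

    import numpy as np

    def ranks_fairly(X, groups, theta, k, protected="female", min_share=0.3):
        """Rank items by the linear score theta^T x and check whether the
        protected group holds at least min_share of the top-k positions.
        `groups` is a numpy array of group labels aligned with the rows of X."""
        scores = X @ theta
        top_k = np.argsort(-scores)[:k]
        return np.mean(groups[top_k] == protected) >= min_share

    def suggest_weights(X, groups, theta, k, trials=1000, step=0.05, seed=0):
        """Naive stand-in for the exploration step: perturb theta until the
        induced ranking satisfies the fairness requirement (or give up)."""
        rng = np.random.default_rng(seed)
        for _ in range(trials):
            if ranks_fairly(X, groups, theta, k):
                return theta             # a nearby weight vector that ranks fairly
            theta = theta + step * rng.normal(size=theta.shape)
        return None                      # no fair vector found within the budget

Even this crude search requires re-ranking the data at every step, which hints at why answering such queries at interactive speed in high dimensions is the hard part.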
The database community has been at the forefront of grappling with big data challenges and has developed numerous techniques for efficiently processing and analyzing massive datasets. These techniques often originate from solving core data management challenges but then find their way into effectively addressing the needs of big data analytics. A notable example is boosting the efficiency of machine learning using database techniques as varied as materialization, join optimization, query rewriting for efficiency, query progress estimation, federated databases, etc. Our help is needed to extend these efforts to responsible data science, in particular, for enabling exploratory algorithm design.
I would like to begin the last section with a quote from von Neumann: “Truth… is much too complicated to allow for anything but approximation.” Any system based on approximation and prediction is far from perfect. Data preparation and algorithm design must be responsible, but that is not enough: in the end, to some degree, data and algorithms will still have pitfalls such as bias and unfairness.
Not all data is fit to be used for every data science task, and not every algorithm is fit to suggest a decision for every individual. In my view, the final piece for responsible data science in practice is to arm data and algorithms with shields that identify fitness for use. Fitness-for-use signals enable warnings when, for example, a model outcome is questionable for decision making.
Works such as nutritional labels, led by Julia Stoyanovich and Bill Howe, are initial attempts to provide such shields [21]. For example, our system MithraLabel [22] is an interactive tool that, given a dataset, provides immediate information about the fitness of the dataset for the task at hand, along with warnings for the data scientist when warranted. Some of the work in the area of trustworthy AI may also be useful for generating fitness-for-use warnings. Just like data preparation and exploratory algorithm design, existing work on this component of responsible data science in practice is preliminary and far from satisfactory. Generating (semi-)automatic signals and tools, such as nutritional labels, is necessary for the different steps of the Big data life cycle and for different data science tasks.
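As a toy illustration of a fitness-for-use warning (not MithraLabel’s actual interface), the sketch below flags demographic subgroups whose count in a dataset falls below a coverage threshold, in the spirit of [6,17]; the column names and threshold are assumptions.

    import itertools
    import pandas as pd

    def coverage_warnings(df: pd.DataFrame, attrs, threshold=30):
        """Report (sub)groups over the given attributes with fewer than
        `threshold` rows -- a simple fitness-for-use signal in the spirit of [6,17]."""
        warnings = []
        for r in range(1, len(attrs) + 1):
            for combo in itertools.combinations(attrs, r):
                counts = df.groupby(list(combo)).size()
                for values, count in counts.items():
                    if count < threshold:
                        key = values if isinstance(values, tuple) else (values,)
                        warnings.append((dict(zip(combo, key)), count))
        return warnings

    # Hypothetical usage:
    # for group, count in coverage_warnings(df, ["gender", "ethnicity"]):
    #     print(f"warning: only {count} rows for {group}; outcomes may be unreliable")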
In this article, I investigated the challenges towards enabling responsible data science in practice, underscoring the critical role of data management. In my opinion, achieving this goal requires user-friendly means for data preparation, exploratory algorithm design, and providing fitness-for-use signals. The data management community has the expertise and is needed to successfully develop these systems.
Abolfazl Asudeh is an assistant professor in the Computer Science department of the University of Illinois at Chicago and the director of the Innovative Data Exploration Laboratory (InDeX Lab). His research spans different aspects of Big Data exploration and data science, including data management, information retrieval, and data mining, for which he aims to find efficient, accurate, and scalable algorithmic solutions. Responsible data science and algorithmic fairness are his current research focus.
[1] Hadis Anahideh, Abolfazl Asudeh, and Saravanan Thirumuruganathan. Fair active learning. CoRR, abs/2001.01796, 2020.
[2] Abolfazl Asudeh, Tanya Berger-Wolf, Bhaskar DasGupta, and Anastasios Sidiropoulos. Maximizing coverage while ensuring fairness: a tale of conflicting objective. CoRR, abs/2007.08069, 2020.
[3] Abolfazl Asudeh and HV Jagadish. Fairly evaluating and scoring items in a data set. Proceedings of the VLDB Endowment, 13(12):3445-3448, 2020.
[4] Abolfazl Asudeh, HV Jagadish, Gerome Miklau, and Julia Stoyanovich. On obtaining stable rankings. Proceedings of the VLDB Endowment, 12(3):237-250, 2018.
[5] Abolfazl Asudeh, HV Jagadish, Julia Stoyanovich, and Gautam Das. Designing fair ranking schemes. In Proceedings of the 2019 International Conference on Management of Data, pages 1259-1276, 2019.
[6] Abolfazl Asudeh, Zhongjun Jin, and HV Jagadish. Assessing and remedying coverage for a given dataset. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 554-565. IEEE, 2019.
[7] Solon Barocas and Andrew D Selbst. Big data’s disparate impact. Calif. L. Rev., 104:671, 2016.
[8] Flavio Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R Varshney. Optimized pre-processing for discrimination prevention. In Advances in Neural Information Processing Systems, pages 3992-4001, 2017.
[9] Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data, 5(2):153-163, 2017.
[10] Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259-268. ACM, 2015.
[11] Sorelle A Friedler, Carlos Scheidegger, Suresh Venkatasubramanian, Sonam Choudhary, Evan P Hamilton, and Derek Roth. A comparative study of fairness-enhancing interventions in machine learning. In Proceedings of the conference on fairness, accountability, and transparency, pages 329-338, 2019.
[12] Yifan Guan, Abolfazl Asudeh, Pranav Mayuram, HV Jagadish, Julia Stoyanovich, Gerome Miklau, and Gautam Das. MithraRanking: A system for responsible ranking design. In Proceedings of the 2019 International Conference on Management of Data, pages 1913-1916, 2019.
[13] Tatsunori B Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. arXiv preprint arXiv:1806.08010, 2018.
[14] Hosagrahar V Jagadish, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstantinou, Jignesh M Patel, Raghu Ramakrishnan, and Cyrus Shahabi. Big data and its technical challenges. Communications of the ACM, 57(7):86-94, 2014.
[15] Faisal Kamiran and Toon Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1-33, 2012.
[16] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807, 2016.
[17] Yin Lin, Yifan Guan, Abolfazl Asudeh, and HV Jagadish. Identifying insufficient data coverage in databases with multiple relations. Proceedings of the VLDB Endowment, 13(11), 2020.
[18] Lydia T Liu, Sarah Dean, Esther Rolf, Max Simchowitz, and Moritz Hardt. Delayed impact of fair machine learning. arXiv preprint arXiv:1803.04383, 2018.
[19] Alexandra Olteanu, Carlos Castillo, Fernando Diaz, and Emre Kiciman. Social data: Biases, methodological pitfalls, and ethical boundaries. Frontiers in Big Data, 2:13, 2019.
[20] Babak Salimi, Luke Rodriguez, Bill Howe, and Dan Suciu. Interventional fairness: Causal database repair for algorithmic fairness. In Proceedings of the 2019 International Conference on Management of Data, pages 793-810, 2019.
[21] Julia Stoyanovich and Bill Howe. Nutritional labels for data and models. IEEE Data Eng. Bull., 42(3):13-23, 2019.
[22] Chenkai Sun, Abolfazl Asudeh, HV Jagadish, Bill Howe, and Julia Stoyanovich. MithraLabel: Flexible dataset nutritional labels for responsible data science. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 2893-2896, 2019.
Copyright © 2021, Abolfazl Asudeh. All rights reserved.