November 20, 2015
What are the skills and competencies that citizens should possess to be considered data-literate? These skills should allow a person to critically judge data collection, the raw data that it is based on, the analysis processes, and the quality of the results. These skills are needed to help individuals make informed decisions about, e.g., the implications of disclosing some personal information to an application, the risk/benefit trade-offs of vaccines, or the comparative impact of government investment in foreign aid vs. military spending. Data literacy skills are particularly important for decision makers, who currently often lack them.
Important questions are how to teach data literacy in schools and which concepts to cover at what age. Children start accessing information on the Web at a very early age, and so it is important to expose them to simple data analysis experiments as early as elementary school. At this age one can explain basic data quality concepts to students, demonstrating that they need to be skeptical of data found on the Web, and explaining the value of credit attribution. Middle school and high school students should be gradually exposed to additional data literacy concepts, including how to check the quality of analysis protocols and data visualization methods. Students should be taught basic principles of probability and statistics, including, e.g., the distinction between correlation and causation. This curriculum should also focus on turning students from informed spectators into actors, by asking them to practice building their own datasets and analyzing them.
Nonetheless, while direct government regulation is definitely a complex issue, and may not be feasible today, there is still a need for government awareness and involvement, to ensure that responsible data analysis receives proper attention and to articulate high-level principles as guidelines. Government involvement should then combine regulation and incentives. For example, governments may provide incentives to organizations that share their data and code, in support of transparency and easier verification of data-responsible practices. Of course, full openness of data and code cannot be required, because the privacy of individuals and the competitive advantage of companies must be protected. For this reason, governments should also support other means of facilitating responsibility verification. Finally, as already mentioned, the integration of data from different applications and different companies is the most serious issue. The control and limitation of data integration should also be part of the government mission.
A series of initiatives is converging towards giving individual users more control over how others gather and use their personal data. Examples of such initiatives are Smart Disclosure in the U.S., enabling more than 40 million Americans to download their own health data using the “Blue Button” on their health insurance provider’s website, and MesInfos in France, where several large companies including network operators, banks and retailers have agreed to explore and experiment with sharing with customers the personal data they hold about them. Such moves clearly improve transparency.
While design and verification are related, they also differ in an important way. In the case of design, data and algorithms are typically fully available (responsibility from within, in white-box mode). In contrast, verification may have to work with limited access to data and code, typically through a predefined set of access methods (responsibility from the outside, in black-box mode). In both cases, the specific properties to be designed or verified must be specified by laws, contracts or commitments by a company to its users. These properties must be translated into technical specifications, which must in turn be verified automatically by algorithms. This raises a number of technical challenges, e.g., designing a setting where the responsibility of a recommendation engine is controlled with limited disclosure of its code and its data, or designing a fair ranking algorithm that does not favor only the most popular items but also strives for diversity in results.
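As a rough illustration of the last challenge, the sketch below re-ranks items greedily while discounting categories that are already represented in the result. It is only one simple heuristic among many, not a method proposed in this article; the item schema, the function name diverse_rank and the penalty parameter are assumptions made for the example.

```python
from collections import Counter


def diverse_rank(items, k, penalty=0.5):
    """Greedily pick k items, discounting categories already chosen.

    items: list of (name, category, popularity) tuples (hypothetical schema).
    penalty: factor in (0, 1]; 1.0 reproduces a pure popularity ranking.
    """
    remaining = list(items)
    seen = Counter()  # how many items of each category are already ranked
    ranking = []
    while remaining and len(ranking) < k:
        # Popularity is multiplied by `penalty` once per earlier pick
        # from the same category.
        best = max(remaining, key=lambda it: it[2] * penalty ** seen[it[1]])
        ranking.append(best)
        seen[best[1]] += 1
        remaining.remove(best)
    return ranking


# Toy catalog, invented for illustration only.
catalog = [
    ("a", "pop", 100),
    ("b", "pop", 90),
    ("c", "pop", 80),
    ("d", "jazz", 50),
    ("e", "folk", 48),
]

by_popularity = [name for name, _, _ in diverse_rank(catalog, 3, penalty=1.0)]
with_diversity = [name for name, _, _ in diverse_rank(catalog, 3, penalty=0.5)]
```

With penalty set to 1.0 the top three items all come from the most popular category; with 0.5, less popular categories also reach the top of the list. How strongly to discount is a policy choice that the technical specification would have to state explicitly.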
To verify the responsibility of an algorithm a posteriori, one can either analyze its program or test its behavior on different inputs. Program analysis is closely related to theorem proving in mathematics, while observation of behavior is related to how real-world phenomena such as a heart, a galaxy or an atmospheric cloud are studied in the sciences. Program analysis is complex and, with current technology, limited to small critical pieces of software, for example in the aerospace or nuclear industries. However, this task becomes more feasible if it is provisioned for at application design time. This approach, which we term responsibility by design, is in line with a recent approach to ensuring privacy, called privacy by design. It is also encouraging that it is usually easier to check a proof that a program is behaving in a particular way than it is to find such a proof. Testing the behavior of a complex program by observation is also not easy because, depending on the input, a very large number of different outputs are possible. Powerful verification methods will have to rely on a combination of program analysis and observation, both of which are complex and computationally expensive. This is why today most organizations rarely invest in verifying security and privacy properties, and more generally in properties that ensure responsibility.
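Verification by observation can be sketched as follows: query a system many times with inputs that differ only in one attribute, and count how often the output changes. This is a minimal illustration under strong assumptions, not a complete verification method. The function opaque_score is a hypothetical stand-in for a system that can be queried but not inspected, and the attribute names are invented.

```python
import random


def opaque_score(applicant):
    """Hypothetical stand-in for a system we can query but not inspect."""
    score = 0.6 * applicant["income"] / 100_000
    score += 0.4 * applicant["years_employed"] / 40
    if applicant["group"] == "B":  # hidden dependence on a protected attribute
        score -= 0.05
    return score


def probe_sensitivity(model, attribute, values, trials=1000, seed=0):
    """Fraction of random inputs whose output changes when only
    `attribute` is varied over `values`."""
    rng = random.Random(seed)
    changed = 0
    for _ in range(trials):
        base = {
            "income": rng.uniform(10_000, 100_000),
            "years_employed": rng.randint(0, 40),
        }
        # Query the model once per value, keeping everything else fixed.
        outputs = {model({**base, attribute: v}) for v in values}
        if len(outputs) > 1:
            changed += 1
    return changed / trials


sensitivity = probe_sensitivity(opaque_score, "group", ["A", "B"])
```

A result near zero suggests, but does not prove, that the output is independent of the probed attribute on the sampled inputs; such tests cover only the inputs that were tried, which is why the text above argues for combining observation with program analysis.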
As is apparent from the discussion so far, algorithms that are important here are decidedly data-centric. They may even rely on data during their development. For example, a machine-learning algorithm is trained on data before it can be used. Data is the basic material on which everything in this environment is based, and it is thus extremely important to reason about the responsibility properties of the data itself. Another essential ingredient is metadata accompanying the data that guarantees its authenticity, explains its origin and history of derivation (known as provenance), and, more generally, assigns a meaningful interpretation to data. To both design and verify responsible data analysis environments, technology must be developed to answer the following questions. Is a particular dataset biased? Are the results of a particular data analysis method reliable with high confidence?
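One very simple way to start on the first question is to compare how often each group appears in a dataset with its share in a reference population. The sketch below assumes that such reference shares are known and that records are dictionaries with a group attribute; both are assumptions for illustration, and a gap of this kind is only one of many possible notions of bias.

```python
from collections import Counter


def representation_gap(records, attribute, reference):
    """Observed share minus reference share for each group.

    records: list of dicts (hypothetical schema).
    reference: dict mapping group -> expected share in the population.
    Positive values mean the group is over-represented in the dataset.
    """
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    return {
        group: counts.get(group, 0) / total - share
        for group, share in reference.items()
    }


# Toy data, invented for illustration only.
sample = [{"area": "urban"}] * 8 + [{"area": "rural"}] * 2
gaps = representation_gap(sample, "area", {"urban": 0.5, "rural": 0.5})
```

Here the urban group is over-represented by about 30 percentage points relative to the assumed reference. Whether such a gap matters depends on what the data is used for, which is where provenance metadata becomes important.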
Big data technology has immense power. This power comes with a great risk to our society if technology is used irresponsibly. All stakeholders of the big data ecosystem, including scientists and engineers, but also commercial companies, users and governments, have a responsibility to ensure that technology is used in a way that is fair and transparent, and that it is equally available to all. Let us coordinate our efforts along the dimensions of education, public policy, user organization and technology to turn the promise of big data into societal progress!
Copyright @ 2015, Serge Abiteboul, Julia Stoyanovich, All rights reserved.