Letizia Tanca

Wisdom: a “double V” for Big Data

Analytics, Big Data, data exploration
Some months ago, I found an exciting piece of news on the Web: a top manager proclaimed [1]: “If the most important thing you offer is data, you are in trouble. (….) Algorithms is where the real value lies. Algorithms define action.”

How interesting! It seems to suggest that everybody, including computer scientists and in particular database people, had not realized that “just data” is not enough. Good news, but certainly no news for our community, whose general objective is to find out how to store data and process it, and which has repeated this mantra since the start of the Big Data ferment: just see [2], as well as the tremendous number of papers that have been written to date on this topic. So, let me embark on the umpteenth attempt to say something intelligent, which is hard, considering all the contributions that have been delivered already.

Make end users wiser

In Italian, the letter W is called “double V”, instead of “double U” as in English. That is why I propose, alongside the other V’s, a double one, for WISDOM: not only do we want to make sense of the data, whether they are big or not, but we can, and should, extract from them a worth that makes us wiser, doubling the Value.

In order to do so, analytics is not enough. Timo Elliott writes in his blog [3]: “What’s the difference between Business Analytics and Business Intelligence? The correct answer is: everybody has an opinion, but nobody knows, and you shouldn’t care.” A more sober answer might be that BI is a widely abused term, often adopted to denote very simplistic activities for deriving insight from the data, while, in fact, BI should involve something more than analytics, that is, a mix of the two activities of induction and deduction that ultimately mimics human reasoning and constitutes the two-fold basis of Artificial Intelligence. Thus, my first thesis is: deriving wisdom from the data means engaging in research that always keeps an open mind towards both activities.

As Oren Etzioni says [4]: “Background knowledge, reasoning, and more sophisticated semantic models are necessary to take predictive accuracy to the next level.” Note that Inductive Databases [5], and especially Inductive Logic Programming (ILP) [6,7], have pursued the formalization of this since the 1990s. Inductive Databases integrate data with data mining results, in such a way that the discovered patterns become first-class citizens and are thus manipulated together with the data. ILP, in turn, integrates inductive learning and Logic Programming with the aim of constructing logical theories from positive and negative observations, and thus obtaining new knowledge “from experience”. Today, the database community can make a very valuable contribution: the data available for knowledge induction are so abundant that, provided that the sources and means of extraction are reliable, the knowledge obtained is based on massive evidence. Therefore, we are called to devise suitable computation and storage means, so that these ambitious techniques can be reliably applied to the amount of information that makes the difference.

The second, and no less important, point I want to raise is related to usability by end users. In his SIGMOD 2007 keynote [8], Jag Jagadish underlined the shortcomings of database systems from the usability viewpoint, not only for end users but even for experienced programmers. He suggested solutions to problems like unknown query languages and database schemas, unknown data values and data provenance, and complexity of the schema. Further aspects he pointed out are the utility of result explanations, support for understanding intermediate results, and the possibility to add and modify structure during the database lifetime. So the second point I advocate here is a similar attitude, allowing end users to face the challenge posed by information abundance (often excess) and to put analytics into direct use in order to “become wiser”.

As a researcher, I have always been fascinated by these issues. After my early work on deductive databases, many of the methods and systems I proposed contain a mix of induction and deduction to help users, and especially end users, access data in an easy way and make sense of them. For example, in the last decade, we have worked on helping end users with the problem of information overload [9], by proposing a design model and associated methodology for context-aware databases. With context-awareness, each user is served, in each situation of use (context), the information that is most appropriate, thus avoiding disorientation. In the early years of this research, we adopted a top-down approach completely based on the a-priori design of the information associated with each considered context (the contextual view). More recently, we have been investigating the possibility of finding the information related to each context by mining the data previously chosen by the user in that given context, possibly ranked in order of preference. The mined information is then further manipulated, to produce the actual contextual data definition expressed intensionally as a view, so that new data added to the database can also be taken into account when that context is encountered again. In this research, we find both aspects: the conceptual machinery, namely the association of inductive and deductive reasoning, and the ultimate target, i.e. supporting the end user.

The exploratory computing vision

Recently, I have been attracted by the richness of challenges posed by database exploration [10, 11], a paradigm that takes different shapes, among which the most closely related to data analysis is the one motivated by the exploratory computing vision [12]. In exploratory computing, HCI and DB research are equally crucial for the production of systems that adequately support end users in making decisions, investigating and seeking inspiration, comparing data, verifying a research hypothesis, or just browsing documents and learning something new.

Today, the tools that are available for similar purposes typically come from these two realms. In HCI, the most similar approach is Faceted Search, a strategy for accessing sets of information items based on a classification system [13]. In particular, information items are classified according to a number of “facets”, roughly corresponding to attributes, possibly hierarchically organized, thus enabling the user to look at the data from multiple viewpoints, often providing succinct visual descriptions of the information. Faceted search has been implemented in many systems, but no emphasis has ever been given to the size of the datasets to be investigated, nor has much been done in the way of adopting sophisticated analytic techniques for adding more insight. However, the paradigm appears promising, especially from the viewpoint of guiding the user in the acquisition of knowledge by means of a walk through the information seen from different viewpoints.

On the other hand, in the DB research community, most data analytics proposals address professional data scientists, who, besides having knowledge of the domain of interest, must master mathematics, statistics, computer science, computer engineering, and programming. Besides, normally the analyst concentrates on one analysis method at a time, for example, adopting a certain data mining or statistical technique to solve a specific problem.

The aim of exploratory computing, which from our viewpoint becomes a genre of database exploration, is instead to assist mainly inexperienced or casual users, the so-called data enthusiasts [14], who need the support of a sophisticated system that guides them in the inspection, maybe starting from simple input queries, and possibly presenting suggestions along the way. The result is a reconnaissance path where, at each step, the system presents the users with relevant and concise facts about the data in order for them to understand better, and progress to the next exploration step. For example, while exploring historical medical data, it might turn out that, in the result of the query Q1 “find all patients whose thyroid function tests are out of range”, the distribution of the age values is different from that of the original dataset. This is a relevant fact, maybe allowing the investigator to spot some relationship between age and thyroid disorders. After this step, the user might ask another query, like Q2 “find all patients whose thyroid function tests are out of range and who are over 60 years old”, to see whether the patients of that age group exhibit any special characteristics.

We may thus describe our exploration as the step-by-step conversation of a user with the system, where each step can refine the previous ones incrementally, gathering new knowledge that fulfills the user's needs, sometimes even unveiling new ones.

In this kind of exploration, the notion of relevance is of paramount importance, since the main system task is to call the user's attention to relevant (or surprising) differences or similarities between the datasets encountered along the route. The use of statistical summaries like distributions may highlight relevant aspects at a glance, allowing the user to see the data at a higher level of abstraction and decide to take one or more further actions. In fact, a distribution can describe a dataset in an approximate, intensional way. Note that a description is called intensional [20] when it exhibits the data properties instead of the data themselves.
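To make the idea of relevance a little more concrete, here is a minimal sketch of how a system might compare the age distribution of a query result (Q1) with that of the whole dataset. Everything in it is an assumption made for illustration: the toy patient data, the age bands, the choice of total variation distance as the measure of difference, and the threshold. Other statistical tests or distances could be used just as well.

```python
from collections import Counter

def age_band(age):
    """Bucket an age into a coarse band (the band edges are an arbitrary choice)."""
    if age < 30:
        return "<30"
    if age < 60:
        return "30-59"
    return "60+"

def distribution(values):
    """Relative frequency of each distinct value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions
    (0 means identical, 1 means no overlap at all)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical toy data: (age, thyroid function test out of range?)
patients = [(25, False), (34, False), (41, False), (47, True), (52, False),
            (58, False), (63, True), (67, True), (72, True), (78, False)]

# Distribution of age bands in the whole dataset and in the result of Q1.
all_ages = distribution([age_band(age) for age, _ in patients])
q1_ages = distribution([age_band(age) for age, out in patients if out])

distance = total_variation(all_ages, q1_ages)
THRESHOLD = 0.2  # assumed cut-off above which the difference is reported
is_relevant = distance > THRESHOLD
print(f"distance = {distance:.2f}, relevant = {is_relevant}")
```

In this toy dataset the older age band is over-represented in the result of Q1, so the difference would be reported to the user as a relevant fact.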

Computing data distributions is one of many ways to assess relevance by looking at concise descriptions: we may add other measures, like entropy, which establishes the “level of variety” of a set of values, or describe the data by means of mined patterns like association rules [15], used to discover relations between attribute values, etc. For instance, in the result of query Q2, the system might discover that (i) “80% of the patients from Lombardia lived between the Second World War and the sixties in the area of Valtellina” and (ii) “70% of patients from Southern Italy are from a seaside location”.
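As a rough illustration of these two measures, the sketch below computes the Shannon entropy of an attribute and the support and confidence of a single association rule of the kind shown in (i). The rows, the attribute names (`region`, `area`) and the helper `rule_stats` are hypothetical, and the rule is checked directly rather than mined; a real system would use a proper rule-mining algorithm.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of values: 0 when all values
    are equal, higher when the values are more varied."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def rule_stats(rows, antecedent, consequent):
    """Support and confidence of the rule antecedent => consequent.
    Both sides are dicts mapping an attribute name to a required value."""
    def matches(row, condition):
        return all(row.get(attr) == value for attr, value in condition.items())

    n_antecedent = sum(1 for row in rows if matches(row, antecedent))
    n_both = sum(1 for row in rows
                 if matches(row, antecedent) and matches(row, consequent))
    support = n_both / len(rows)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical result of Q2, reduced to two attributes.
rows = ([{"region": "Lombardia", "area": "Valtellina"}] * 4
        + [{"region": "Lombardia", "area": "Milano"}]
        + [{"region": "Sicilia", "area": "Palermo"}] * 3
        + [{"region": "Puglia", "area": "Bari"}] * 2)

region_entropy = entropy([row["region"] for row in rows])
support, confidence = rule_stats(
    rows, {"region": "Lombardia"}, {"area": "Valtellina"})
print(f"entropy(region) = {region_entropy:.2f} bits")
print(f"Lombardia => Valtellina: support = {support}, confidence = {confidence}")
```

Here the confidence plays the role of the percentage in a statement like “80% of the patients from Lombardia lived in the area of Valtellina”.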

In the same fashion as in Inductive DBs, these intensional descriptions can be stored to be queried when needed, or computed on the fly to obtain answers that are fast and concise, though potentially partial and approximate. Of course, computing them on the fly entails the need for fast computation techniques, which constitutes one of the big challenges for DB researchers.

Another feature that is badly needed in database exploration is the capability to provide explanations [16] and causal dependencies, whose aim is basically to understand the reasons for query results. Imagine that the user wants to understand “why there is a huge difference between Valtellina and the rest of Lombardia”: by joining the original dataset with one containing data about the Italian regions, an explanation system might discover that Valtellina is very poor in iodine(*). The user will thus learn that iodine is closely related to thyroid disorders, and possibly ask for another explanation, to see why such disorders are also found in seaside locations (rule (ii) extracted from the result of Q2). The explanation facility is thus another formidable support for a user who tries to grasp from the data more than a flat sequence of items. In this field, too, both logical and data mining techniques have been employed, adopting the mix of deduction and induction I have been advocating.

In fact, mixing the two approaches to support end users in gaining more wisdom takes infinite forms and nuances. Just to name some more, think about cooperative query answering [17], where users are supported in query formulation in order to get useful responses from the database; or query relaxation [18], where, in the presence of a query that returns an empty answer, the system suggests alternative, less restrictive queries that might provide a non-empty, still interesting answer; or the Watson system [19], developed in IBM’s DeepQA project, which gathers and analyzes resources on the web to answer questions posed in natural language.
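To give a flavour of query relaxation, the following sketch handles an empty answer by dropping as few conditions as possible until some answer is found. This is a deliberately naive strategy chosen for illustration, not the probabilistic optimization framework of [18]; the patient records, the condition names and the helper functions are all hypothetical.

```python
from itertools import combinations

def run_query(rows, conditions):
    """Return the rows that satisfy every condition (a dict name -> predicate)."""
    return [row for row in rows
            if all(pred(row) for pred in conditions.values())]

def relax(rows, conditions):
    """Suggest relaxed queries: drop the smallest possible number of
    conditions such that the answer becomes non-empty.
    Returns a list of (dropped condition names, answer) pairs."""
    names = list(conditions)
    for n_dropped in range(1, len(names) + 1):
        found = []
        for dropped in combinations(names, n_dropped):
            kept = {k: v for k, v in conditions.items() if k not in dropped}
            answer = run_query(rows, kept)
            if answer:
                found.append((dropped, answer))
        if found:
            return found
    return []

# Hypothetical patient records.
patients = [
    {"id": 1, "age": 72, "region": "Lombardia", "out_of_range": True},
    {"id": 2, "age": 45, "region": "Liguria", "out_of_range": True},
    {"id": 3, "age": 66, "region": "Liguria", "out_of_range": False},
    {"id": 4, "age": 38, "region": "Sicilia", "out_of_range": False},
]

# A query in the spirit of Q2, with one extra condition that empties the answer.
conditions = {
    "out_of_range": lambda row: row["out_of_range"],
    "over_60": lambda row: row["age"] > 60,
    "from_liguria": lambda row: row["region"] == "Liguria",
}

if not run_query(patients, conditions):
    for dropped, answer in relax(patients, conditions):
        ids = [row["id"] for row in answer]
        print(f"without {dropped}: patients {ids}")
```

A realistic system would also rank the suggested relaxations, for instance by how interesting their answers are likely to be for the user, instead of listing all of them.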

In conclusion, keeping the end user (or, if you prefer, the data enthusiast) continuously in mind will help us stick closer to the way people reason, and thus enrich our research with tools that produce wisdom in the sense I explained. However, while various interesting challenges posed by this view are actually common to other research areas, the real leap forward must be provided by our community: efficacy for end users, efficient storage systems, and data-intensive computation are what we should always have in mind, and are ultimately what we are called to provide.

(*) Actually, nowadays people from Valtellina cook using iodized salt.

[1] Sharon Fisher. Gartner: Forget Big Data, the Future is Algorithms. https://www.laserfiche.com/simplicity/gartner-forget-big-data-the-future-is-algorithms/

[2] ACM SIGMOD Blog. Archive for the Big Data category. http://wp.sigmod.org/?cat=11

[3] Timo Elliott. Business Analytics vs Business Intelligence? http://timoelliott.com/blog/2011/03/business-analytics-vs-business-intelligence.html

[4] ACM SIGMOD Blog. The Elephant In The Room: Getting Value From Big Data. http://wp.sigmod.org/?p=1519

[5] Luc De Raedt. A Perspective on Inductive Databases. SIGKDD Explorations 4(2): 69-77 (2002)

[6] Ehud Y. Shapiro. Inductive Inference of Theories from Facts. Computational Logic – Essays in Honor of Alan Robinson 1991: 199-254

[7] Stephen Muggleton, Luc De Raedt. Inductive Logic Programming: Theory and Methods. J. Log. Program. 19/20: 629-679 (1994)

[8] H.V. Jagadish, Adriane Chapman, Aaron Elkiss, Magesh Jayapandian, Yunyao Li, Arnab Nandi, and Cong Yu. Making Database Systems Usable. In SIGMOD, 2007.

[9] Wikipedia. Information overload. https://en.wikipedia.org/wiki/Information_overload

[10] Magdalini Eirinaki, Suju Abraham, Neoklis Polyzotis, Naushin Shaikh. QueRIE: Collaborative Database Exploration. IEEE Trans. Knowl. Data Eng. 26(7): 1778-1790 (2014)

[11] M. Buoncristiano, G. Mecca, E. Quintarelli, M. Roveri, D. Santoro, and L. Tanca. Database challenges for exploratory computing. SIGMOD Record, vol. 44, no. 2, pp. 17–22, 2015. Available at: http://doi.acm.org/10.1145/2814710.2814714

[12] Paolo Paolini, Nicoletta Di Blas. Exploratory portals: The need for a new generation. DSAA 2014: 581-586

[13] D. Tunkelang. Faceted Search (Synthesis Lectures on Information Concepts, Retrieval, and Services). Morgan and Claypool Publishers, 2009.

[14] K. Morton, M. Balazinska, D. Grossman, and J. D. Mackinlay. Support the data enthusiast: Challenges for next-generation data-analysis systems. PVLDB, 7(6):453–456, 2014.

[15] Mirjana Mazuran, Elisa Quintarelli, Letizia Tanca. Data Mining for XML Query-Answering Support. TKDE 24(8): 1393-1407 (2012).

[16] A. Meliou, S. Roy, and D. Suciu. Causality and explanations in databases (tutorial). PVLDB, 7(13):1715–1716, 2014. Available at: http://www.vldb.org/pvldb/vol7/p1715-meliou.pdf

[17] Chu, W.W., Chen, Q. A structured approach for cooperative query answering. IEEE Transactions on Knowledge and Data Engineering 6(5), 738–749 (1994)

[18] Davide Mottin, Alice Marascu, Senjuti Basu Roy, Gautam Das, Themis Palpanas and Yannis Velegrakis. Query relaxation: A Probabilistic Optimization Framework for the Empty-Answer Problem. PVLDB 2013, Vol.6, No.14, available at http://www.vldb.org/pvldb/vol6/p1762-mottin.pdf

[19] IBM Watson: The Face of Watson https://www.youtube.com/watch?v=WIKM732oEek

[20] Alain Pirotte, Dominique Roelants, Esteban Zimanyi. Controlled Generation of Intensional Answers. TKDE 1991 3(2): 221-236.

Blogger Profile

Letizia Tanca is a full professor at Politecnico di Milano, where she has chaired the Computer Science Area of her department for the last 5 years. She received the PhD in Applied Mathematics and Computer Science in 1988. Currently, she teaches courses on Databases and Information System Technologies and is the author of about 150 publications on databases and database theory, deductive and active databases, graph-based languages, semantic-web information management, context-aware knowledge management and Big Data analytics. On these topics she has offered PhD courses, seminars and invited talks, notably a keynote at ACM SAC 2012. Letizia Tanca has been a referee for several top international journals, associate editor of PVLDB 2014 and a member of the program committee of a large number of international conferences, among which VLDB 2016 and SIGMOD 2017.
She represents her department at the Informatics Europe Association and in the Collaborative Innovation Center agreement between Politecnico and IBM, and is a member of the Steering Committee of the Informatics Europe Department Evaluation Initiative and of the expert pool of the EQANIE European Association. She has been the conference Chair of the ECSS 2011 (European Computer Science Summit of the Informatics Europe association) and has contributed to the Informatics Europe Report on Experimentation in Informatics.

Copyright @ 2016, Letizia Tanca, All rights reserved.
