Figure 1 shows a portion of a relational table contained in a real, large information system. The table concerns the customers of an organization, and each row stores data about a single customer. The first column contains the customer's code (if the code is negative, then the record refers to a special customer, called “fictitious”), columns 2 and 3 specify the time interval of validity for the record, and ID_GROUP indicates the group the customer belongs to (if the value of FLAG_CP is “S”, then the customer is the leader of the group, and if FLAG_CF is “S”, then the customer is the controller of the group). FATTURATO is the annual turnover (but the value is valid only if FLAG_FATT is “S”). Obviously, each notion mentioned above (like “fictitious”, “group”, “leader”, etc.) has a specific meaning in the organization, and understanding that meaning is crucial if one wants to correctly manage the data in the table and extract information from it. Similar rules hold for the other 47 columns that, for lack of space, are not shown in the figure.
Figure 1: A portion of the Customer table in a database of a large organization.
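To make the point concrete, here is a minimal Python sketch of the implicit rules that a user must know before reading a single row of the table. The column names ID_GROUP, FLAG_CP, FLAG_CF, FATTURATO and FLAG_FATT come from the figure; the names `customer_code`, `valid_from` and `valid_to`, the string dates, and the value “N” for an unset flag are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class CustomerRow:
    # Hypothetical names: the text does not name the first three columns.
    customer_code: int
    valid_from: str   # start of the validity interval (format is an assumption)
    valid_to: str     # end of the validity interval (format is an assumption)
    id_group: int     # ID_GROUP: the group the customer belongs to
    flag_cp: str      # FLAG_CP: "S" means the customer leads the group
    flag_cf: str      # FLAG_CF: "S" means the customer controls the group
    fatturato: float  # FATTURATO: annual turnover
    flag_fatt: str    # FLAG_FATT: "S" means FATTURATO is valid


def is_fictitious(row: CustomerRow) -> bool:
    """A negative code marks a special, "fictitious" customer."""
    return row.customer_code < 0


def group_roles(row: CustomerRow) -> Set[str]:
    """Return the roles the customer plays in the group identified by ID_GROUP."""
    roles = set()
    if row.flag_cp == "S":
        roles.add("leader")
    if row.flag_cf == "S":
        roles.add("controller")
    return roles


def valid_turnover(row: CustomerRow) -> Optional[float]:
    """Return the annual turnover, or None when FLAG_FATT says it is not valid."""
    return row.fatturato if row.flag_fatt == "S" else None


if __name__ == "__main__":
    row = CustomerRow(-7, "2010-01-01", "2010-12-31", 42, "S", "N", 1500.0, "N")
    print(is_fictitious(row), group_roles(row), valid_turnover(row))
```

None of these rules is visible in the table schema itself, which is exactly why they end up scattered across application code and the heads of a few experts.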
Those who have experience with large databases, or with databases that are part of large information systems, will not be surprised to see such complexity in a single data structure. Now, think of a database with many tables of this kind, and try to imagine a poor end user accessing such tables to extract useful information. The problem is even more severe if one considers that information systems in the real world use different (often many) heterogeneous data sources, both internal and external to the organization. Finally, if we add to the picture the (inevitable) need to deal with big data, and consider in particular the two v’s of “volume” and “velocity”, we can easily understand why effectively accessing, integrating and managing data in complex organizations is still one of the main issues faced by the IT industry today.
Issues in governing complex information systems
The above is a simple example motivating the claim that governing the resources (data, meta-data, services, processes, etc.) of modern information systems is still an outstanding problem. I would like to go into more detail on three important aspects related to this issue.
Accessing and querying data. Although the initial design of a collection of data sources might be adequate, corrective maintenance actions tend to re-shape the sources into a form that often diverges from the original structure. The sources are also frequently changed to adapt to specific, application-dependent needs. Analogously, applications are continuously modified to accommodate new requirements, and guaranteeing their seamless usage within the organization is costly. The result is that the data stored in different sources and the processes operating over them tend to be redundant, mutually inconsistent, and obscure for large classes of users. So, query formulation often requires interacting with IT experts who know where the data are and what they mean in the various contexts, and can therefore translate the information need expressed by the user into appropriate queries. It is not rare to see organizations where this process requires domain experts to send a request to the data management staff and wait for several days (or even weeks, at least in some Public Administrations in Italy…) before they receive a (possibly inappropriate) query in response. In summary, it is often exceedingly difficult for end users to single out exactly the data that are relevant for them, even though they are perfectly able to describe their requirements in terms of business concepts.
Data quality. It is often claimed that data quality is one of the most important factors in delivering high-value information services. However, the above-mentioned scenario poses several obstacles to the goal of even checking data quality, let alone achieving a good level of quality in information delivery. How can we possibly specify data quality requirements, if we do not have a clear understanding of the semantics that data should carry? The problem is sharpened by the need to connect to external data, originating, for example, from business partners, suppliers, clients, or even public sources. Again, judging the quality of external data, and deciding whether to reconcile possible inconsistencies or simply add such data as different views, cannot be done without a deep understanding of their meaning. Note that data quality is also crucial for opening data to external organizations. The demand for greater openness is irresistible nowadays. Private companies are pushed to open their resources to third parties, so as to favor collaborations and new business opportunities. In public administrations, opening up data is underpinning public service reforms in several countries, by offering people informed choices that simply did not exist before, thus driving improvements in the services to citizens.
Process and service specification. Information systems are crucial artifacts for running organizations, and organizations rely not only on data, but also, for instance, on processes and services. Designing, documenting, managing, and executing processes is an important aspect of information systems. However, specifying what a process or service does, or which characteristics it is supposed to have, cannot be done correctly and comprehensively without a clear specification of which data the process will access, and how it will possibly change such data. The difficulties of doing that in a satisfactory way come from various factors, including the lack of modeling languages and tools for describing processes and data holistically. However, the problems related to the semantics of data that we discussed above undoubtedly make the task even harder.
A (new?) paradigm
In the last five years, I (and my group in Rome) have been working on a new paradigm addressing these issues, based on the use of knowledge representation and reasoning techniques, and I want to share my excitement about it with the readers of this blog. The paradigm is called “Ontology-based Data Management” (OBDM), and requires structuring the information system into four layers.
The distinguishing feature of the whole approach is that users of the system will be freed from all the details of how to use the resources, as they will express their needs in terms of the DKB. The system will reason about the DKB and the mappings, and will reformulate the needs in terms of appropriate calls to services provided by resources. Thus, for instance, a user query will be formulated over the domain ontology, and the system will reason upon the ontology and the mappings to issue suitable queries over the data sources, which will compute the answers to the original user query.
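The query-answering idea can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration: the concept names, the two subclass axioms, the source tables `customer_tab` and `partner_leaders`, and the mappings. Actual OBDM systems rely on far richer ontology languages and rewriting algorithms; this only shows the shape of the process, in which a query over a concept is expanded using the ontology and then unfolded through the mappings into SQL over the sources.

```python
import sqlite3

# Hypothetical ontology: (sub, sup) means every instance of sub is an instance of sup.
SUBCLASS_OF = [
    ("GroupLeader", "Customer"),
    ("GroupController", "Customer"),
]

# Hypothetical mappings: each concept is populated by SQL queries over the sources.
MAPPINGS = {
    "Customer": ["SELECT code FROM customer_tab"],
    "GroupLeader": [
        "SELECT code FROM customer_tab WHERE flag_cp = 'S'",
        "SELECT code FROM partner_leaders",
    ],
}


def subsumed_by(concept):
    """Return the concept plus every concept the ontology places below it."""
    found = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for sub, sup in SUBCLASS_OF:
            if sup == current and sub not in found:
                found.add(sub)
                frontier.append(sub)
    return found


def rewrite(concept):
    """Unfold a query over a concept into one SQL query over the sources."""
    parts = [
        sql
        for name in sorted(subsumed_by(concept))
        for sql in MAPPINGS.get(name, [])
    ]
    return " UNION ".join(parts)


def answer(conn, concept):
    """Compute the instances of a concept by running the rewritten SQL."""
    sql = rewrite(concept)
    if not sql:
        return set()
    return {row[0] for row in conn.execute(sql)}


def build_demo_sources():
    """Create two tiny in-memory source tables (invented data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer_tab (code INTEGER, flag_cp TEXT)")
    conn.executemany("INSERT INTO customer_tab VALUES (?, ?)", [(1, "S"), (2, "N")])
    conn.execute("CREATE TABLE partner_leaders (code INTEGER)")
    conn.execute("INSERT INTO partner_leaders VALUES (3)")
    return conn


if __name__ == "__main__":
    conn = build_demo_sources()
    print(rewrite("Customer"))
    print(sorted(answer(conn, "Customer")))
```

In this sketch, the user asks only for `Customer`. The leader stored in `partner_leaders` is returned as well, because the ontology states that every group leader is a customer, even though no mapping links that table to `Customer` directly.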
As you can see, the heart of the approach is the DKB, and the core of the DKB is the ontology. So, what is new? Indeed, I can almost hear many of you saying: what is the difference with data integration (where the global schema plays the role of the ontology)? And what is the difference with conceptual modeling (where the conceptual schema plays the role of the ontology)? And what about Knowledge Representation in AI (where the axioms of the knowledge base play the role of the DKB)? The answer is simple: almost none. Indeed, OBDM builds on all the above disciplines (and others), but with the goal of going beyond what they currently provide for solving the problems that people encounter in the governance of complex information systems. At the same time, there are a few (crucial, at least for me) facts that make OBDM a novel paradigm to experiment with and study. Here is a list of the most important ones:
A lot more to do
A few research groups are experimenting with OBDM in practice (see, for example, the Optique IP project, financed by the Seventh Framework Program (FP7) of the European Commission). In Rome, we are involved in applied projects both with Public Administrations and with private companies. One of the experiences we are carrying out is with the Department of Treasury of the Italian Ministry of Economy and Finance. In this project, three ontology experts from our department worked with three domain experts for six months, and built an ontology of 800 elements, with 3000 DL-Lite axioms, and 800 mapping assertions to about 80 relational tables. The ontology is now used as a common framework for all the applications, and will constitute the main document specifying the requirements for the restructuring of the information system that will be carried out in the near future. We are actually lucky to live in Rome, not only because it is a magnificent city, but also because Italian Public Administrations, many of which are located in the Eternal City, provide perfect examples of all the problems that make OBDM interesting and potentially useful…
The first experiences we have conducted are very promising, but OBDM is a young paradigm, and therefore it needs attention and care. This means that there are many issues to be addressed to make it work effectively in practice. Let me briefly illustrate some of them. One big issue is how to build and maintain the ontology (and, more generally, the DKB). I know that this is one of the most important criticisms of all the approaches requiring a considerable modeling effort. My answer is that all modeling efforts are investments, and when we judge investments we should talk not only about costs, but also about benefits. Also, take into account that OBDM works in a “pay-as-you-go” fashion: users gain real advantages even with a very incomplete domain description, as the system can reason about an incomplete specification, and try to get the best out of it. Another important issue is evolution. Evolution in OBDM concerns not only the data at the sources (updates), but also the ontology and the mappings. Indeed, both the domain description and the resources continue to evolve, and all the components of the system should keep up with these modifications. Not surprisingly, this is one issue where more research is still desperately needed. Overall, the DKB and the mappings constitute the meta-data of the OBDM system, and in complex organizations, such meta-data can be huge and difficult to control and organize. To use fashionable terminology, with OBDM we face not only the problem of Big Data, but also the problem of Big Meta-Data. Another issue that needs to be further studied and explored is the relationship between the static aspects and the dynamic aspects of the DKB, together with the problem of mapping processes and services specified at the conceptual level to computational resources in applications.
I really hope that this blog post has drawn your attention to OBDM, and that you will consider looking at it more closely, for example by carrying out some experiments, or by doing research on at least some of its many open problems.
Blogger’s Profile:
Maurizio Lenzerini is a full professor in Computer Science and Engineering at the University of Rome La Sapienza, where he leads a research group on Databases and Artificial Intelligence. His main research interests are in database theory, data and service integration, ontology languages, knowledge representation and reasoning, and component-based software development. He is a former Chair and a current member of the Executive Committee of ACM PODS (Principles of Database Systems). He is an ACM Fellow, an ECCAI (European Coordinating Committee for Artificial Intelligence) Fellow, and a member of the Academia Europaea – The Academy of Europe.