May 14, 2013
Figure 1 shows a portion of a relational table contained in a real, large information system. The table concerns the customers of an organization, where each row stores data about a single customer. The first column contains her code (if the code is negative, then the record refers to a special customer, called “fictitious”), columns 2 and 3 specify the time interval of validity for the record, ID_GROUP indicates the group the customer belongs to (if the value of FLAG_CP is “S”, then the customer is the leader of the group, and if FLAG_CF is “S”, then the customer is the controller of the group), FATTURATO is the annual turnover (but the value is valid only if FLAG_FATT is “S”). Obviously, each notion mentioned above (like “fictitious”, “group”, “leader”, etc.) has a specific meaning in the organization, and understanding such meaning is crucial if one wants to correctly manage the data in the table and extract information out of it. Similar rules hold for the other 47 columns that, for lack of space, are not shown in the figure.
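To make the point concrete, here is a minimal Python sketch of the interpretation rules just described. It is only an illustration, not code from the actual system: the column names ID_GROUP, FLAG_CP, FLAG_CF, FATTURATO and FLAG_FATT come from the table, while the field name `code` for the first column and the value "N" for an unset flag are assumptions. The validity interval (columns 2 and 3) is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CustomerRow:
    code: int                    # first column; the name "code" is an assumption
    id_group: Optional[int]      # ID_GROUP
    flag_cp: str                 # FLAG_CP
    flag_cf: str                 # FLAG_CF
    fatturato: Optional[float]   # FATTURATO (annual turnover)
    flag_fatt: str               # FLAG_FATT


def is_fictitious(row: CustomerRow) -> bool:
    # A negative code marks a "fictitious" customer.
    return row.code < 0


def is_group_leader(row: CustomerRow) -> bool:
    # FLAG_CP = "S" means the customer is the leader of the group.
    return row.flag_cp == "S"


def is_group_controller(row: CustomerRow) -> bool:
    # FLAG_CF = "S" means the customer is the controller of the group.
    return row.flag_cf == "S"


def valid_turnover(row: CustomerRow) -> Optional[float]:
    # FATTURATO is meaningful only when FLAG_FATT = "S".
    return row.fatturato if row.flag_fatt == "S" else None
```

Even for six columns, the meaning lives in code like this rather than in the table itself, which is exactly the knowledge a user needs before writing a correct query.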
Those who have experience of large databases, or of databases that are part of large information systems, will not be surprised to see such complexity in a single data structure. Now, think of a database with many tables of this kind, and try to imagine a poor final user accessing such tables to extract useful information. The problem is even more severe if one considers that information systems in the real world use different (often many) heterogeneous data sources, both internal and external to the organization. Finally, if we add to the picture the (inevitable) need to deal with big data, and consider in particular the two v’s of “volume” and “velocity”, we can easily understand why effectively accessing, integrating and managing data in complex organizations is still one of the main issues faced by the IT industry nowadays.
The above is a simple example motivating the claim that governing the resources (data, meta-data, services, processes, etc.) of modern information systems is still an outstanding problem. I would like to go into more detail on three important aspects related to this issue.
Accessing and querying data. Although the initial design of a collection of data sources might be adequate, corrective maintenance actions tend to re-shape them into a form that often diverges from the original structure. Also, they are often subject to changes so as to adapt to specific, application-dependent needs. Analogously, applications are continuously modified for accommodating new requirements, and guaranteeing their seamless usage within the organization is costly. The result is that the data stored in different sources and the processes operating over them tend to be redundant, mutually inconsistent, and obscure for large classes of users. So, query formulation often requires interacting with IT experts who know where the data are and what they mean in the various contexts, and can therefore translate the information need expressed by the user into appropriate queries. It is not rare to see organizations where this process requires domain experts to send a request to the data management staff and wait for several days (or even weeks, at least in some Public Administrations in Italy…) before they receive a (possibly inappropriate) query in response. In summary, it is often exceedingly difficult for end users to single out exactly the data that are relevant for them, even though they are perfectly able to describe their requirement in terms of business concepts.
Data quality. It is often claimed that data quality is one of the most important factors in delivering high value information services. However, the above-mentioned scenario poses several obstacles to the goal of even checking data quality, let alone achieving a good level of quality in information delivery. How can we possibly specify data quality requirements, if we do not have a clear understanding of the semantics that data should carry? The problem is sharpened by the need to connect to external data, originating, for example, from business partners, suppliers, clients, or even public sources. Again, judging the quality of external data, and deciding whether to reconcile possible inconsistencies or simply add such data as different views, cannot be done without a deep understanding of their meaning. Note that data quality is also crucial for opening data to external organizations. The demand for greater openness is irresistible nowadays. Private companies are pushed to open their resources to third parties, so as to favor collaborations and new business opportunities. In public administrations, opening up data is underpinning public service reforms in several countries, by offering people informed choices that simply did not exist before, thus driving improvements in the services to the citizens.
Process and service specification. Information systems are crucial artifacts for running organizations, and organizations rely not only on data, but also, for instance, on processes and services. Designing, documenting, managing, and executing processes is an important aspect of information systems. However, specifying what a process/service does, or which characteristics it is supposed to have, cannot be done correctly and comprehensively without a clear specification of which data the process will access, and how it will possibly change such data. The difficulties of doing that in a satisfactory way come from various factors, including the lack of modeling languages and tools for describing process and data holistically. However, the problems related to the semantics of data that we discussed above undoubtedly make the task even harder.
Over the last five years, I (and my group in Rome) have been working on a new paradigm addressing these issues, based on the use of knowledge representation and reasoning techniques, and I want to share my excitement about it with the readers of this blog. The paradigm is called “Ontology-based Data Management” (OBDM), and requires structuring the information system into four layers.
The resource layer is constituted by the existing data sources and applications that are relevant for the organization.
The knowledge layer is constituted by a declarative and explicit representation of the whole domain of interest for the organization, called the domain knowledge base (DKB). The domain is specified by means of a formal and high level description of both its static and dynamic aspects, structured into four components: (i) the ontology, formally describing the information model of the organization and its basic usage primitives, (ii) the specification of atomic operations, representing meaningful and relevant basic actions in the domain, (iii) the specification of operating patterns, describing the sequencing of atomic operations that are considered correct in the various contexts of the organization, and (iv) the processes, where each process is a structured collection of activities producing a specific service or product within the organization.
The mapping layer is a set of declarative assertions specifying how the available resources map to the DKB.
The view layer specifies views over the knowledge layer, both to be provided to internal applications, and to be exposed as open data and open APIs to third parties.
The distinguishing feature of the whole approach is that users of the system will be freed from all the details of how to use the resources, as they will express their needs in the terms of the DKB. The system will reason about the DKB and the mappings, and will reformulate the needs in terms of appropriate calls to services provided by resources. Thus, for instance, a user query will be formulated over the domain ontology, and the system will reason upon the ontology and the mappings to call suitable queries over data sources that will compute the answers to the original user query.
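As a rough illustration of this reformulation step, the following Python sketch answers a query posed over an ontology concept by reasoning on subclass axioms and then assembling SQL from the mappings. Everything in it is hypothetical and chosen only for the example: the `customers` table and its rows, the concept names, and the dictionary-based encoding of axioms and mappings. Real OBDM systems use far richer ontology languages and rewriting algorithms than this simple traversal of subclass axioms.

```python
import sqlite3

# Resource layer: a hypothetical source table (names and rows are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (code INTEGER, id_group INTEGER, flag_cp TEXT);
INSERT INTO customers VALUES (1, 10, 'S'), (2, 10, 'N'), (-3, 11, 'N');
""")

# Knowledge layer: subclass axioms (sub, super) of a toy ontology.
SUBCLASS_OF = [
    ("RegularCustomer", "Customer"),
    ("FictitiousCustomer", "Customer"),
    ("GroupLeader", "Customer"),
]

# Mapping layer: each concept is populated by SQL queries over the sources.
# Note that "Customer" itself has no direct mapping.
MAPPINGS = {
    "RegularCustomer": ["SELECT code FROM customers WHERE code >= 0"],
    "FictitiousCustomer": ["SELECT code FROM customers WHERE code < 0"],
    "GroupLeader": ["SELECT code FROM customers WHERE flag_cp = 'S'"],
}


def subconcepts(concept):
    """Return the concept together with all its (transitive) subclasses."""
    result = {concept}
    changed = True
    while changed:
        changed = False
        for sub, sup in SUBCLASS_OF:
            if sup in result and sub not in result:
                result.add(sub)
                changed = True
    return result


def rewrite(concept):
    """Reformulate a concept query as a union of SQL queries over the sources."""
    queries = [
        sql
        for c in sorted(subconcepts(concept))
        for sql in MAPPINGS.get(c, [])
    ]
    return " UNION ".join(queries)


def answer(concept):
    """Compute the instances of a concept by running the rewritten query."""
    sql = rewrite(concept)
    if not sql:
        return []
    return sorted(row[0] for row in conn.execute(sql))
```

Here `answer("Customer")` returns every customer code even though no mapping mentions `Customer` directly: the answers are obtained only because the system reasons on the axioms before touching the data.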
As you can see, the heart of the approach is the DKB, and the core of the DKB is the ontology. So, what is new? Indeed, I can almost hear many of you saying: what is the difference with data integration (where the global schema plays the role of the ontology)? And what is the difference with conceptual modeling (where the conceptual schema plays the role of the ontology)? And what about Knowledge Representation in AI (where the axioms of the knowledge base play the role of the DKB)? The answer is simple: almost none. Indeed, OBDM builds on all the above disciplines (and others), but with the goal of going beyond what they currently provide for solving the problems that people encounter in the governance of complex information systems. At the same time, there are a few (crucial, at least for me) facts that make OBDM a novel paradigm to experiment with and study. Here is a list of the most important ones:
1. While models in both Conceptual Modeling and Model-Driven Architectures are essentially design-time artifacts that are compiled into databases and software modules once the application design is done, the ontology in OBDM (and more generally the DKB) is a run-time object that is not compiled, but interpreted directly. It is envisioned as the brain of the new functioning of the information system of the organization. This is made possible by the recent advances of the research in automated reasoning, enabling run-time virtualization, which is the basis of most of the techniques needed for an OBDM system. A notable example of how automated reasoning contributes to OBDM is the plethora of rewriting techniques that have been studied recently for making query answering efficient and practical in OBDM.
2. While data integration is generally a “read-only” task, in OBDM the user will express not only queries, but also updates over the ontology (and even processes, since updates will be part of atomic operations and processes), and the mappings will be used to reformulate such updates over the data sources, thus realizing a “write-also” data integration paradigm. Also, mappings in data integration relate data sources to a global schema, whereas in OBDM mappings are used to specify the relationships between all the resources, including services and applications, and the elements of the DKB.
3. While Knowledge Representation and Reasoning techniques are often confined to scenarios where the complexity resides in the rules governing the applications, in OBDM one faces the problem of a huge amount of data in the resource layer, and this poses completely new requirements for the reasoning tasks that the system should be able to carry out (for example, the notion of data complexity is of paramount importance in OBDM).
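The “write-also” idea in item 2 can be pictured with a small Python sketch, under strong simplifying assumptions. The `customers` table, the concept name `GroupLeader`, the value "N" for an unset flag, and the notion of an `UPDATE_MAPPINGS` dictionary pairing each ontology-level update with one SQL statement are all hypothetical; in general, reformulating updates through mappings is much harder than this one-to-one lookup suggests.

```python
import sqlite3

# Resource layer: a hypothetical source table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (code INTEGER PRIMARY KEY, flag_cp TEXT);
INSERT INTO customers VALUES (1, 'N'), (2, 'N');
""")

# Hypothetical update mappings: (action, concept) -> SQL over the source.
UPDATE_MAPPINGS = {
    ("assert", "GroupLeader"): "UPDATE customers SET flag_cp = 'S' WHERE code = ?",
    ("retract", "GroupLeader"): "UPDATE customers SET flag_cp = 'N' WHERE code = ?",
}


def apply_update(action, concept, individual):
    """Reformulate an ontology-level update as an update over the source."""
    sql = UPDATE_MAPPINGS.get((action, concept))
    if sql is None:
        raise ValueError(f"no update mapping for {action} {concept}")
    with conn:  # commit on success, roll back on error
        cursor = conn.execute(sql, (individual,))
    return cursor.rowcount  # number of source rows affected
```

A user would state `apply_update("assert", "GroupLeader", 2)` in the vocabulary of the ontology, without knowing that leadership is encoded as the value "S" in a flag column.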
A few research groups are experimenting with OBDM in practice (see, for example, the Optique IP project, financed by the Seventh Framework Program (FP7) of the European Commission). In Rome, we are involved in applied projects both with Public Administrations, and with private companies. One of the experiences we are carrying out is with the Department of Treasury of the Italian Ministry of Economy and Finance. In this project, three ontology experts from our department worked with three domain experts for six months, and built an ontology of 800 elements, with 3000 DL-Lite axioms, and 800 mapping assertions to about 80 relational tables. The ontology is now used as a common framework for all the applications, and will constitute the main document specifying the requirements for the restructuring of the information system that will be carried out in the near future. We are actually lucky to live in Rome, not only because it is a magnificent city, but also because Italian Public Administrations, many of which are located in the Eternal City, provide perfect examples of all the problems that make OBDM interesting and potentially useful…
The first experiences we have conducted are very promising, but OBDM is a young paradigm, and therefore it needs attention and care. This means that there are many issues to be addressed to make it work effectively in practice. Let me briefly illustrate some of them. One big issue is how to build and maintain the ontology (and, more generally, the DKB). I know that this is one of the most important criticisms of all the approaches requiring a considerable modeling effort. My answer is that all modeling efforts are investments, and when we judge investments we should talk not only about costs, but also about benefits. Also, take into account that OBDM works in a “pay-as-you-go” fashion: users gain interesting advantages even with a very incomplete domain description, as the system can reason about an incomplete specification, and try to get the best out of it. Another important issue is evolution. Evolution in OBDM concerns not only the data at the sources (updates), but also the ontology, and the mappings. Indeed, both the domain description and the resources continue to evolve, and all the components of the system should keep up with these modifications. Not surprisingly, this is one issue where more research is still desperately needed. Overall, the DKB and the mappings constitute the meta-data of the OBDM system, and in complex organizations, such meta-data can be huge and difficult to control and organize. To use fashionable terminology, with OBDM we face not only the problem of Big Data, but also the problem of Big Meta-Data. Another issue that needs to be further studied and explored is the relationship between the static aspects and the dynamic aspects of the DKB, together with the problem of mapping processes and services specified at the conceptual level to computational resources in applications.
I really hope that this blog post has drawn your attention to OBDM, and that you will consider looking at it more closely, for example by carrying out some experiments, or by doing research on at least some of its many open problems.
Maurizio Lenzerini is a full professor in Computer Science and Engineering at the University of Rome La Sapienza, where he is leading a research group on Databases and Artificial Intelligence. His main research interests are in database theory, data and service integration, ontology languages, knowledge representation and reasoning, and component-based software development. He is a former Chair and a current member of the Executive Committee of ACM PODS (Principles of Database Systems). He is an ACM Fellow, an ECCAI (European Coordinating Committee for Artificial Intelligence) fellow, and a member of the Academia Europaea – The Academy of Europe.
Copyright © 2013, Maurizio Lenzerini, All rights reserved.
First of all, thanks for bringing this issue to the forefront. I think it’s getting clear that dealing with data integration and management in a proper way will require that the problem of specifying and managing data semantics be properly addressed. In my opinion, this problem is mostly shunned by the DB research community. Semantic issues tend to be seen as ‘soft’ and difficult to publish. Very little research (again, in my opinion) goes into metadata and related issues. I hope your post will, as you say, bring some attention to this issue.
I also wanted to thank you for posting a real-life example of this issue. The textbook examples normally used do not do justice to the ‘wickedness’ of the problem. I wish we had more examples like this to motivate students and to check proposed solutions against.
All this said, I’m highly skeptical of ontology-based solutions. You bring up several issues that require attention, but they are all “developmental” challenges, i.e. once the paradigm (ontology-based solution) is accepted, one can start looking for technical solutions to them. I’d like to know your opinion on “foundational” challenges, i.e. those that question whether ontology-based solutions are a good way to go or maybe alternatives should be considered. In particular,
-most ontologies rely on a logic-based language (usually, of the DL family). There’s a well-known trade-off between expressibility and complexity here: when we use a language in which we can express what we want/need, we may find reasoning is undecidable. When we use a language where reasoning is feasible, we may not be able to express what we want/need. To be concrete, the issue of negation (for instance, distinguishing between what we know is not the case, and what we don’t know whether it is the case) does not get the attention it deserves (again, my opinion). And, of course, going to data-complexity (or data+query) instead of query-complexity will only make things worse.
-most logic-based languages do not deal well with ambiguity and context dependence. But one can claim that ambiguity and context dependence are *inherent* to any good description of a complex domain: not something to be *eliminated*, because then what is left is not a good model, but something to be dealt with.
-ontologies are static: you mention explicitly the need to deal with change, so you’re aware of this problem. But I don’t think that there’s agreement on how to represent change -at least in the DL community there’s some basic agreement on how to build static models.
-finally, formal, logic models aim for a degree of “completeness”. Since specifying all potential uses of data in advance is not realistic, it seems clear that the only way forward for metadata management is to use OWA semantics. One needs to deal with what happens if one goes ‘over the edge’ of the specification. In relation to this, I wonder if you’re aware of the research in Emergent Semantics that other researchers are proposing (I assume you are, since some proponents are colleagues of yours), and that would seem like a completely different way to go about this problem.
Thanks again for your interesting post and for any comments you care to add.
Thank you Antonio for your very relevant observations. My intention in the blog post was not to refer to a particular solution (e.g., the DL-Lite family) to the problem. And indeed, I tried to be very general, and only convey the basic idea of building a system where the management of data is done through a conceptual model (or, a Domain Knowledge Base), specified declaratively. Also, I intentionally did not refer to formal logic models. However, what I think is crucial is to build systems on precise, sound, and formal bases, and therefore with formal semantics (not necessarily “logic-centered”). Also, I really liked your comment on the need for negation, and for the distinction between what is true in the domain and what the system knows about the domain. We are actually addressing this issue in the work on our OBDM system (if you are interested, I can send you references). On the other hand, I am not sure that OBDM will die because of the limited expressiveness of the modeling languages. All languages have limitations (e.g., SQL cannot express recursion), but this does not mean that they are not useful. Having a solid basis (although somewhat limited in expressiveness) where reasoning can be done does not prevent you from building more expressive functions on top of it, where computing, instead of reasoning, plays the main role. Finally, I completely agree on the need for more research on several issues, the most prominent probably being the evolution of the ontology (conceptual model); and indeed I mentioned this in the post as one issue where more research is still desperately needed. That said, I believe that OBDM is very promising in several respects, as the first real-world experiments show.