November 14, 2018
The recent return of AI summer and the enthusiastic uptake of AI in the commercial world can be loosely attributed to three innovations: Apple’s Siri, Google’s self-driving cars, and IBM Watson’s Jeopardy! win. This enthusiasm stems from the belief that AI will influence a wide range of applications across multiple industry segments. While such enthusiasm is partially justified by current advances in conversational systems, self-driving cars, and facial recognition, far less progress has been made in disrupting core enterprise applications. Industry-specific applications such as Compliance for Finance & Insurance and Fraud Detection in Health Insurance Billing, as well as cross-industry applications such as Enterprise Resource Planning (ERP), have the potential to be greatly influenced by AI. These applications are traditionally built using structured data, the usage and management of which, as revolutionized by Ted Codd’s introduction of relational databases [4], is well understood. However, bringing AI into an enterprise takes more than machine learning and deep learning. Enterprise AI requires a sophisticated intertwining of process and transparent AI technologies, with user feedback powering model adaptation, all wrapped in appropriate governance. While such a lofty endeavor has many applications, here we describe these considerations in the context of one area that we believe is ripe for AI in enterprises: content understanding of unstructured text, or content understanding for short.
A significant role that AI can play in enterprises is to tame unstructured content in the context of new and traditional enterprise applications. While content management systems do exist in enterprises, the seamless integration of detailed information from unstructured content into traditional enterprise applications and business processes remains a major gap. As an illustrative example, contracts play a significant role in both buy-side and sell-side business processes. On the procurement (buy) side alone, contracts are central to virtually all processes, ranging from risk mitigation in early contract termination to negotiation of payment terms. Together with invoices, purchase orders, Requests for Proposals (RFPs), and the like, such unstructured content is central to enterprise processes; unfortunately, it also leaves a large portion of those processes with a significant manual component.
Enterprise content understanding requires a great level of sophistication. For instance, a business contract can consist of hundreds or even thousands of sentences, each defining one or more clauses (e.g., obligation, exclusion, right, requirement). To understand a contract, every sentence needs to be parsed and understood in context. Consider the following sentence from a business contract as an example.
Understanding the above sentence requires not only knowledge of the English language but also some experience in corporate law. However, with careful guidance on what to ignore, the sentence becomes much easier to understand. As depicted below, ignoring the part of the sentence after the comma, it is relatively straightforward, even for non-lawyers, to see that the sentence excludes the Supplier from having to do something. Indeed, a Subject Matter Expert (SME) marked this sentence as “Exclusion for the Supplier”.
Building content understanding models for such complex cases is nontrivial. Even for expert NLP practitioners, understanding a complex sentence such as the one illustrated above and ignoring its irrelevant portion is already a sophisticated task. To be useful in enterprise applications, content understanding models need to parse millions of enterprise documents (e.g., contracts, policies, invoices, RFPs) and determine the complex relationships among them. Furthermore, these models need to be customizable so that enterprise applications can accommodate the idiosyncrasies of individual companies.
Consider the following two sentences.
Company A may deal with Trademark violations differently from other Intellectual Property (IP) rights and thus prefer the first sentence labeled as a Trademark Clause and the second as an IP Clause. Company B, on the other hand, might treat them the same and prefer both labeled as IP Clauses, even though, technically speaking, Trademark is a specific type of Intellectual Property.
Finally, models built to understand such unstructured content are never static: new contracts, new invoices, and new RFPs necessitate that models be built and continuously maintained through a complex life cycle to ensure accuracy and scalability.
In this post, we discuss the complexity of content understanding for real-world AI, based on our experience building several Watson AI services over unstructured content, and we highlight the challenges that arise throughout the AI lifecycle.
As depicted below, the lifecycle of a real-world AI application built based on content understanding consists of the following three stages:
Content Acquisition identifies and acquires the raw source data used to build and evaluate models for the AI application. Content Acquisition is the first critical step in this AI lifecycle with both technical and non-technical implications.
Content in the enterprise is highly heterogeneous in nature; e.g., IBM sells products and services to a large number of enterprises while acquiring products and services from multiple suppliers and vendors. As such, it has to deal with many different types of governing documents (and variations within each type). Training a good model consequently needs representative samples from this wide variety of content. It is unreasonable to expect that such wide representative samples can be obtained all at once; therefore, Content Acquisition is not a single-shot effort but rather a continuous process.
Furthermore, Content Acquisition involves much more than solving difficult technical problems. Legal, business, ethical, and sometimes social aspects are important considerations during the process of acquiring content. For instance, the builder of the AI application is often not the owner of the raw source data, so legal clearance may be required to obtain the data. In addition, the license associated with the data could limit how the data is used (e.g., only for testing vs. model training) as well as whether and for how long the data can be stored. The complexity of Content Acquisition warrants a separate discussion and will not be elaborated further in this post.
Building models for Content Understanding extends the conventional notion of model building far beyond the standard steps of training, evaluation, hyperparameter tuning, and prediction. Instead, it is a sequence of steps, each of which may itself be a sequence of steps, all wrapped in appropriate governance. These hierarchically arranged sequences of steps, wrapped in governance, are what we refer to as the Data Science Process. Broadly speaking, the outer steps in content understanding are (A) Ontology Building, (B) Labeled Data Creation, (C) SME Knowledge Acquisition, (D) Model Development, and (E) Evaluation. We now describe each of them in more detail.
The problem of Content Understanding starts with the definition of an Ontology, i.e., the schema (here, we use the terms Schema and Ontology interchangeably, with apologies to the purists) that defines what is of interest in the underlying content collection. For each domain, such a schema is created by a committee of Subject Matter Experts (SMEs). In this blog we use procurement contracts as the domain from which all examples are cited. In practice, such an Ontology often goes through multiple revisions before stabilizing. To ensure a common understanding of the concepts in the Ontology, it is advisable to maintain written definitions of each concept. Further, it is expected that more concepts will be added and that the definitions of existing concepts will evolve over time. For instance, this document provides written definitions for concepts related to contract parsing in IBM Watson and is updated as the definitions evolve.
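To make this concrete, an ontology with written definitions can be kept as a simple machine-readable table of concepts. The sketch below is purely illustrative (the concept names and definitions are hypothetical, not the actual Watson ontology); the small check at the end enforces the guideline that every concept carries a written definition.

```python
# Hypothetical clause ontology for procurement contracts. Concept names
# and definitions are illustrative only.
ONTOLOGY = {
    "Obligation": {
        "definition": "A clause stating something a party must do.",
        "applies_to": ["Supplier", "Buyer"],
    },
    "Exclusion": {
        "definition": "A clause stating something a party is not required to do.",
        "applies_to": ["Supplier", "Buyer"],
    },
    "Term & Termination": {
        "definition": "A clause governing the duration or ending of the agreement.",
        "applies_to": [],
    },
}

def concepts_missing_definitions(ontology):
    """Return concepts that lack a written definition; as argued above,
    an empty result is a precondition for consistent labeling."""
    return [name for name, spec in ontology.items()
            if not spec.get("definition")]
```

A check like this can be run each time the ontology is revised, so that newly added concepts never ship without a definition.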
Labeled data is used for two purposes: (a) model training and (b) model evaluation. Inconsistency in labeling, whether due to the lack of a clear definition or to different interpretations of the same concept across labelers, can introduce significant noise into model building. It is therefore critical to understand and resolve such label conflicts. Obtaining labeled data for enterprise applications can be expensive; thus, trying to eliminate labeling noise by employing many labelers may be cost-prohibitive.
The following process, involving two labelers, can be an inexpensive and effective way to mitigate the label-conflict problem. For each document to be labeled: (a) assign the document at random to a labeler from a pool of labelers; (b) give the labeled document to a different labeler, who either agrees with the first labeler’s label or suggests an alternative. One can then build a confusion matrix over the pool of documents, where each entry is the conditional probability that the second labeler agrees with the label provided by the first labeler.
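The two-pass process above can be sketched in a few lines. Given the (first label, second label) pairs it produces, each matrix entry is the conditional probability that the second labeler answers c2 when the first labeler said c1; the clause names in the toy example are illustrative.

```python
from collections import Counter

def labeler_confusion(pairs, concepts):
    """Build the confusion matrix described in the post:
    entry [c1][c2] = P(second labeler says c2 | first labeler said c1)."""
    counts = Counter(pairs)
    totals = Counter(first for first, _ in pairs)
    return {
        c1: {c2: (counts[(c1, c2)] / totals[c1] if totals[c1] else 0.0)
             for c2 in concepts}
        for c1 in concepts
    }

# Toy example with two clause types; the second labeler disagrees once.
pairs = [
    ("Obligation", "Obligation"),
    ("Obligation", "Exclusion"),   # disagreement
    ("Obligation", "Obligation"),
    ("Exclusion", "Exclusion"),
]
matrix = labeler_confusion(pairs, ["Obligation", "Exclusion"])
# A diagonal entry below 1.0 (here, 2/3 for Obligation) signals
# imperfect agreement worth investigating.
```

Off-diagonal mass then points directly at the pairs of concepts whose definitions or annotation guidelines need revisiting.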
An example confusion matrix from a collection of enterprise contracts labeled by two SMEs is depicted below. A quick glance at the diagonal elements reveals less than perfect agreement between the SMEs. Further, such a matrix can be used to identify the portions of the Ontology that are most often confused (e.g., labelers often confuse Obligation Clauses for the Supplier with Exclusion Clauses for the Supplier). This information can be used to determine whether (a) these concepts should be merged or (b) the definitions/annotation guidelines for these concepts need to be revisited.
Conventional wisdom says that the more labeled data used in training, the better the model. This wisdom, however, rests on the strong assumption that large amounts of labeled data can be obtained cheaply, and reality is far from that. Ontological concepts in sophisticated enterprise applications require SMEs with extensive, and thus expensive, training, so obtaining very large amounts of labeled data may be infeasible. Therefore, for a fixed budget of labeled data, there is a balance to be struck between how much is used for model training versus evaluation. Using too much of the available labeled data for training can starve model evaluation, leading to under-tested and possibly unreliable models.
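One way to reason about how much of the budget evaluation needs is the statistical reliability of the measured quality. As a back-of-the-envelope sketch (a standard normal approximation to the binomial, not a method from the post), the half-width of a 95% confidence interval on a measured accuracy shrinks only with the square root of the evaluation set size:

```python
import math

def accuracy_ci_halfwidth(n_eval, p=0.5, z=1.96):
    """Approximate 95% confidence half-width for an accuracy estimate
    measured on n_eval held-out examples (worst case p = 0.5)."""
    return z * math.sqrt(p * (1 - p) / n_eval)

# Starving evaluation makes the quality estimate unreliable:
# 100 held-out examples pin accuracy only to within about +/- 9.8%,
# while 1,000 examples narrow that to about +/- 3.1%.
small = accuracy_ci_halfwidth(100)    # ~0.098
large = accuracy_ci_halfwidth(1000)   # ~0.031
```

A deployment decision based on a 2-point quality difference is meaningless if the evaluation set can only resolve differences of 10 points.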
A practical and very effective approach is to incorporate knowledge from SMEs into the model, either directly as rules or indirectly into the structure of the model. When knowledge is captured as rules, two considerations are critical: (A) each rule should be at a sufficient level of abstraction that it generalizes well (see the figure below for an example), and (B) rules should be expressed in an appropriate language for expansion and maintainability [2,3]. Knowledge-based models can certainly be augmented with appropriate statistical models trained from data. Incorporating knowledge into the structure of models is an emerging research topic in the general AI area (e.g., [1,5]).
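The abstraction point can be illustrated with two hypothetical rules for spotting Exclusion clauses (plain regular expressions here, rather than a dedicated rule language such as the ones cited in [2,3]). The phrasings and party names are invented for illustration.

```python
import re

# A rule pinned to one literal phrasing generalizes poorly:
literal_rule = re.compile(r"Supplier will not be required to")

# Lifting the rule to a higher level of abstraction -- any party name,
# any modal verb, common negation variants -- lets one rule cover many
# surface forms:
PARTY = r"(?:Supplier|Buyer|Customer|Licensee)"
abstract_rule = re.compile(
    rf"{PARTY}\s+(?:will|shall|is)\s+not\s+(?:be\s+)?"
    r"(?:required|obligated|obliged)\s+to",
    re.IGNORECASE,
)

sentences = [
    "Supplier will not be required to provide on-site support.",
    "The Buyer shall not be obligated to accept partial shipments.",
]
hits_literal = [bool(literal_rule.search(s)) for s in sentences]
hits_abstract = [bool(abstract_rule.search(s)) for s in sentences]
# hits_literal  -> [True, False]   (misses the second phrasing)
# hits_abstract -> [True, True]
```

Expressing such rules in a proper declarative language, rather than raw regexes, is what makes them maintainable as the ontology grows.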
Model development is a relatively straightforward step in which data scientists build models via the standard procedures of train/test splits, k-fold cross-validation, and/or Monte Carlo techniques on the training data. A key challenge is that models often need to be built initially with a small amount of labeled data and then gradually improved through error analysis as more training data and feedback are obtained after deployment. As such, continuous model improvement and learning are essential to the AI life cycle, as we discuss later in this post.
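For readers less familiar with the standard procedure, k-fold cross-validation can be sketched in plain Python (the toy majority-label "model" below is purely a placeholder to keep the example self-contained):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle n example indices and split them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(examples, labels, train_fn, score_fn, k=5):
    """Train on k-1 folds, score on the held-out fold, average."""
    folds = k_fold_indices(len(examples), k)
    scores = []
    for held_out in folds:
        train_idx = [j for f in folds if f is not held_out for j in f]
        model = train_fn([examples[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        scores.append(score_fn(model,
                               [examples[j] for j in held_out],
                               [labels[j] for j in held_out]))
    return sum(scores) / k

# Placeholder model: always predict the majority training label.
def train_fn(X, y):
    majority = max(set(y), key=y.count)
    return lambda x: majority

def score_fn(model, X, y):
    return sum(model(x) == yi for x, yi in zip(X, y)) / len(y)

examples = list(range(20))
labels = ["Obligation"] * 15 + ["Exclusion"] * 5
avg_accuracy = cross_validate(examples, labels, train_fn, score_fn, k=5)
```

In practice a library implementation (e.g., scikit-learn's `KFold`) would replace this loop; the point is that every example serves for evaluation exactly once, which matters when labeled data is scarce.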
Once a model has been built, the next step is to evaluate it. Note that, typically, multiple competing models (with different parameters) are built during model development. For each model, both quality (e.g., precision/recall/F1-measure) and runtime efficiency (e.g., CPU/memory usage) need to be evaluated.
To ensure meaningful quality metrics (e.g., precision/recall/F1-measure), the definition of correctness for individual concepts in the Ontology has to be carefully and precisely specified. Consider the following example sentence from a contract.
As can be seen, this sentence contains both a Privacy Clause (as indicated by “promptly return or erase all Personal Data stored in its internal systems”) and a Term & Termination Clause (as indicated by “On termination of this Agreement”). In such a case, how correct is a model that identifies only one of the clauses? Is there an order in which the clauses in each sentence should be identified? Such details are critical to the success of AI in an enterprise, and any evaluation metric needs to take them into consideration.
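One common answer (a standard micro-averaged multi-label metric, offered here as one reasonable choice rather than the metric Watson uses) is to count each (sentence, label) pair as its own instance, so a model that finds one of two clauses gets partial credit instead of an all-or-nothing score:

```python
def multilabel_prf(gold, pred):
    """Micro-averaged precision/recall/F1 over sets of clause labels,
    one set per sentence."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)   # clauses found and correct
        fp += len(p - g)   # clauses predicted but wrong
        fn += len(g - p)   # clauses missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# The example sentence carries two clauses; the model finds only one:
gold = [{"Privacy", "Term & Termination"}]
pred = [{"Privacy"}]
p, r, f1 = multilabel_prf(gold, pred)
# p == 1.0, r == 0.5: the model is precise but misses a clause.
```

Whatever metric is chosen, the key is that the treatment of multi-clause sentences is written down and agreed upon before models are compared.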
The results of evaluation are used to determine which model to deploy in production, and there needs to be appropriate governance around this decision. The overall quality of the model and its runtime performance characteristics are two important factors. However, a third, critical component, which is often overlooked, is the consistency/stability of the model over time. Enterprise applications use these production models in their continuous processes, and, as discussed, models evolve over time due to the availability of new data, corrections to already-labeled data, modified definitions of concepts, and so on. For enterprise applications to seamlessly adopt newer models, certain stability guarantees are required. Lack of appropriate consistency/stability across model versions can come as a surprise and hamper the use of AI models in mission-critical applications. We explain this further in the next section, along with some examples.
The Data Science Process described above builds and publishes a preliminary model into production. However, as emphasized before, models are not static entities and they need to continue to adapt and improve. Such improvement is dictated by two major inputs:
This phase is very similar to handling labeled data, with one important difference: while updating a model in production (which will likely already be in use by enterprise customers), it is imperative that the updated model satisfy certain consistency requirements (described further under Error Analysis below).
Explicit feedback collection is the step in which the application actively solicits feedback from its users. This feedback concerns the application semantics and consequently directly involves the Subject Matter Experts (SMEs). In our contract example, the feedback is about the Ontology concepts with which the model has labeled each sentence. The SME can mark model mistakes and indicate the correct label, potentially along with comments explaining the feedback, using a visual tool similar to the screenshots below.
The tool can also suggest additional instances similar to the mistakes to help the SME provide further feedback. Such feedback is then analyzed (Error Analysis) and incorporated appropriately (Feedback Incorporation) to improve the underlying models.
Enterprise applications require a certain level of guarantee from the underlying processes to ensure continuous operation. One such desired guarantee is that none of the instances a previous model handled correctly is disturbed by a newer version of the model; in simple terms, newer versions of a model should only correct the mistakes of previous versions. We refer to this as “Instance-Level Consistency”, to distinguish it from conventional precision/recall and F1 measures, which are collection-level statistics. Enterprise AI needs to strive to provide such consistency; this burden lies with the Data Science Process, and a large component of it is error analysis. To enable such governance, data scientists need to understand and classify the errors of the current model so that only the appropriate parts of the model are updated.
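The distinction between instance-level and collection-level views can be made concrete with a small check (a hypothetical pre-deployment gate, not a description of Watson internals). Note how the regression is invisible to the aggregate metric:

```python
def instance_level_regressions(gold, old_pred, new_pred):
    """Instances the old model got right but the new model gets wrong.
    Under instance-level consistency, a model update should only fix
    mistakes, so this list must be empty before deployment."""
    return [i for i, g in enumerate(gold)
            if old_pred[i] == g and new_pred[i] != g]

gold     = ["Obligation", "Exclusion", "Obligation", "Privacy"]
old_pred = ["Obligation", "Obligation", "Obligation", "Privacy"]
new_pred = ["Obligation", "Exclusion", "Exclusion", "Privacy"]

# Collection-level accuracy is unchanged (3/4 before and after), yet the
# update regresses instance 2 while fixing instance 1:
regressions = instance_level_regressions(gold, old_pred, new_pred)
# regressions == [2]
```

A deployment gate built on such a check rejects model updates whose aggregate scores look fine but which silently break results that downstream business processes already depend on.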
Below is a screenshot of the tool used in IBM Watson for model error analysis in enterprise document understanding. It visually displays the different types of errors and allows a data scientist to classify model mistakes into one of those types. As can be seen, the tool also enables drilling down from an error to its location in the document so that the root cause can be determined. Such a tool can significantly improve the productivity of data scientists.
Model errors broadly fall into two major categories: (a) True Model Errors, where the mistake truly lies with the model, and (b) SME Label Errors, where the mistake is due to mislabeling by the SME. Differentiating between these two is critical to ensure that the model is not unfairly penalized for errors caused by SME negligence or confusion. True model errors also need to be classified further to ensure that only the appropriate parts of the model are corrected. Watson’s error analysis tool, discussed above, enables data scientists to classify errors into a customized set of error categories. As shown in the previous screenshots, these can be precision errors, which are themselves further classified into List Handling Error, Semantic Parser Error, and so on.
The categorized true model errors and SME label errors are then used either to improve the model or, in the case of SME errors, to take appropriate action.
Below is a list of common actions taken to incorporate different types of feedback determined by Error Analysis.
True Model Errors
(1) Recall Errors: These are due to a lack of labeled data; the appropriate action is to obtain more labeled data associated with the recall errors.
(2) Precision Errors: The detailed characterization of precision errors (as described above) should provide adequate input and corrected labeled data for model building to fix the appropriate parts of the model. It cannot be emphasized enough that the corrected model must adhere, as closely as possible, to the Instance-Level Consistency described above.
SME Errors, in turn, can be due to two causes: (a) definitional disagreement and (b) negligence. The implications of SME disagreement and the need for explicitly written definitions of each concept in the Ontology have already been discussed above.
In this article, we presented the complexity of Content Understanding for building enterprise-level AI applications, based on our experience building several IBM Watson AI services, and discussed the associated challenges throughout the lifecycle of a real-world AI application. Most of these challenges center around data and knowledge, offering opportunities to our data management community. This post is not the first to discuss data management challenges associated with AI and machine learning: multiple recent SIGMOD blogs and tutorials, such as [6], cover data management challenges related to machine learning in the real world. We hope that our focus on Content Understanding brings new perspectives on the role of data management systems in supporting AI for the enterprise.
[1] Relational inductive biases, deep learning, and graph networks. Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Caglar Gulcehre, Francis Song, Andrew Ballard, Justin Gilmer, George Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matt Botvinick, Oriol Vinyals, Yujia Li, Razvan Pascanu. arXiv:1806.01261 [cs.LG], 2018.
[2] SystemT: an algebraic approach to declarative information extraction. Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick R. Reiss, Shivakumar Vaithyanathan. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010.
[3] Rule-based Information Extraction is Dead! Long Live Rule-based Information Extraction Systems! Laura Chiticariu, Yunyao Li, and Frederick Reiss. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013.
[4] A relational model of data for large shared data banks. E. F. Codd. Commun. ACM 13, 6 (June 1970), 377-387.
[5] Learning Explanatory Rules from Noisy Data. Richard Evans, Edward Grefenstette. arXiv:1711.04574 [cs.NE], 2017.
[6] Data Management Challenges in Production Machine Learning. Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. In Proceedings of the 2017 ACM International Conference on Management of Data, 2017.
Yunyao Li is a Senior Research Manager with IBM Research – Almaden, where she manages the Scalable Knowledge Intelligence department. She is also a Master Inventor, a member of the IBM Academy of Technology, and a member of the New Voices program of the National Academies. Her expertise is in the interdisciplinary areas of natural language processing, databases, human-computer interaction, machine learning, and information retrieval. Her contributions in these areas have led to over 50 research publications, more than 20 patents granted or filed, multiple graduate-level courses (including 2 Massive Open Online Courses), and billions in revenue generated from technology transfer. Her current research focuses on taming unstructured and semi-structured content to enable the building of a new generation of AI applications for the enterprise.
Shivakumar (Shiv) Vaithyanathan is an IBM Fellow in IBM Watson responsible for building large-scale Document Understanding systems. Shiv is responsible for setting architectural directions, day-to-day development activities as well as coordinating with IBM Research across the world. Prior to that Shiv founded and managed the NLP & Machine Learning department in IBM Almaden which has been responsible for significant technology transfer to IBM SWG as well as open sourcing code (SystemML). Shiv has authored or co-authored more than 30 papers and has been granted multiple patents.
Copyright © 2018, Yunyao Li and Shivakumar Vaithyanathan. All rights reserved.