Figure 1 shows a portion of a relational table contained in a real, large information system. The table concerns the customers of an organization, where each row stores data about a single customer. The first column contains her code (if the code is negative, then the record refers to a special customer, called “fictitious”), columns 2 and 3 specify the time interval of validity for the record, ID_GROUP indicates the group the customer belongs to (if the value of FLAG_CP is “S”, then the customer is the leader of the group, and if FLAG_CF is “S”, then the customer is the controller of the group), FATTURATO is the annual turnover (but the value is valid only if FLAG_FATT is “S”). Obviously, each notion mentioned above (like “fictitious”, “group”, “leader”, etc.) has a specific meaning in the organization, and understanding such meaning is crucial if one wants to correctly manage the data in the table and extract information out of it. Similar rules hold for the other 47 columns that, for lack of space, are not shown in the figure.
Figure 1: A portion of the Customer table in a database of a large organization.
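The rules just described for the columns of Figure 1 can be written down explicitly. The following is a minimal sketch in Python, assuming a row is available as a dictionary. The column name CUSTOMER_CODE and the names CustomerView and interpret_row are hypothetical (the text does not name the first column), while ID_GROUP, FLAG_CP, FLAG_CF, FATTURATO and FLAG_FATT are taken from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CustomerView:
    """Business-level reading of one row of the Customer table."""
    code: int
    is_fictitious: bool
    group_id: Optional[int]
    is_group_leader: bool
    is_group_controller: bool
    annual_turnover: Optional[float]  # None when the stored value is not valid


def interpret_row(row: dict) -> CustomerView:
    """Apply the rules stated in the text to a single row."""
    code = row["CUSTOMER_CODE"]  # hypothetical column name
    turnover_is_valid = row.get("FLAG_FATT") == "S"
    return CustomerView(
        code=code,
        is_fictitious=code < 0,  # negative code: "fictitious" customer
        group_id=row.get("ID_GROUP"),
        is_group_leader=row.get("FLAG_CP") == "S",
        is_group_controller=row.get("FLAG_CF") == "S",
        annual_turnover=row.get("FATTURATO") if turnover_is_valid else None,
    )
```

Even this small function shows how much implicit knowledge a user needs before touching the data, and it covers only a handful of the columns.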
Those who have experience with large databases, or with databases that are part of large information systems, will not be surprised to see such complexity in a single data structure. Now, think of a database with many tables of this kind, and try to imagine a poor end user accessing such tables to extract useful information. The problem is even more severe if one considers that information systems in the real world use different (often many) heterogeneous data sources, both internal and external to the organization. Finally, if we add to the picture the (inevitable) need to deal with big data, and consider in particular the two v’s of “volume” and “velocity”, we can easily understand why effectively accessing, integrating and managing data in complex organizations is still one of the main issues faced by the IT industry today.
Issues in governing complex information systems
The above is a simple example motivating the claim that governing the resources (data, meta-data, services, processes, etc.) of modern information systems is still an outstanding problem. I would like to go into more detail on three important aspects related to this issue.
Accessing and querying data. Although the initial design of a collection of data sources might be adequate, corrective maintenance actions tend to re-shape them into a form that often diverges from the original structure. Also, they are often subject to changes so as to adapt to specific, application-dependent needs. Analogously, applications are continuously modified for accommodating new requirements, and guaranteeing their seamless usage within the organization is costly. The result is that the data stored in different sources and the processes operating over them tend to be redundant, mutually inconsistent, and obscure for large classes of users. So, query formulation often requires interacting with IT experts who know where the data are and what they mean in the various contexts, and can therefore translate the information need expressed by the user into appropriate queries. It is not rare to see organizations where this process requires domain experts to send a request to the data management staff and wait for several days (or even weeks, at least in some Public Administrations in Italy…) before they receive a (possibly inappropriate) query in response. In summary, it is often exceedingly difficult for end users to single out exactly the data that are relevant for them, even though they are perfectly able to describe their requirement in terms of business concepts.
Data quality. It is often claimed that data quality is one of the most important factors in delivering high-value information services. However, the above-mentioned scenario poses several obstacles to the goal of even checking data quality, let alone achieving a good level of quality in information delivery. How can we possibly specify data quality requirements, if we do not have a clear understanding of the semantics that data should carry? The problem is sharpened by the need to connect to external data, originating, for example, from business partners, suppliers, clients, or even public sources. Again, judging the quality of external data, and deciding whether to reconcile possible inconsistencies or simply add such data as different views, cannot be done without a deep understanding of their meaning. Note that data quality is also crucial for opening data to external organizations. The demand for greater openness is irresistible nowadays. Private companies are pushed to open their resources to third parties, so as to favor collaborations and new business opportunities. In public administrations, opening up data is underpinning public service reforms in several countries, by offering people informed choices that simply have not existed before, thus driving improvements in the services to citizens.
Process and service specification. Information systems are crucial artifacts for running organizations, and organizations rely not only on data, but also, for instance, on processes and services. Designing, documenting, managing, and executing processes is an important aspect of information systems. However, specifying what a process/service does, or which characteristics it is supposed to have, cannot be done correctly and comprehensively without a clear specification of which data the process will access, and how it will possibly change such data. The difficulties of doing that in a satisfactory way come from various factors, including the lack of modeling languages and tools for describing process and data holistically. However, the problems related to the semantics of data that we discussed above undoubtedly make the task even harder.
A (new?) paradigm
In the last five years, I (and my group in Roma) have been working on a new paradigm addressing these issues, based on the use of knowledge representation and reasoning techniques, and I want to share my excitement about it with the readers of this blog. The paradigm is called “Ontology-based Data Management” (OBDM), and requires structuring the information system into four layers.
The distinguishing feature of the whole approach is that users of the system will be freed from all the details of how to use the resources, as they will express their needs in the terms of the DKB. The system will reason about the DKB and the mappings, and will reformulate the needs in terms of appropriate calls to services provided by resources. Thus, for instance, a user query will be formulated over the domain ontology, and the system will reason upon the ontology and the mappings to call suitable queries over data sources that will compute the answers to the original user query.
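To make the reformulation step more concrete, here is a deliberately tiny sketch in Python. It is not the OBDM machinery itself: the ontology is reduced to subclass axioms, each mapping associates a concept with a single SQL query, and every concept name, table name and SQL string below is invented for illustration.

```python
# Hypothetical ontology: each pair (sub, sup) states that every sub is a sup.
SUBCLASS_AXIOMS = [
    ("GroupLeader", "Customer"),
    ("FictitiousCustomer", "Customer"),
]

# Hypothetical mappings: concept -> SQL query over a source returning its instances.
MAPPINGS = {
    "Customer": "SELECT id FROM crm_clients",
    "GroupLeader": "SELECT leader_id FROM group_registry",
    "FictitiousCustomer": "SELECT id FROM placeholder_accounts",
}


def subsumed_concepts(concept, axioms):
    """Return the concept plus every concept the axioms place below it."""
    found = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for sub, sup in axioms:
            if sup == current and sub not in found:
                found.add(sub)
                frontier.append(sub)
    return found


def rewrite(concept, axioms, mappings):
    """Reformulate 'all instances of concept' as a union of source queries."""
    relevant = sorted(subsumed_concepts(concept, axioms))
    parts = [mappings[c] for c in relevant if c in mappings]
    return "\nUNION\n".join(parts)


if __name__ == "__main__":
    # The user asks for customers; the system also consults the sources
    # that hold group leaders and fictitious customers.
    print(rewrite("Customer", SUBCLASS_AXIOMS, MAPPINGS))
```

The user only ever mentions Customer. Which sources are consulted, and how, follows from the axioms and the mappings, and that is the separation of concerns the paradigm aims for.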
As you can see, the heart of the approach is the DKB, and the core of the DKB is the ontology. So, what is new? Indeed, I can almost hear many of you saying: what is the difference with data integration (where the global schema plays the role of the ontology)? And what is the difference with conceptual modeling (where the conceptual schema plays the role of the ontology)? And what about Knowledge Representation in AI (where the axioms of the knowledge base play the role of the DKB)? The answer is simple: almost none. Indeed, OBDM builds on all the above disciplines (and others), but with the goal of going beyond what they currently provide for solving the problems that people encounter in the governance of complex information systems. At the same time, there are a few (crucial, at least for me) facts that make OBDM a novel paradigm to experiment with and study. Here is a list of the most important ones:
A lot more to do
A few research groups are experimenting with OBDM in practice (see, for example, the Optique IP project, financed by the Seventh Framework Program (FP7) of the European Commission). In Rome, we are involved in applied projects both with Public Administrations and with private companies. One of the experiences we are carrying out is with the Department of Treasury of the Italian Ministry of Economy and Finance. In this project, three ontology experts from our department worked with three domain experts for six months, and built an ontology of 800 elements, with 3000 DL-Lite axioms, and 800 mapping assertions to about 80 relational tables. The ontology is now used as a common framework for all the applications, and will constitute the main document specifying the requirements for the restructuring of the information system that will be carried out in the near future. We are actually lucky to live in Rome, not only because it is a magnificent city, but also because Italian Public Administrations, many of which are located in the Eternal City, provide perfect examples of all the problems that make OBDM interesting and potentially useful…
The first experiences we have conducted are very promising, but OBDM is a young paradigm, and therefore it needs attention and care. This means that there are many issues to be addressed to make it work effectively in practice. Let me briefly illustrate some of them. One big issue is how to build and maintain the ontology (and, more generally, the DKB). I know that this is one of the most important criticisms of all the approaches requiring a considerable modeling effort. My answer is that all modeling efforts are investments, and when we judge investments we should talk not only about costs, but also about benefits. Also, take into account that OBDM works in a “pay-as-you-go” fashion: users gain interesting advantages even with a very incomplete domain description, as the system can reason about an incomplete specification, and try to get the best out of it. Another important issue is evolution. Evolution in OBDM concerns not only the data at the sources (updates), but also the ontology and the mappings. Indeed, both the domain description and the resources continue to evolve, and all the components of the system should keep up with these modifications. Not surprisingly, this is one issue where more research is still desperately needed. Overall, the DKB and the mappings constitute the meta-data of the OBDM system, and in complex organizations such meta-data can be huge and difficult to control and organize. To use fashionable terminology, with OBDM we face not only the problem of Big Data, but also the problem of Big Meta-Data. Another issue that needs to be further studied and explored is the relationship between the static aspects and the dynamic aspects of the DKB, together with the problem of mapping processes and services specified at the conceptual level to computational resources in applications.
I really hope that this blog post has somehow drawn your attention to OBDM, and that you will consider looking at it more closely, for example by carrying out some experiments, or by doing research on at least some of its many open problems.
Blogger’s Profile:
Maurizio Lenzerini is a full professor in Computer Science and Engineering at the University of Rome La Sapienza, where he is leading a research group on Databases and Artificial Intelligence. His main research interests are in database theory, data and service integration, ontology languages, knowledge representation and reasoning, and component-based software development. He is a former Chair and a current member of the Executive Committee of ACM PODS (Principles of Database Systems). He is an ACM Fellow, an ECCAI (European Coordinating Committee for Artificial Intelligence) fellow, and a member of the Academia Europaea – The Academy of Europe.
Got your attention? Now that I have it, I would like to take a few minutes to discuss the role of limited attention and information overload in science. Attentive acts such as reading a scientific paper (or a tweet), answering an email, or watching a video require mental effort, and since the human brain’s capacity for effort is limited (by its oxygen and glucose consumption requirements), so is attention. Even if we could spend all of our time reading papers or answering emails, there is only so much we could read in 24 hours. In reality, we spend far less time attending to things, like science papers, before we get tired, bored, or distracted by other demands of our busy lives.
The situation is actually worse. Not only is attention limited, but we must also divide it among a rapidly proliferating number of information sources. Every minute on the Web thousands of new blog posts are written, hours of video are uploaded to YouTube, and hundreds of thousands of status updates are posted on Facebook and Twitter. The number of scientific papers posted to Arxiv.org (see figure) has grown steadily since its inception to more than 7000 a month! Even if you console yourself thinking that only a small fraction of papers is relevant to you, I am willing to bet that the number of papers submitted to the conferences you care about, not to mention the number of conferences and journals themselves, has also grown over the years. What this adds up to is a rather nasty case of information overload.
The information overload, coupled with limited attention, reduces the likelihood anyone will notice a specific paper (or another item of information). As a consequence, even real gems will often fail to attract attention, and fade from collective awareness as new items appear on the scene.
The collective neglect is apparent in the figure above, which reports the time to first citation versus the age of a paper published in the journals of the American Physical Society, a leading venue for publishing physics research. A newly published paper is very quickly forgotten. After a paper is a year old, its chances of getting discovered drop like a rock!
One of the puzzles of modern life is that with so much information created daily, people are increasingly consuming more of the same information. Every year, more people watch the same movies, read the same books and cite the same papers than in the previous year. With so many videos available on YouTube, it is a wonder that hundreds of millions have chosen to watch the “Gangnam Style” video instead.
More alarmingly, this trend has only gotten worse. The gap between those who are “rich” in attention and those who are “poor” has grown steadily. One way to measure attention inequality is to look at the distribution of the number of citations. The figure above shows the Gini coefficient of the number of citations received by physics papers in different decades. The Gini coefficient, a popular measure of inequality, is zero when all papers receive the same number of citations, and one when a single paper gets all the citations. Though the Gini coefficient of physics citations is already high in the 1950s, it manages to grow over the subsequent decades. What this means is that a shrinking fraction of papers is getting all the citations. Yes, the rich (in attention) just keep getting richer.
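For readers who want to try this on their own citation counts, here is a minimal sketch in Python of the Gini coefficient. It uses the standard formula for sorted data; the function name gini and the sample numbers are only illustrative and are not taken from the study. Note that for a finite sample of n papers the maximum value is (n - 1)/n, which approaches one as n grows.

```python
def gini(values):
    """Gini coefficient of a collection of non-negative numbers."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on sorted data, with i running from 1 to n:
    #   G = sum_i (2*i - n - 1) * x_i / (n * sum_i x_i)
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


if __name__ == "__main__":
    print(gini([5, 5, 5, 5]))    # every paper equally cited
    print(gini([0, 0, 0, 10]))   # a single paper takes all the citations
```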
Incidentally, inequality is rising not only in science, but also in other domains, presumably for the same reasons. Take, for example, movies. Though the total box office revenues have been rising steadily over the years, this success can be attributed to an ever-shrinking number of huge blockbusters. The figure above shows the Gini coefficient of box office revenues of 100 top-grossing movies that came out in different years. Again, inequality is rising, though not nearly as badly in Hollywood as among scientists!
When attention is scarce, the decisions about how to allocate it can have dramatic outcomes. Social scientists have discovered that people do not always rationally weigh alternatives, relying instead on a variety of heuristics, or cognitive shortcuts, to quickly decide between many options. Our study of scientific citations and social media has identified some common heuristics people use to decide which tweets to read or scientific papers to cite. It appears that information that is easy to find receives more attention. People typically read Web pages from the top down; therefore, items appearing at the top of the page have greater visibility. This is the reason why Twitter users are more likely to read and respond to recent messages from friends, which appear near the top of their Twitter stream. Older messages that are buried deep in the stream may never be seen, because users leave Twitter before reaching them.
Visibility also helps science papers receive more citations. A study by Paul Ginsparg, creator of the Arxiv.org repository of science papers, confirmed an earlier observation that articles that are listed at the top of Arxiv’s daily digest receive more citations than articles appearing in lower positions. A paper is also easier to find when other well-read papers cite it. Such indirect exposure increases a paper’s visibility, and the number of new citations it receives. However, being cited by a paper with a long bibliography will not result in many new citations, due to the greater effort required to find a specific item in a longer list. In fact, being cited by a review paper is a kiss of death. Not only is it newer, and scientists prefer to cite more recent papers, but also a review paper typically makes hundreds of references, decreasing the likelihood of discovery for any paper in this long list.
Mitigating Information Overload
No individual can keep pace with the growing deluge of information. The heuristics and mental shortcuts we use to decide what information to pay attention to can have non-trivial consequences on how we create and consume information. The result is not only growing inequality and possible neglect of high quality papers. I believe that information overload can potentially stifle innovation, for example, by creating inefficiencies in the dissemination of knowledge. Already the inability to keep track of relevant work (not only because there is so much more relevant work, but also because we need to read so much more to discover it) can lead researchers to unwittingly duplicate existing results, expending effort that may have been better spent on something else. It has also been noted that the age at which an inventor files his or her first patent has been creeping up, presumably because there is so much more information to digest before creating an innovation. A slowing pace of innovation, both scientific and technological, can threaten our prosperity.
Short of practicing unilateral disarmament by writing fewer papers, what can a conscientious scientist do about information overload? One way that scientists can compensate for information overload is by increasing their cognitive capacity via collaborations. After all, two brains can process twice as much information. A trend towards larger collaborations has been observed in all scientific disciplines (see Wuchty et al. “The Increasing Dominance of Teams in Production of Knowledge”), although coordinating social interactions that come with collaboration can also tax our cognitive abilities.
While I am a technological optimist, I do not see an algorithmic solution to this problem. Although algorithms could monitor people’s behavior to pick out items that receive more attention than expected, there is the danger that algorithmic prediction will become a self-fulfilling prophecy. For example, a paper that is highly ranked by Google Scholar will get lots of attention whether it deserves it or not. In the end, it may be better tools for coordinating social interactions of scientific teams, coupled with algorithms that direct collective attention to efficiently evaluate content, that will provide some relief from information overload. It had better be a permanent solution that scales with the continuing growth of information.
Blogger’s Profile:
Kristina Lerman is a Project Leader at the University of Southern California Information Sciences Institute and holds a joint appointment as a Research Associate Professor in the USC Computer Science Department. After a brief stint as a theoretical roboticist, she found her calling in blending together methods from physics, computer science and social science to address problems in social computing and social media analysis. She writes many papers that are greatly enjoyed by all of their twenty readers.
Big Data should be Interesting Data!
There are various definitions of Big Data; most center around a number of V’s like volume, velocity, variety, veracity – in short: interesting data (interesting in at least one aspect). However, when you look into research papers on Big Data, in SIGMOD, VLDB, or ICDE, the data that you see here in experimental studies is utterly boring. Performance and scalability experiments are often based on the TPC-H benchmark: completely synthetic data with a synthetic workload that has been beaten to death for the last twenty years. Data quality, data cleaning, and data integration studies are often based on bibliographic data from DBLP, usually old versions with less than a million publications, prolific authors, and curated records. I doubt that this is a real challenge for tasks like entity linkage or data cleaning. So where’s the – interesting – data in Big Data research?
Surely, companies have their own interesting data, and industrial labs have access to such data and real-life workloads. However, all this is proprietary and out of reach for academic research. Therefore, many researchers resort to the good old TPC-H benchmark and DBLP records and call it Big Data. Insights from TPC-H and DBLP are, however, usually not generalizable to interesting and truly challenging data and workloads. Yes, there are positive exceptions; I just refer to a general trend.
Looking Across the Fence: Experimental Data in other Research Communities
Now that I have alerted you, let me be constructive. I have also worked in research communities other than database systems: information retrieval, Web and Semantic Web, knowledge management (yes, a bit of AI), and recently also computational linguistics (a.k.a. NLP). These communities have a different mindset towards data resources and their use in experimental work. To them, data resources like Web corpora, annotated texts, or inter-linked knowledge bases are vital assets for conducting experiments and measuring the progress in the field. These are not static benchmarks that are defined once every ten years; rather, relevant resources are continuously crafted and their role in experiments is continuously re-thought. For example, the IR community has new experimental tasks and competitions in the TREC, INEX, and CLEF conferences each year. Computational linguistics has an established culture of including the availability of data resources and experimental data (such as detailed ground-truth annotations) in the evaluation of submissions to their top conferences like ACL, EMNLP, CoNLL, and LREC. Review forms capture this aspect as an important dimension for all papers, not just for a handful of specific papers tagged Experiments & Analyses.
Even the Semantic Web community has successfully created a huge dataset for experiments: the Web of Linked Data consisting of more than 30 Billion RDF triples from hundreds of data sources with entity-level sameAs linkage across sources. What an irony: ten years ago we database folks thought of Semantic Web people as living in the ivory tower, and now they have more data to play with than we (academic database folks) can dream of.
Towards a Culture Shift in Our Community
Does our community lack the creativity and agility that other communities exhibit? I don’t think so. Rather I believe the problem lies in our publication and experimental culture. Aspects of this topic were discussed in earlier posts on the SIGMOD blog, but I want to address a new angle. We have over-emphasized publications as an achievement by itself: our community’s currency is the paper count rather than the intellectual insight and re-usable contribution. Making re-usable software available is appreciated, but it’s a small point in the academic value system when it comes to hiring, tenure, or promotion decisions. Contributing data resources plays an even smaller role. We need to change this situation by rewarding work on interesting data resources (and equally on open-source software): compiling the data, making it available to the community, and using it in experiments.
There are plenty of good starting points. The Web of Linked Data, with general-purpose knowledge bases (DBpedia, Freebase, Yago) and a wealth of thematically focused high-quality sources (e.g., musicbrainz, geospecies, openstreetmap, etc.), is a great opportunity. This data is huge, structured but highly heterogeneous, and includes substantial parts of uncertain or incomplete nature. Internet archives and Web tables (embedded in HTML pages) are further examples; enormous amounts of interesting data are easily and legally available by crawling or download. Finally, in times when energy, traffic, environment, health, and general sustainability are key challenges on our planet, more and more data by public stakeholders is freely available. Large amounts of structured and statistical data can be accessed at organizations like OECD, WHO, Eurostat, and many others.
Merely pointing to these opportunities is not enough. We must create stronger incentives for papers to provide new interesting data resources and open-source software. The least we can do is to extend review reports to include the contribution of novel data and software. A more far-reaching step is to make data and experiments an essential part of the academic currency: how many of your papers contributed data resources, how many contributed open-source software? This should matter in hiring, tenure, and promotion decisions. Needless to say, all this applies to non-trivial, value-adding data resource contributions. Merely converting a relational database into another format is not a big deal.
I believe that computational linguistics is a great role model for experimental culture and the value of data. Papers in premier conferences earn extra credit when accompanied with data resources, and there are highly reputed conferences like LREC which are dedicated to this theme. Moreover, papers of this kind or even the data resources themselves are frequently cited. Why don’t we, the database community, adopt this kind of culture and give data and data-driven experiments the role that they deserve in the era of Big Data?
Is the Grass Always Greener on the Other Side of the Fence?
Some people may argue that rapidly changing setups for data-driven experiments are not viable in our community. In the extreme, every paper could come with its own data resources, making it harder to ensure the reproducibility of experimental results. So we had better stick to established benchmarks like TPC-H and DBLP author cleaning. This is the opponent’s argument. I think the argument that more data resources hinder repeatability is flawed and merely a cheap excuse. Rather, a higher rate of new data resources and experimental setups goes very well with calling upon the authors’ obligation to ensure reproducible results. The key is to make the publication of data and full details of experiments mandatory. This could be easily implemented in the form of supplementary material that accompanies paper submissions and, for accepted papers, would also be archived in the publication repository.
Another argument could be that Big Data is too big to effectively share. However, volume is only one of the criteria for making a dataset Big Data, that is, interesting for research. We can certainly make 100 Gigabytes available for download, and organizations like NIST (running TREC), LDC (hosting NLP data), and the Internet Archive prove that even Terabytes can be shared by asking interested teams to pay a few hundred dollars for shipping disks.
A caveat that is harder to counter is that real-life workloads are so business-critical that they cannot possibly be shared. Yes, there were small scandals about query-and-click logs from search engines as they were not properly anonymized. However, the fact that engineers did not do a good job in these cases does not mean that releasing logs and workloads is out of the question. Why would it be impossible to publish a small representative sample of analytic queries over Internet traffic data or advertisement data? Moreover, if we focus on public data hosted by public services, wouldn’t it be easy to share frequently posed queries?
Finally, a critical issue to ponder is the position of industrial research labs. In the SIGMOD repeatability discussion a few years ago, they made the point that software cannot be disclosed. Making experimental data available is a totally different issue, and would actually avoid the problem with proprietary software. Unfortunately, we sometimes see papers from industrial labs that show impressive experiments, but give neither details nor data, and leave zero chance for others to validate the papers’ findings. Such publications that crucially hinge on non-disclosed experiments violate a major principle of good science: the falsifiability of hypotheses, as formulated by the Austrian-British philosopher Karl Popper. So what should industrial research groups do (in my humble opinion)? They should use public data in experiments and/or make their data public (e.g., in anonymized or truncated form, but in the same form that is used in the experiments). Good examples in the past include the N-gram corpora that Microsoft and Google released. Papers may use proprietary data in addition, but when a paper’s contribution lives or dies with a large non-disclosed experiment, the paper cannot be properly reviewed by the open research community. For such papers, which can still be insightful, conferences have industrial tracks.
Last but not least, who could possibly act on this? Or is all this merely public whining, without addressing any stakeholders? An obvious answer is that the steering boards and program chairs of our conferences should reflect and discuss these points. It should not be a complex maneuver to extend the reviewing criteria for the research tracks of SIGMOD, VLDB, ICDE, etc. This would be a step in the right direction. Of course, truly changing the experimental culture in our community and influencing the scholarly currency in the academic world is a long-term process. It is a process that affects all of us, and should be driven by each of you. Give this some thought when writing your next paper with data-intensive experiments.
The above considerations are food for thought, not a recipe. If you prefer a concise set of tenets and recommendations at the risk of oversimplification, here is my bottom line:
Overall, we need a culture shift to encourage more work on interesting data for experimental research in the Big Data wave.
Blogger’s Profile:
Gerhard Weikum is a Research Director at the Max-Planck Institute for Informatics (MPII) in Saarbruecken, Germany, where he is leading the department on databases and information systems. He is also an adjunct professor in the Department of Computer Science of Saarland University in Saarbruecken, Germany, and he is a principal investigator of the Cluster of Excellence on Multimodal Computing and Interaction. Earlier he held positions at Saarland University in Saarbruecken, Germany, at ETH Zurich, Switzerland, at MCC in Austin, Texas, and he was a visiting senior researcher at Microsoft Research in Redmond, Washington. He received his diploma and doctoral degrees from the University of Darmstadt, Germany.
Fifty years ago a small team working to automate the business processes of the General Electric Low Voltage Switch Gear Department in Philadelphia built the first functioning prototype of a database management system. The Integrated Data Store was designed by Charles W. Bachman, who later won the ACM’s Turing Award for the accomplishment. He was the first Turing Award winner without a Ph.D., the first with a background in engineering rather than science, and the first to spend his entire career in industry rather than academia.
The exact anniversary of IDS is hard to pin down. Detailed functional specifications for the system were complete by January 1962, and Bachman was presenting details of the planned system to GE customers by May of that year. It is less clear from archival materials when the system first ran, but Bachman’s own recent history of IDS suggests that a prototype was operational by the end of that year.
According to this May 1962 presentation, initial implementation of IDS was expected to be finished by December 1962, with several months of field testing and debugging to follow. Image courtesy of Charles Babbage Institute.
The technical details of IDS, Bachman’s life story, and the context in which it arose have all been explored elsewhere in some detail. He also founded a public company, played a leading role in formulating the OSI seven layer model for data communications, pioneered online transaction processing, and devised the first data modeling notation. Here I am focused on two specific questions:
(1) why do we view IDS as the first database management system, and
(2) what were its similarities and differences with later systems?
There will always be an element of subjectivity in such judgments about “firsts,” particularly as IDS predated the concept of a database management system and so cannot be compared against definitions from the time period. I have elsewhere explored the issue in more depth, stressing the way in which IDS built on early file management and report generation systems and the further evolution of database ideas over the next decade. As a fusty historian I value nuance and am skeptical of the idea that any important innovation can be fully understood by focusing on a single moment of invention.
However, if any system deserves the title of “first database management system” then it is clearly IDS. It served as a model for the earliest definitions of “data base management system” and included most of the core capabilities later associated with the concept.
What was IDS For?
Bachman was not, of course, inspired to create IDS as a contribution to the database research literature. For one thing there was no database research community. At the start of the 1960s computer science was beginning to emerge as an academic field, but its early stars focused on programming language design, theory of computation, numerical analysis, and operating system design. The phrase “data base” was just entering use but was not particularly well established and Bachman’s choice of “data store” would not have seemed any more or less familiar at the time. In contrast to this academic neglect, the efficient and flexible handling of large collections of structured data was the central challenge for what we would now call corporate information systems departments, and was then called business data processing.
During the early 1960s the hype and reality of business computing diverged dramatically. Consultants, visionaries, business school professors, and computer salespeople had all agreed that the best way to achieve real economic payback from computerization was to establish a “totally integrated management information system.” This would integrate and automate all the core operations of a business, ideally with advanced management reporting and simulation capabilities built right in. The latest and most expensive computers of the 1960s had new capabilities that seemed to open the door to a more aggressive approach. Compared to the machines of the 1950s they had relatively large memories, featured disk storage as well as tape drives, could process data more rapidly, and some had even been used to drive interactive terminals in specialized applications. Unfortunately the reality of data processing changed much more slowly, and remained focused on simple administrative applications that batch processed large files of records to accomplish discrete tasks such as weekly payroll processing, customer statement generation, or accounts payable reports.
Many companies announced their intention to build totally integrated management information systems, but few ever claimed significant success. A modern reader would not be shocked to learn that firms were unable to create systems of comparable scope to today’s Enterprise Resource Planning and data warehouse projects using computers with perhaps the equivalent of 64KB of memory, no real operating system, and a few megabytes of disk storage. Still, even partially integrated systems covering significant portions of a business with flexible reporting capabilities would have real value. The biggest challenges to even modest progress towards this goal were the sharing of data between applications and the effective use of random access disk storage by application programmers.
Data processing techniques had evolved directly from those used with pre-computer mechanical punched card machines. The concepts of files, fields, keys, grouping, merging data from two files, and the hierarchical combination of master and detail records within a single file all predated electronic computers. These worked with magnetic tape much as they had done with punched cards, except that sorting was actually much harder with tape. Getting a complex job done might involve dozens of small programs and the generation of many working tapes full of intermediate data. These banks of whirring tape drives provided computer centers with their main source of visual interest in the movies of the era. However the formats of tape files were very inflexible and were usually fixed by the code of the application programs working with the data. Every time a field was added or changed all the programs working with the file would need to be rewritten. But if applications were integrated, for example by having order records from the sales accounting system automatically exported as input for the production scheduling application, then the resulting web of dependencies would make it ever harder to carry out even minor changes in response to shifting business needs.
This 1962 diagram, drawn by Stanley Williams, sketched the complex dependencies between different records involved in the production planning process. Courtesy Charles Babbage Institute.
The other key challenge was making effective use of random access storage in business application programs. Sequential tape storage was conceptually simple, and the tape drives themselves provided some intelligence to aid programmers in reading or writing records. But the only really practical computer applications were batch-oriented because searching a tape to find or update a particular record was too slow to be practical. Instead, master files were periodically updated with accumulated data or read through to produce reports. The arrival of disk storage in the early 1960s theoretically made it possible to apply updates one at a time as new data came in, or to create interactive systems that could respond to requests immediately. A programmer could easily instruct the drive to pull data from any particular platter or track, but the hard part was figuring out where on the disk the desired record could be found. Harnessing the power of the new technology meant finding ways to order, insert, delete, or search for records that did not simply replicate the sequential techniques used with tape. Solutions such as indexing, inverted files, hashing, linked lists, chains and so on were quickly devised, but these were relatively complex to implement and demanded expert judgment to select the best method for a particular task. In addition, application programmers were beginning to shift from assembly language to high level languages such as COBOL. Business oriented languages included high level support for working with structured data in tape files but lacked comparable support for random access storage. Without significant disk file management support from the rudimentary operating systems of the era, only elite programmers could hope to create an efficient random access application.
This image, from a 1962 internal General Electric document, conveyed the idea of random access storage using a set of “pigeon holes” in which data could be placed. Courtesy of Charles W. Bachman.
IDS was intended to substantially solve these two problems, so that applications could be integrated to share data files and ordinary programmers could effectively develop random access applications using high level languages. Bachman designed it to meet the needs of an integrated systems project run as an experimental prototype within General Electric by the group of systems-minded specialists he was working for at its corporate headquarters. General Electric had many factories spread over its various divisions, and could not produce a different integrated system for each one. Furthermore it was entering the computer business, and recognized that a flexible and generic integrated system based on disk storage would be a powerful tool in selling its machines to other companies.
Was IDS a Data Base Management System?
IDS carried out what we still consider the core task of a database management system by interposing itself between application programs and the files in which they stored data. Programs could not manipulate data files directly, instead making calls to IDS so that it would perform the requested operation on their behalf.
Like modern database management systems, IDS explicitly stored and manipulated metadata about the records and their relationships, rather than expecting each application program to understand and respect the format of every data file it worked with. It could enforce relationships between different record types, and would protect database integrity. Database designers would specify indexes and other details of record organization to boost performance based on expected usage patterns. However, the first versions did not include a formal data manipulation language. Instead of being defined through textual commands the metadata was punched onto specially formatted input cards. A special command told IDS to read and apply this information.
IDS was designed to be used with high level programming languages. In the initial prototype version, operational in 1962, this was General Electric’s own GECOM language, though performance and memory concerns drove Bachman’s team to shift to assembly language for the application programming in a higher performance version completed in 1964. Part of IDS remained resident in memory while application programs were executed. Calls to IDS operations such as store, retrieve, modify, and delete were interpreted at runtime against the latest metadata and then executed. As high level languages matured and memory grew less scarce, later versions of IDS were oriented towards application programs written in COBOL.
This provided a measure of what is now called data independence for programs. If a file was restructured to add fields or modify their length then the programs using it would continue to work properly. Files could be moved around and reorganized without rewriting application programs. That made running different application programs against the same database much more feasible. IDS also included its own system of paging data in and out of memory, to create a virtual memory capability transparent to the application programmer.
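The idea of data independence can be illustrated with a small sketch. It is not a reconstruction of IDS: the `RecordStore` class, the fixed-width layout, and the field names are all assumptions made for illustration. The only point is that a program asks for a field by name, the store resolves its position from metadata on every call, and so restructuring the file does not break the program.

```python
class RecordStore:
    """Toy store keeping records as fixed-width strings plus metadata.

    Hypothetical illustration only; not how IDS laid out its records.
    """

    def __init__(self, fields):
        # fields: list of (name, width) pairs describing the record layout
        self.fields = list(fields)
        self.rows = []

    def _pack(self, values):
        # Pad or truncate each value to its declared width.
        return "".join(
            str(values.get(name, "")).ljust(width)[:width]
            for name, width in self.fields
        )

    def _unpack(self, raw):
        out, pos = {}, 0
        for name, width in self.fields:
            out[name] = raw[pos:pos + width].rstrip()
            pos += width
        return out

    def store(self, **values):
        self.rows.append(self._pack(values))
        return len(self.rows) - 1  # record number

    def retrieve(self, recno, field):
        # The field position is looked up in the metadata on every call.
        return self._unpack(self.rows[recno])[field]

    def add_field(self, name, width):
        # Restructure the file: rewrite every row under the new layout.
        old = [self._unpack(r) for r in self.rows]
        self.fields.append((name, width))
        self.rows = [self._pack(v) for v in old]


def customer_name(store, recno):
    # An "application program" that knows field names, not offsets.
    return store.retrieve(recno, "name")


store = RecordStore([("id", 4), ("name", 12)])
rec = store.store(id=17, name="Acme Tools")
before = customer_name(store, rec)
store.add_field("region", 6)        # the file is restructured
after = customer_name(store, rec)   # the program itself is unchanged
print(before, after)
```

A program that had hard-coded the byte offsets of a tape record would instead have needed rewriting after `add_field`, which is the dependency problem described earlier.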
The concept of transactions is fundamental to modern database management systems. Programmers specify that a series of interconnected updates must take place together, so that if one fails or is undone they all are. IDS was also transaction oriented, though not in exactly the same sense. It took over the entire computer, which had only 8,000 words of memory. Bachman devised an innovative transaction processing system, which he called the Problem Controller. Bachman’s original version of IDS was not loaded by application programs when needed to handle a data operation. Instead IDS reversed the relationship: it ran when the computer booted and loaded application programs as needed. Only one application program ran at a time. Requests from users to run particular programs were read from “problem control cards” and buffered as IDS records. The computer worked its way through the queue of requests, updating it after each job was finished. By 1965 an improved version of this system was in use at Weyerhaeuser, on a computer hooked up to a national teletype network. Requests for reports and data submissions were inserted directly into the queue by remote users.
Bachman’s original prototypes lacked strong backup and recovery systems, key features of later database management systems, but this was added as early as 1964 when IDS was first being prepared as a package for distribution to General Electric’s customers. A recovery tape logged memory pages modified by each transaction, so that the database could be restored to a consistent state if something went wrong before the transaction was completed. The same tape served as an incremental backup of changes since the last full backup.
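The recovery idea can be sketched in a few lines. The sketch below assumes the log holds before-images of modified pages and covers only the undo side; the source does not say exactly what IDS wrote to its recovery tape, and the `PagedDatabase` class and its method names are hypothetical.

```python
class PagedDatabase:
    """Toy page store with an undo log.

    Assumption: the log keeps before-images. The source does not say
    what IDS actually wrote to its recovery tape.
    """

    def __init__(self, pages):
        self.pages = dict(pages)   # page number -> contents
        self.log = []              # (transaction id, page number, before-image)

    def write(self, txn, page_no, new_contents):
        # Log the old contents before overwriting the page.
        self.log.append((txn, page_no, self.pages.get(page_no)))
        self.pages[page_no] = new_contents

    def commit(self, txn):
        # Once a transaction completes, its undo entries are not needed.
        self.log = [entry for entry in self.log if entry[0] != txn]

    def rollback(self, txn):
        # Undo in reverse order so the oldest before-image is applied last.
        for t, page_no, before in reversed(self.log):
            if t == txn:
                if before is None:
                    self.pages.pop(page_no, None)
                else:
                    self.pages[page_no] = before
        self.log = [entry for entry in self.log if entry[0] != txn]


db = PagedDatabase({1: "balance=100", 2: "orders=3"})
db.write("t1", 1, "balance=40")
db.write("t1", 2, "orders=4")
db.write("t1", 1, "balance=0")
db.rollback("t1")   # something went wrong before completion
print(db.pages)
```

A log that also had to serve as an incremental backup, as the recovery tape did, would need to retain the changed pages after commit rather than discard them; the sketch leaves that part out.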
This first packaged version of IDS did lack some features later viewed as essential for database management systems. One was the idea that specific users could be granted or denied access to particular parts of the database. This was related to another limitation: IDS databases could be queried or modified only by writing and executing programs in which IDS calls were included. There was no interactive capability to request “ad hoc” reports or run one-off queries without having to write a program. Easy to use report generator systems (such as 9PAC and MARK IV) and online interactive data management systems (such as TDMS) were created during the 1960s but they were generally seen as a separate class of software from data base management systems. (By the 1970s these packages were still popular, but included optional modules to interface with data stored in database management systems).
IDS and CODASYL
After Bachman handed IDS over to a different team within General Electric in 1964 it was made available as a documented and supported software package for the company’s 200 series computers. Later versions supported its 400 and 600 series systems. New versions followed in the 1970s after Honeywell bought out General Electric’s computer business. IDS was a strong product, in many respects more advanced than IBM’s IMS which appeared several years later. However IBM machines dominated the industry so software from other manufacturers was doomed to relative obscurity whatever its merits. In those days software packages from computer manufacturers were paid for by hardware sales and given to customers without an additional charge.
During the late 1960s the ideas Bachman created for IDS were taken up by the Database Task Group of CODASYL, a standards body for the data processing industry best known for its creation and promotion of the COBOL language. Its 1969 report drew heavily on IDS in defining a proposed standard for database management systems, in part thanks to Bachman’s own service on the committee. In retrospect, the committee’s work, and a related effort by CODASYL’s Systems Committee to evaluate existing systems within the new framework, were significant primarily for formulating and spreading the concept of a “data base management system.”
CODASYL’s definition of the architecture of a database management system and its core capabilities was quite close to that included in textbooks to this day. In particular, it suggested that a data base management system should support on-line, interactive applications as well as batch driven applications and have separate interfaces. CODASYL’s initial report, published in 1969, documented foundational concepts and vocabulary such as data definition language, data manipulation language, schemas, data independence, and program independence. It went beyond early versions of IDS by adding security features, including the idea of “privacy locks,” and included “sub-schemas,” roughly equivalent to views in relational systems, so that different programs could work with specially presented subsets of the overall content of the database.
Although IBM itself refused to support the CODASYL approach, continuing to favor its own IMS with its simple hierarchical data model, many other computer vendors supported its recommendations and eventually produced systems incorporating these features. The most successful CODASYL system, IDMS, came from an independent software company and began as a port of IDS to IBM’s dominant System/360 mainframe platform.
The Legacy of IDS
IDS and CODASYL systems did not use the relational data model, formulated years later by Ted Codd, which underlies today’s dominant SQL database management systems. Instead it introduced what would later be called the “network data model.” This encoded relationships between different kinds of records as a graph, rather than the strict hierarchy enforced by tape systems and some other software packages of the 1960s such as IBM’s later and widely used IMS. The network data model was widely used during the 1970s and 1980s, and commercial database management systems based on this approach were among the most successful products of the mushrooming packaged software industry.
Bachman spoke memorably in his Turing Award lecture of the “Programmer as Navigator,” charting a path through the database from one record to another. The IDS approach required programmers to work with records one at a time. Performing the same operation on multiple records meant retrieving a record, processing and if necessary updating it, and then moving on to the next record of interest to repeat the process. For some tasks this made programs longer and more cumbersome than the equivalent in a relational system, where a task such as deleting all records more than a year old or adding 10% to the sales price of every item could be performed with a single command. In addition, IDS and other network systems encoded what we now think of as the “joins” between different kinds of records as part of the database structure rather than specifying them in each query. This made IDS much less flexible than later relational systems, but also much simpler to implement and more efficient for routine operations.
This drawing, from the 1962 presentation “IDS: The Information Processing Machine We Need,” shows the use of chains to connect records. The programmer used GET commands to navigate between related records.
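The contrast between record-at-a-time navigation and a single set-oriented command can be sketched as follows. This is an illustration only: the `Item` class and its `next` pointer are a hypothetical stand-in for an IDS chain, and the relational side uses SQLite through Python's standard `sqlite3` module simply because it is readily available.

```python
import sqlite3


class Item:
    """A record linked into a chain (hypothetical stand-in for an IDS chain)."""

    def __init__(self, name, price):
        self.name, self.price, self.next = name, price, None


def build_chain(rows):
    head = prev = None
    for name, price in rows:
        node = Item(name, price)
        if prev is None:
            head = node
        else:
            prev.next = node
        prev = node
    return head


def raise_prices_navigational(head, factor):
    # Record at a time: fetch a record, update it, move to the next one.
    current = head
    while current is not None:
        current.price = round(current.price * factor, 2)
        current = current.next


def raise_prices_relational(conn, factor):
    # Set at a time: one statement covers every row.
    conn.execute("UPDATE item SET price = ROUND(price * ?, 2)", (factor,))


rows = [("bolt", 1.00), ("gear", 20.00), ("motor", 150.00)]

head = build_chain(rows)
raise_prices_navigational(head, 1.10)
nav_prices = []
node = head
while node is not None:
    nav_prices.append(node.price)
    node = node.next

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (name TEXT, price REAL)")
conn.executemany("INSERT INTO item VALUES (?, ?)", rows)
raise_prices_relational(conn, 1.10)
sql_prices = [p for (p,) in conn.execute("SELECT price FROM item ORDER BY rowid")]

print(nav_prices, sql_prices)
```

In the navigational version the traversal order is fixed by the chain, so the program needs no search at all; in the relational version the system decides how to reach the rows. That is the trade between efficiency for routine operations and flexibility described above.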
IDS was a useful and practical tool for business use in the early 1960s, while relational systems were not commercially available until the 1980s. Relational systems did not become feasible until computers were orders of magnitude more powerful than they had been in 1962 and some extremely challenging implementation issues had been overcome. Even after relational systems were commercialized the two approaches were seen for some time as complementary, with network systems used for high performance transaction processing systems handling routine operations on large numbers of records (for example credit card processing) and relational systems best suited for flexible “decision support” data crunching. Although IDMS is still in use for some few very large applications it, and other database management systems based on Bachman’s network data model, have long since been superseded for new applications and for mainstream computing needs.
Still, without IDS and Bachman’s tireless championing of the ideas it contained the very concept of a “database management system” might never have taken root in the first place. When database specialists look at IDS today it is easy to see its limitations compared to modern systems. Its strengths are easy to miss because its huge influence on the software industry meant that much of what was revolutionary about it in 1962 was soon taken for granted. IDS did more than any other single piece of software to broaden the range of business problems to which computers could usefully be applied and so to usher in today’s world where every administrative transaction involves a flurry of database queries and updates rather than the filing of forms completed in triplicate.
Blogger’s Profile:
Thomas Haigh is an Associate Professor of Information Studies at the University of Wisconsin–Milwaukee. He chairs SIGCIS, the group for historians of information technology, and has published widely on different aspects of the history of computing. Learn more at www.tomandmaria.com/tom
As data is becoming increasingly more important in our society, there are many successful companies doing data-related businesses. This field grows so fast that many new startups are launched with the goal to become the “next Google.” This trend also provides a lot of entrepreneurship opportunities for our community working on data management research. This blog describes my experiences of doing a startup (called SRCH2, http://www.srch2.com/) that commercializes university research. It also shares my own perspective on entrepreneurship in data management research.
This blog is based on the talk that I gave at the DBRank workshop at VLDB 2012 and the talk slides are available on my homepage.
SRCH2: Commercializing Data Management Research
One of the research topics I work on at UC Irvine is related to powerful search. It started when I talked to people at the UCI Medical School and asked the question: “What are your data management problems?”. One of the challenges they were facing was record linkage, i.e., identifying that two records from different data sources represent the same real-world entity. An important problem in this context is approximate string search, which is supporting queries with fuzzy matching predicates, such as finding records with keywords similar to the former California “Terminator” governor. While looking into the details, I realized that the problem was not solved on large data sets, so I started leading a research team to work on it. After several years, we developed several novel techniques, and released an open-source C++ package called Flamingo (http://flamingo.ics.uci.edu/), which received a lot of attention from academia and industry. I also took a leave from UCI to work as a visiting scientist at Google, and this experience was very beneficial. It not only showed me how large companies manage data management projects and solve challenging problems, but also taught me how to manage a research team in a university setting.
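To make the notion of a fuzzy matching predicate concrete, here is a minimal sketch of approximate string search using edit distance. It is a brute-force scan over a short list, not the techniques developed in the research described here, which had to work on large data sets; the function names, the sample keywords, and the distance threshold are all assumptions chosen for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_search(query, keywords, max_distance):
    # Keep every keyword within max_distance edits of the query.
    q = query.lower()
    return [k for k in keywords if edit_distance(q, k.lower()) <= max_distance]


names = ["Schwarzenegger", "Schwarzkopf", "Schwartz", "Eggers"]
matches = fuzzy_search("Shwarzeneger", names, max_distance=2)
print(matches)
```

The scan compares the query against every keyword, so its cost grows linearly with the size of the collection. That cost is the reason the problem becomes hard on large data sets.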
In 2008, when pushing our research to the UCI community, we identified one “killer app” domain: people search. We developed a system prototype called PSearch (http://psearch.ics.uci.edu/) that supports instant and error-tolerant search. The system gradually became popular on the campus and many people began using it on a daily basis. Many of them told me their personal stories in which they were able to find people quickly, despite their vague recall of names. Meanwhile, collaborating with colleagues at Tsinghua University, we were able to scale the techniques to larger data sets and developed another system called iPubmed (http://ipubmed.ics.uci.edu), which enabled the same features on 21 million MEDLINE publications. We also developed techniques in other domains, such as geo search.
As our systems became more and more popular, very often I got requests from users asking: “Can I run your engine on my own data sets?” As a former PhD student in the Stanford Database Group, the home of many successful companies such as Junglee, Google, and Aster Data, I always had the dream of doing my own startup. Then the answer became very natural: “Why don’t we commercialize the results?” So I incorporated a company in 2008, which was initially called “Bimaple” and was recently renamed SRCH2 to better describe its search-related business. SRCH2 has developed a search engine (built from the ground up in C++) targeting enterprises that want to enable a Google-like search interface for their customers. It offers a solution similar to Lucene and Sphinx Search, but with more powerful features such as instant search, error correction, geo support, real-time updates, and customizable ranking. Currently its first products are developed and it has paying customers.
(Good) Lessons Learned
In the four years of running the company so far, I have learned many things that were beyond my imagination. Here are some of the (good) lessons learned so far.
In summary, my entrepreneurship experiences have been challenging but enjoyable and educational. I hope more of you take the adventure and commercialize your research. It can help you “think different.”
Blogger’s Profile:
Chen Li is a professor in the Department of Computer Science at the University of California, Irvine. He received his Ph.D. degree in Computer Science from Stanford University in 2001, and his M.S. and B.S. in Computer Science from Tsinghua University, China, in 1996 and 1994, respectively. He received a National Science Foundation CAREER Award in 2003 and many other NSF grants and industry gifts. He was once a part-time Visiting Research Scientist at Google. His research interests are in the fields of data management and information search, including text search and data-intensive computing. He was a recipient of the SIGMOD 2012 Test-of-Time award. He is the founder of SRCH2, a company providing powerful search solutions for enterprises and developers.
Big Data is the buzzword in the database community these days. Two of the first three blog entries of the SIGMOD blog are on Big Data. There was a plenary research session with invited talks at the 2012 SIGMOD Conference and there will be a panel at the 2012 VLDB Conference. Probably, everything has already been said that can be said. So, let me just add my own personal data point to the sea of existing opinions and leave it to the reader whether I am adding to the “signal” or adding to the “noise”. This blog entry is based on the talk that I gave at SIGMOD 2012 and the slides of that talk can be found at http://www.systems.ethz.ch/Talks .
Upfront, I would like to make clear that I am a believer. Stepping back, I am asking myself why do I work on Big Data technologies? I came up with two potential reasons:
In the following, I would like to explain my personal view on these two reasons.
Making the World a Better Place
The real question to ask is whether bigger = smarter. The simple answer is “yes”. The success of services like Google and Bing is evidence for the “bigger = smarter” principle. The more data you have and can process, the higher the statistical relevance of your analysis and the better answers you get. Furthermore, Big Data allows you to make statements about corner cases and the famous “long tail”. Putting it differently, “experience” is more valuable than “thinking”.
The more complicated answer to the question whether bigger is smarter is “I do not know”. My concern is that the bigger Big Data gets, the more difficult we make it for humans to get involved. Who wants to argue with Google or Bing? At the end, all we can do is trust the machine learning. However, Big Data analytics needs as much debugging as any other software we produce and how can we help people to debug a data-driven experiment with 5 PB of data? Putting it differently, what do you make out of an experiment that validates your hypothesis with 5 PB of data but does not validate your hypothesis with, say, 1 KB of data using the same piece of code? Should we just trust the “bigger = smarter” principle and use the results of the 5 PB experiment to claim victory?
The more fundamental problem is that Big Data technologies tempt us into doing experiments for which we have no ground truth. Often, the absence of a ground truth is the reason for using Big Data: If we knew the answer already, we would not need Big Data. Despite all the mathematical and statistical tools that are available today, however, debugging a program without knowing what the program should be doing is difficult. To give an example: Let us assume that a Big Data study revealed that the leftmost lane is the fastest lane in a traffic jam. What does this result mean? Does it mean that we should all be driving in the left lane? Does it mean that people in the left lane are more aggressive? Or does it mean that people in the left lane just believe that they are faster? This example combines all the problems of discovering facts without a ground truth: By asking the question, you are biasing the result. And by getting a result, you might be biasing the future result, too. (And, of course, if you had done the same study only looking at data from Great Britain, you might have come to the opposite conclusion that the rightmost lane is the fastest.)
Google Translate is a counterexample and clearly a Big Data success story: Here, we do know the ground truth and Google developers are able to debug and improve Google Translate based on that ground truth – at least as long as we trust our own language skills more than we trust Google. (When it comes to spelling, I actually already trust Google and Bing more than I trust myself.)
Maybe, all I am trying to say is that we need to be more careful in what we promise and do not forget to keep the human in the loop. I trust statisticians that “bigger is smarter”, but I also believe that humans are even smarter and the combination is what is needed, thereby letting each party do what it is best at.
Because We Can
Unfortunately, we cannot make humans become smarter (and we should not even try), but we can try to make Big Data bigger. Even though I argued in the previous section that it is not always clear that bigger Big Data makes the world a better or smarter place, we as a data management community should be constantly pushing to make Big Data bigger. That is, we should build data management tools that scale, perform well, and are cost effective and get continuously better in all regards. Honestly, I do not know how that will make the world a better place, but I am optimistic that it will: History teaches that good things will happen if you do good work. Also, we should not be shy to make big promises such as processing 100 PB of heterogeneous data in real-time – if that is what our customers want and are willing to pay for. We should also continue to encourage people to collect all the data and then later think about what to do with it. If there are risks in doing all that (e.g., privacy risks), we need to look at those, too, and find ways to reduce those risks and still become better at our core business of becoming bigger, faster, and cheaper. We might not be
There are two things that we need to change, however. First, we need to build systems that are explicit about the utility / cost tradeoff of Big Data. Mariposa pioneered this idea in the Nineties; in Mariposa, utility was defined as response time (the faster the higher the utility), but now things get more complicated: With Big Data, utility may include data quality, data diversity, and other statistical metrics of the data. We need tools and abstractions that allow users to explicitly specify and control these metrics.
Second, we need to package our tools in the right way so that users can use them. There is a reason why Hadoop is so successful even though it has so many performance problems. In my opinion, one of the reasons is that it is not a database system. Yet, it can be a database system if combined with other tools of the Hadoop eco-system. For instance, it can be a transactional database system if combined with HDFS, Zookeeper, and HBase. However, it can also become a logging system to help customer support if combined with HDFS and SOLR. And, of course, it can
Blogger’s Profile:
Donald Kossmann is a professor in the Systems Group of the Department of Computer Science at ETH Zurich (Switzerland). He received his MS in 1991 from the University of Karlsruhe and completed his PhD in 1995 at the Technical University of Aachen. After that, he held positions at the University of Maryland, the IBM Almaden Research Center, the University of Passau, the Technical University of Munich, and the University of Heidelberg. He is a former associate editor of ACM Transactions on Database Systems and ACM Transactions on Internet Technology. He was a member of the board of trustees of the VLDB endowment from 2006 until 2011, and he was the program committee chair of the ACM SIGMOD Conf., 2009 and PC co-chair of VLDB 2004. He is an ACM Fellow. He has been a co-founder of three start-ups in the areas of Web data management and cloud computing.
Computer science publication culture and practices have become an active discussion topic. Moshe Vardi has written a number of editorials on the topic in Communications of the ACM, and these have generated considerable discussion. The conversation on this issue has been expanding, and Jagadish has collected the writings on Scholarly Publications for CRA, which is a valuable resource.
The database community has pioneered discussions on publication issues. We have had panels at conferences, discussions during business meetings, informal conversations during conferences, and discussions within the SIGMOD Executive and the VLDB Endowment Board; we have been at this since about 2000. I wrote about one aspect of this back in 2002 in my SIGMOD Chair’s message.
The initial conversation in the database community was due to the significant increase in the number of submitted papers to our conferences that we were experiencing year after year. The increasing number of submissions had started to severely stress our ability to meaningfully manage the conference reviewing process. It became quite clear, quite quickly, to a number of us that the overriding problem was our over-reliance on conferences that were not designed to fulfill the role that we were pushing them to play: being the final archival publication venues. I argued this point in my 2002 SIGMOD Chair’s message that I mentioned above. I ended that message by stating that we “have been very successful over the years in convincing tenure and promotion committees and university bodies about the value of the conferences (rightfully so), we now have to convince ourselves that journals are equally valuable and important venues to publish fuller research results.” The same topic was the focus of my presentation on the panel on “Paper and Proposal Reviews: Is the Process Flawed?” that Hank Korth organized at the 2008 CRA Snowbird Conference (the report of the panel appeared in SIGMOD Record).
This discussion needs to start with our objectives. In an ideal world, what we want are:
The conventional wisdom is that conferences are superior on the first two points and the third point is something we can tinker with (and we have been tinkering with for quite a while with mixed results) while the fourth objective is addressed by a combination of increasing conference paper page limits, decreasing font sizes so we can pack more material per page, and the practice of submitting fuller versions of conference papers to journals. Data suggest that the first issue does not hold – our top journals now have first round review times that are competitive with “traditional” conferences (e.g., SIGMOD and ICDE). The second issue can be addressed by adopting a publication business model that relies primarily on on-line dissemination with print copies released once per volume – this way you don’t wait for print processing, nor do you have to worry about page budgets and the like. Note that I am not talking about “online-first” models, but actually publishing the final version of the paper online as soon as the final version can be produced after acceptance. Journals perform much better on the last two points.
In my view, in the long run, we will follow other science and engineering disciplines and start treating journals as the main outlet for disseminating our research results. However, the road from here to there is not straightforward and there are a number of alternatives that we can follow. Accepting the fact that we, as a community, are not yet willing to give up on the conference model of publication, what are some of the measures we can take? Here are some suggestions:
These are things that we currently do – Proceedings of VLDB (PVLDB) incorporates these suggestions. It represents the current thinking of the VLDB Endowment Board after many years of discussions. Although I had some reservations at the beginning, I have become convinced that it is better than our traditional conferences. However, I am suggesting going further:
As I said earlier, my personal belief is that we will eventually shift our focus to journal publications. What I outlined above is a set of policies we can adopt to move in that direction. For an open membership organization such as SIGMOD, making major changes such as these requires full engagement of the membership. I hope we start discussing.
Blogger’s Profile:
M. Tamer Özsu is Professor of Computer Science at the David R. Cheriton School of Computer Science of the University of Waterloo. He was the Director of the Cheriton School of Computer Science from January 2007 to June 2010. His research is in data management focusing on large-scale data distribution and management of non-traditional data. His publications include the book Principles of Distributed Database Systems (with Patrick Valduriez), which is now in its third edition. He has also edited, with Ling Liu, the Encyclopedia of Database Systems. He serves as the Series Editor of Synthesis Lectures on Data Management (Morgan & Claypool) and on the editorial boards of three journals and two book series. He is a Fellow of the Association for Computing Machinery (ACM), and of the Institute of Electrical and Electronics Engineers (IEEE), and a member of Sigma Xi.
I was recently approached by an entrepreneur who had an interesting way to correlate short term performance of a stock with news reports about the stock. Needless to say, there are many places from which one can get the news, and what results one gets from this sort of analysis does depend on the input news sources. Surprisingly, within two minutes the conversation had drifted from characteristics of news sources to the challenges of running SVM on Hadoop. The reason for this is not that Hadoop is the right infrastructure for this problem, but rather that the problem can legitimately be considered a Big Data problem. In consequence, in the minds of many, it must be addressed by running analytics in the cloud.
I have nothing against cloud services. In fact, I think they are an important part of the computational eco-system, permitting organizations to out-source selected aspects of their computational needs, and to provision peak capacity for load bursts. The map-reduce paradigm is a fantastic abstraction with which to handle tasks that are “embarrassingly parallelizable.” In short, there are many circumstances in which cloud services are called for. However, they are not always the solution, and are rarely the complete solution. For the stock price data analysis problem, based solely on the brief outline I’ve given you, one cannot say whether they are appropriate.
I have nothing against Support Vector Machines, or other machine learning techniques. They can be immensely useful, and I have used them myself in many situations. Scaling up these techniques for large data sets can be an issue, and certainly is a Big Data challenge. But for the problem at hand, I would be much more concerned about how it was modeled than how the model was scaled. What should the features be? Do we worry about duplicates in news appearances? Into how many categories should we classify news mentions? These are by far the more important questions to answer, because how we answer them can change what results we get: scaling better will only change how fast we get them.
It is hard to avoid mention of Big Data anywhere we turn today. There is broad recognition of the value of data, and products obtained through analyzing it. Industry is abuzz with the promise of big data. Government agencies have recently announced significant programs towards addressing challenges of big data. Yet, many have a very narrow interpretation of what that means, and we lose track of the fact that there are multiple steps to the data analysis pipeline, whether the data are big or small. At each step, there is work to be done, and there are challenges with Big Data.
The first step is data acquisition. Some data sources, such as sensor networks, can produce staggering amounts of raw data. Much of this data is of no interest, and it can be filtered and compressed by orders of magnitude. One challenge is to define these filters in such a way that they do not discard useful information. For example, in considering news reports, is it enough to retain only those that mention the name of a company of interest? Do we need the full report, or just a snippet around the mentioned name? The second big challenge is to automatically generate the right metadata to describe what data is recorded and how it is recorded and measured. This metadata is likely to be crucial to downstream analysis. For example, we may need to know the source for each report if we wish to examine duplicates.
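To make the acquisition questions above a little more concrete, here is a minimal sketch of a filter that keeps only reports mentioning a company, retains a snippet around the mention, and records the source so that duplicates can be examined later. The record layout, the `filter_reports` name, the sample reports, and the 40-character window are all assumptions made for illustration; the matching is a deliberately naive substring search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    """A filtered excerpt plus the metadata downstream steps may need."""
    company: str
    text: str
    source: str   # where the report came from (provenance)
    offset: int   # character position of the mention in the full report


def filter_reports(reports, company, window=40):
    """Keep only reports mentioning `company`, reduced to a snippet.

    `reports` is an iterable of (source, full_text) pairs. Matching is a
    plain case-insensitive substring search on ASCII text.
    """
    kept = []
    needle = company.lower()
    for source, full_text in reports:
        pos = full_text.lower().find(needle)
        if pos < 0:
            continue  # report does not mention the company: discard it
        start = max(0, pos - window)
        end = min(len(full_text), pos + len(company) + window)
        kept.append(Snippet(company, full_text[start:end], source, pos))
    return kept


# Hypothetical input: three reports from three made-up sources.
reports = [
    ("wire-a", "Shares of Acme Corp rose after the earnings call."),
    ("blog-b", "Weather was mild across the region today."),
    ("wire-c", "Analysts downgraded ACME CORP on weak guidance."),
]
snippets = filter_reports(reports, "Acme Corp")
for s in snippets:
    print(s.source, s.offset, repr(s.text))
```

Whether a fixed window keeps enough context is exactly the kind of filtering decision the paragraph above warns about; the window size is a parameter here so that the choice stays visible.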
Frequently, the information collected will not be in a format ready for analysis. The second step is an information extraction process that pulls out the required information from the underlying sources and expresses it in a structured form suitable for analysis. A news report will get reduced to a concrete structure, such as a set of tuples, or even a single class label, to facilitate analysis. Furthermore, we are used to thinking of Big Data as always telling us the truth, but this is actually far from reality. We have to deal with erroneous data: some news reports are inaccurate.
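As a small illustration of reducing a report to a structured tuple, the sketch below pulls a (company, direction, percent) triple out of one very regular sentence shape. The `PriceMove` tuple, the regular expression, and the sample sentences are hypothetical; real extraction would need far more than a single pattern, and nothing here checks whether the report itself is accurate.

```python
import re
from typing import NamedTuple, Optional


class PriceMove(NamedTuple):
    company: str
    direction: str   # "up" or "down"
    percent: float


# Assumed sentence shape: "<Capitalized Name> shares rose|fell <number>%".
PATTERN = re.compile(
    r"(?P<company>[A-Z][\w&]*(?: [A-Z][\w&]*)*) shares "
    r"(?P<verb>rose|fell) (?P<pct>\d+(?:\.\d+)?)%"
)


def extract(report: str) -> Optional[PriceMove]:
    """Reduce a free-text report to one structured tuple, or None."""
    m = PATTERN.search(report)
    if m is None:
        return None  # nothing extractable from this report
    direction = "up" if m.group("verb") == "rose" else "down"
    return PriceMove(m.group("company"), direction, float(m.group("pct")))


print(extract("Acme Corp shares rose 3.5% on Tuesday."))
print(extract("Globex shares fell 12% after the recall."))
print(extract("Markets were quiet ahead of the holiday."))
```

Returning `None` instead of guessing keeps unparseable reports distinguishable from parsed ones, which matters once erroneous or incomplete inputs have to be handled downstream.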
Data analysis is considerably more challenging than simply locating, identifying, understanding, and citing data. For effective large-scale analysis all of this has to happen in a completely automated manner. This requires differences in data structure and semantics to be expressed in forms that are computer understandable, and then “robotically” resolvable. Even for simpler analyses that depend on only one data set, there remains an important question of suitable database design. Usually, there will be many alternative ways in which to store the same information. Certain designs will have advantages over others for certain purposes, and possibly drawbacks for other purposes.
Mining requires integrated, cleaned, trustworthy, and efficiently accessible data, declarative query and mining interfaces, scalable mining algorithms, and big-data computing environments. A problem with current Big Data analysis is the lack of coordination between database systems, which host the data and provide SQL querying, and analytics packages that perform various forms of non-SQL processing, such as data mining and statistical analyses. Today’s analysts are impeded by a tedious process of exporting data from the database, performing a non-SQL process and bringing the data back.
Having the ability to analyze Big Data is of limited value if users cannot understand the analysis. Ultimately, a decision-maker, provided with the result of analysis, has to interpret these results. Usually, this involves examining all the assumptions made and retracing the analysis. Furthermore, as we saw above, there are many possible sources of error: computer systems can have bugs, models almost always have assumptions, and results can be based on erroneous data. For all of these reasons, users will try to understand, and verify, the results produced by the computer. The computer system must make it easy for them to do so by providing supplementary information that explains how each result was derived, and based upon precisely what inputs.
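One simple way to picture such supplementary information is to have every computed value carry a note on how it was derived and which inputs it used. The sketch below does this for an average; the `Traced` class, the `source` and `mean` helpers, and the record ids are invented for this example and stand in for whatever provenance mechanism a real system would offer.

```python
from dataclasses import dataclass, field


@dataclass
class Traced:
    """A value bundled with a description of how it was derived."""
    value: float
    how: str
    inputs: list = field(default_factory=list)  # ids of the raw inputs used


def source(record_id, value):
    """Wrap a raw input so that later results can point back to it."""
    return Traced(value, f"input {record_id}", [record_id])


def mean(items):
    """Average traced values while accumulating their provenance."""
    if not items:
        raise ValueError("mean() needs at least one value")
    used = sorted({rid for item in items for rid in item.inputs})
    value = sum(item.value for item in items) / len(items)
    return Traced(value, f"mean of {len(items)} values", used)


# Hypothetical readings identified by made-up record ids.
readings = [source("r1", 2.0), source("r2", 4.0), source("r3", 6.0)]
result = mean(readings)
print(result.value, result.how, result.inputs)
```

A user who doubts the result can read off `result.inputs` and go back to exactly those records, rather than having to retrace the whole analysis by hand.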
In short, there is a multi-step pipeline required to extract value from data. Heterogeneity, incompleteness, scale, timeliness, privacy and process complexity give rise to challenges at all phases of the pipeline. Furthermore, this pipeline isn’t a simple linear flow – rather there are frequent loops back as downstream steps suggest changes to upstream steps. There is more than enough here that we in the database research community can work on.
To highlight this fact, several of us got together electronically last winter, and wrote a white paper, available at http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf . Please read it, and say what you think. The database community came very late to much of the web. We should make sure not to miss the boat on Big Data.
My post is loosely based on an extract from this white paper, which was created through a distributed conversation among many prominent researchers listed below.
Divyakant Agrawal, UC Santa Barbara
Blogger’s Profile:
H. V. Jagadish is Bernard A. Galler Collegiate Professor of Electrical Engineering and Computer Science and Director of the Software Systems Research Laboratory at the University of Michigan, Ann Arbor. He is well-known for his broad-ranging research on information management, and particularly its use in biology, medicine, telecommunications, finance, engineering, and the web. He is an ACM Fellow and founding Editor in Chief of PVLDB. He serves on the board of the Computing Research Association.
tl;dr: MADlib is an open-source library of scalable in-database algorithms for machine learning, statistics and other analytic tasks. MADlib is supported with people-power from Greenplum; researchers at Berkeley, Florida and Wisconsin are also contributing. The project recently released a MADlib TR, and is now welcoming additional community contributions.
Warehousing → Science
Back in 2008, I had the good fortune to fall in with a group of data professionals documenting new usage patterns in scalable analytics. It was an interesting team: a computational advertising analyst at a large social networking firm, a seasoned DBMS consultant formerly employed at a major Internet retailer, a pair of DBMS engine developers and an academic.
The usage patterns we were seeing represented a shift from accountancy to analytics—from the cautious record-keeping of “Data Warehousing” to the open-ended, predictive task of “Data Science”. This shift was turning many Data Warehousing tenets on their heads. Rather than “architecting” an integrated permanent record that repelled data until it was well-conditioned, the groups we observed were interested in fostering a data-centric computational “watering hole”, where analysts could bring any kind of relevant data into a shared infrastructure, and experiment with ad-hoc integration and rich algorithmic analysis at very large scales.
In response to the dry TLAs of Data Warehousing, we dubbed this usage model MAD, to reflect
We wrote the MAD Skills paper in VLDB 2009 to capture these practices in broad terms. The paper describes the usage patterns mentioned above in more detail. It also includes a fairly technical section with a number of non-trivial analytics techniques adapted from the field, implemented via simple SQL excerpts.
MADlib (MAD Skills, the SQL)
When we released the MAD Skills paper, many people were interested not only in its design aspects, but also in the promise of sophisticated statistical methods in SQL. This interest came from multiple directions: DBMS customers were requesting it of consultants and vendors, and academics were increasingly publishing papers on in-database analytics. What was missing was a software framework to harness the energy of the community, and connect the various interested constituencies.
To this end, a group formed to build MADlib, a free, open-source library of SQL-based algorithms for machine learning, statistics, and related analytic tasks. The methods in MADlib are designed both for in- and out-of-core execution, and for the shared-nothing, “scale-out” parallelism offered by modern parallel database engines, ensuring that computation is done close to the data. The core functionality is written in declarative SQL statements, which orchestrate data movement to and from disk, and across networked machines. Single-node inner loops take advantage of SQL extensibility to call out to high-performance math libraries (currently, Eigen) in user-defined scalar and aggregate functions. At the highest level, tasks that require iteration and/or structure definition are coded in Python driver routines, which are used only to kick off the data-rich computations that happen within the database engine.
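The layering described above (SQL for data movement, a user-defined aggregate for the inner loop, a thin Python driver on top) can be sketched in a few lines. This is not MADlib code and does not use MADlib's API: it uses SQLite only because it ships with Python, and the `LinRegr` aggregate, the `linregr` SQL name, the `fit` driver, and the `points` table are all made up for illustration. A one-pass least-squares fit needs no iteration, so the driver here issues a single query; an iterative method would loop in the driver and issue one such query per pass.

```python
import json
import sqlite3


class LinRegr:
    """User-defined aggregate: least-squares fit of y = a + b*x.

    step() folds each row into running sums; finalize() solves the normal
    equations. Only the sums are kept, never the rows themselves.
    """

    def __init__(self):
        self.n = self.sx = self.sy = self.sxx = self.sxy = 0.0

    def step(self, y, x):
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def finalize(self):
        denom = self.n * self.sxx - self.sx * self.sx
        if denom == 0:
            return None  # fewer than two distinct x values: no unique fit
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return json.dumps({"intercept": intercept, "slope": slope})


def fit(conn, table):
    """Driver routine: it only issues SQL; the engine scans the data."""
    # `table` is interpolated directly, so it must be a trusted identifier.
    row = conn.execute(f"SELECT linregr(y, x) FROM {table}").fetchone()
    return json.loads(row[0]) if row[0] is not None else None


conn = sqlite3.connect(":memory:")
conn.create_aggregate("linregr", 2, LinRegr)
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)],
)
model = fit(conn, "points")
print(model)
```

Because the aggregate keeps only running sums, the same step logic could in principle run on separate partitions of the data and have the partial sums combined afterwards, which is the property that makes this style a natural fit for shared-nothing engines.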
The primary goal of the MADlib open-source project is to accelerate innovation and technology transfer in the Data Science community via a shared library of scalable in-database analytics, much as the CRAN library serves the R community. Unlike CRAN, which is customized to the R analytics tool, we hope that MADlib’s grounding in standard SQL can result in community ports to a variety of parallel database engines.
Open-Source Algorithms in Parallel DBMSs?
The state of scalable analytics today depends very much on who you talk to.
The motivation for considering parallel databases comes from both the database market and technology issues. There is a large and growing installed base of massively parallel commercial DBMSs in industry, fueled in part by a recent wave of startup acquisitions. Meanwhile, it is no surprise to database researchers that a massively parallel DBMS is a powerful platform for dataflow programming of sophisticated analytic algorithms. Research on sophisticated in-database analytics has been growing in recent years, in part as an offshoot of work on Probabilistic Databases. Education is hopefully shifting as well. For example, in my own CS186 database course this spring, the students not only wrote traditional SQL queries, they also had to implement a non-trivial social network analysis algorithm in SQL (betweenness centrality).
The open-source nature of MADlib represents a serious commitment by the entire team, and differs from the proprietary approaches traditionally associated with DBMS vendors. The decision to go open-source was motivated by a number of goals, including:
MADlib is still young, at Version 0.3. The initial versions focused on establishing infrastructure and a baseline of textbook and some advanced methods; this initial suite actually covers a fair bit of ground (Table 1). Most methods were chosen because they were frequently requested by customers we met through contacts at Greenplum. More recently, we made a point of validating MADlib as a research vehicle, by fostering a small number of university groups who were working in the area to experiment with the platform and get their code disseminated. Profs. Chris Ré at Wisconsin and Daisy Wang at Florida have written up their work in a MADlib tech report that expands upon this post.
MADlib is currently ported to PostgreSQL (single-node, open-source) and Greenplum (shared-nothing parallel, commercial). Greenplum inherits the PostgreSQL extensibility interfaces almost completely, so these two ports were easy to pursue simultaneously in the early days of the project. Another attraction of Greenplum is that it offers a free download of a massively parallel DBMS for researchers, so there is no limitation on scaling experiments. (This is surprisingly unusual: most DBMS vendors still only advertise free trial downloads of “crippleware” that artificially limits database size or the number of nodes. I would imagine that market forces will change this story relatively soon.)
MADlib is hosted publicly on GitHub, and readers are encouraged to browse the code and documentation via the MADlib website. The initial MADlib codebase reflects contributions from both industry (a team at Greenplum) and academia (Berkeley, Wisconsin, Florida). Project oversight and Quality Assurance efforts have been contributed by Greenplum. Our MADlib TR expands on the architecture and status, and also includes extensive discussion of related work.
At this time, MADlib is ready to consider contributions from additional parties, including both new methods and ports to new platforms. Like any serious open-source project, contributions will have to be managed carefully to maintain code quality. I hope that more researchers will find it worthwhile to contribute serious code to the MADlib effort. It’s a bit more work than getting an algorithm ready to run experiments in a paper, but it’s really satisfying to develop and refine production-quality open-source code, and get it delivered to end-users. If you are doing research on scalable analytic methods, consider going the extra mile and contributing your code to the MADlib effort.
For more information on MADlib, please see the website at http://madlib.net.
Thanks to Chris Ré, Florian Schoppmann and Daisy Wang for their help writing up the recent MADlib TR that this post excerpts, and to Azza Abouzied, Peter Bailis, and Neil Conway for feedback on this version.
Joseph M. Hellerstein is a Chancellor’s Professor of Computer Science at the University of California, Berkeley, whose work focuses on data-centric systems and the way they drive computing. He is an ACM Fellow, an Alfred P. Sloan Research Fellow and the recipient of two ACM-SIGMOD “Test of Time” awards for his research. In 2010, Fortune Magazine included him in their list of 50 smartest people in technology, and MIT’s Technology Review magazine included his Bloom language for cloud computing on their TR10 list of the 10 technologies “most likely to change our world”. A past research lab director for Intel, Hellerstein maintains an active role in the high tech industry, currently serving on the technical advisory boards of a number of computing and Internet companies including EMC, SurveyMonkey, Platfora and Captricity.
Typically I teach around 100 students per year in my introductory database course. This past fall my enrollment was a whopping 60,000. Admittedly, only 25,000 of them chose to submit assignments, and a mere 6500 achieved a strong final score. But even with 6500 students, I more than quadrupled the total number of students I’ve taught in my entire 18-year academic career.
The story begins a couple of years earlier, when Stanford computer science faculty started thinking about shaking up the way we teach. We were tired of delivering the same lectures year after year, often to a half-empty classroom because our classes were being videotaped. (The primary purpose of the videotaping is for Stanford’s Center for Professional Development, but the biggest effect is that many Stanford students skip lectures and watch them later online.) Why not “purpose-build” better videos: shorter, topic-specific segments, punctuated with in-video quizzes to let watchers check their understanding? Then class time could be made more enticing for students and instructor alike, with interactive activities, advanced or exotic topics, and guest speakers. This “flipped classroom” idea was evangelized in the Stanford C.S. department by Daphne Koller; I was one of the early adopters, creating my videos during the first few months of 2011. Recording was a low-tech affair, involving a computer, Cintiq tablet, cheap webcam and microphone, Camtasia software, and a teaching assistant to help with editing.
I put my videos online for the public, and soon realized that with a little extra work, I could make available what amounted to an entire course. With further help from the teaching assistant, I added slides (annotated as lecture notes, and unannotated for teaching use by others), demo scripts, pointers to textbook readings and other course materials, a comprehensive suite of written and programming exercises, and quick-guides for relevant software. The site got a reasonable amount of traffic, but the turning point came when my colleague Sebastian Thrun decided to open up his fall 2011 introductory artificial intelligence course to the world. After one email announcement promising a free online version of the Stanford AI course, including automatically-graded weekly assignments and a “statement of accomplishment” upon completion, Sebastian’s public course garnered tens of thousands of sign-ups within a week.
Having already prepared lots of materials, I jumped on the free-to-the-world bandwagon, as did my colleague Andrew Ng with his machine learning course. What transpired over the next ten weeks was one of the most rewarding things I’ve done in my life. The sign-ups poured in, and soon the “Q&A Forum” was buzzing with activity. The fact that I had a lot of materials ready before the course started turned out to be a bit deceptive—for ten weeks I worked nearly full-time on the course (never mind my other job as department chair, much less my research program), in part because there was a lot to do, but mostly because there was a lot I could do to make it even better, and I was having a grand time.
In addition to the video lectures, in-video quizzes, course materials, and self-guided exercises, I added two very popular components: quizzes that generate different combinations of correct and incorrect answers each time they’re launched (using technology pioneered a decade ago by my colleague Jeff Ullman in his Gradiance system), and interactive workbenches for topics ranging from XML DTD validation to view-update triggers. I offered midterm and final exams—multiple-choice, and crafted carefully so the problems weren’t solvable by running queries or checking Wikipedia. (Creating these exams, at just the right level, turned out to be one of the most challenging tasks of the entire endeavor.) To add a personal touch, and to amplify the strong sense of community that quickly welled up through the Q&A Forum, each week I posted a “screenside chat” video—modeled after Franklin D. Roosevelt’s fireside chats—covering topics ranging from logistical issues, to technical clarifications, to full-on cheerleading for those who were struggling.
Meanwhile back on the campus front, the Stanford students worked through exactly the same materials as the public students (except for the multiple-choice exams), but they did get something more for their money: hand-graded written problems with more depth than the automated exercises, a significant programming project, traditional written exams, and classroom activities ranging from interactive problem-solving to presentations by data architects at Facebook and Twitter. There’s no question that the Stanford students were satisfied: I’ve taught the course enough times to know that the uptick in my teaching ratings was statistically significant.
One interesting and surprisingly large effect of having 60,000 students is the need for absolute perfection: not one tiny flaw or ambiguity goes unnoticed. And when there’s a downright mistake, especially in, say, an exam question … well, I shudder to remember. The task of correcting small (and larger) errors and ambiguities in videos, quizzes, exercises, and other materials, was a continuing chore, but certainly instructive.
What kept me most engaged throughout the course was the attitude of the public students, conveyed primarily through emails and posts on the Q&A Forum. They were unabashedly, genuinely, deeply appreciative. Many said the course was a gift they could scarcely believe had come their way. As the course came to a close, several students admitted to shedding tears. One posted a heartfelt poem. A particularly noteworthy student named Amy became an absolute folk hero: Over the duration of the course Amy answered almost 900 posted questions. Regardless of whether the questions were silly or naive, complex or deep, her answers were patient, correct, of just the right length, included examples as appropriate, and were crafted in perfect English. Amy never revealed anything about herself (although she agreed to visit me after the course was over), despite hundreds of adoring public thank-you’s from her classmates, and one marriage proposal!
So who were these thousands and thousands of students? I ran a survey that revealed some interesting statistics. For example, although ages and occupations spanned the gamut, the largest contingent of students were software professionals wanting to sharpen their job skills. Many students commented that they’d been programming with databases for years without really knowing what they were doing. Males outnumbered females four to one, which is actually a little better than the ratio among U.S. college computer science majors. Students hailed from 130 countries; the U.S. had the highest number by a wide margin, followed by India and Russia. (China unfortunately blocked some of the content, although a few enterprising students helped each other out with workarounds.) On db-class.org you can find the full survey results via the FAQ page, as well as some participation and performance statistics.
Were there any negatives to the experience? Naturally there were a few complainers. For example, in my screenside chats I often referred to the “eager beavers” who were working well ahead of the schedule, and the “procrastinators” who were barely meeting deadlines. Most students enjoyed self-identifying into the categories (some eager-beavers even planned to make T-shirts), but a few procrastinators objected to the term, pointing out that they were squeezing the course between a full-time job or two and significant family obligations. A number of students were disappointed by the low-tech, non-Stanford-endorsed “statement of accomplishment” they received at the end; despite ample warnings from the start, apparently some students were still expecting official certification. I can’t help but wonder if some of those students were the same ones who cheated; I did appear to have quite a number of secondary accounts created expressly for achieving a perfect score. I made it clear from the start that I was assuming students were in it to learn, and cheating was not something I planned to prevent or even think about. Of course in the long run of online education, the interrelated topics of certification and cheating will need to be addressed.
So what happens next? Stanford is launching quite a few more courses in the same style, and I’ll offer mine again next fall. MIT has jumped on the bandwagon; other universities can’t be far behind. Independent enterprises such as the pioneering Khan Academy, and the recently-announced Udacity, are sure to play into the scene. There’s no doubt we’re at a major inflection point in higher education, both on campus and through internet distribution to the world. I’m thrilled to have been an early part of it.
Meanwhile here are a few more numbers: A few months after the initial launch we now have over 100,000 accounts, and we’ve accumulated millions of video views. Even with the course in a self-serve dormant state, each day there are a couple of thousand video views and around 100 assignments submitted for automated grading. All to learn about databases! Wow. Check it out at db-class.org.
Blogger’s Profile: Jennifer Widom is the Fletcher Jones Professor and (currently) Chair of the Computer Science Department at Stanford University. She was a Research Staff Member at the IBM Almaden Research Center before joining the Stanford faculty in 1993. Her research interests span many aspects of nontraditional data management. She is an ACM Fellow and a member of the National Academy of Engineering and the American Academy of Arts & Sciences; she received the ACM SIGMOD Edgar F. Codd Innovations Award in 2007.