After hanging out exclusively with the database community for 35+ years, I’ve recently become more involved with the systems research community. I have a few observations and recommendations to share.Much of the work published in systems conferences covers topics that would have a natural home in database conferences. For example, transactions and data streams are currently in vogue in the systems field. Here are some recent Best-Paper-award topics: data streams at SOSP 2013, data-parallel query processing at OSDI 2008, and transactions OSDI 2012 and SOSP 2007. Although this topic overlap is increasing lately, it is not just a recent phenomenon. In the last 10+ years, the systems community has taken an interest in data replication, fault tolerance, key-value stores, data-parallel processing (i.e., map-reduce), query processing in sensor networks, and record caching. Some of this work has had a major impact on the database field, e.g., map-reduce and multi-master replication.
Yet despite this overlap of topics, there has been remarkably little attendance of database researchers at systems conferences or of systems researchers at database conferences. Not a good thing. And until 2013, I’ve been part of the problem.
To help break down this barrier, SIGMOD and SIGOPS have sponsored the annual Symposium on Cloud Computing (SoCC) since 2010. Personally, I’ve found attending SoCC’s to be time well spent, and I had a good experience attending ICDCS 2013 in July. So I decided to attend my first SOSP in October, 2013. It was great. The fraction of that program of direct interest to me was as large as any database conference, including work on transactions, data streams, replication, caching, fault tolerance, and scalability. I knew there wouldn’t be a huge number of database attendees, but I wasn’t prepared to be such an odd duck. At most a dozen attendees out of 600 were card-carrying members of the database community. With few database friends to hang out with, I got to meet a lot more new people than I would at a database event and hear about projects in my areas that I’d wouldn’t otherwise have known about. Plus, I could advertise my own work to another group of folks.
If you do systems-oriented database research, then how about picking one systems conference to attend this year? NSDI in Seattle in April, ICDCS in Madrid in late June, or OSDI in Colorado in October. And if you see a systems researcher looking lonely at a database conference, then strike up a conversation. If they feel welcome, perhaps they’ll tell others and we’ll see more systems attendees at database events.
Given the small overlap of attendees in systems and database conferences, we shouldn’t be surprised that the communities have developed different styles of research papers. Since systems papers typically describe mechanisms lower on the stack, they often focus on broader usage scenarios. They value the “ities” more than database papers, e.g., scalability, availability, manageability, and security. They favor simple ideas that work robustly over complex techniques that report improvements for some inputs. They expect a paper to work through the system details in a credible prototype, and to report on lessons learned that are more broadly applicable. They expect more micro-benchmarks that explain the source of performance behavior that’s observed, rather than just benchmarks that model usage scenarios, such as TPC.
From attending systems conferences, serving on SoCC program committees (PC’s), and submitting an ill-fated paper to a systems conference, I learned a few things about systems conferences that we database folks could learn from:
1. More of their conferences are single-track, and they leave more time for Q&A. This leads to more-polished presentations. When you’re presenting a paper to 600 attendees, doing it badly is a career-limiter.
I’ve always liked CIDR, not only because of its system-building orientation, but also because it’s single-track. I’m forced to attend sessions on topics I normally would ignore, and hence I learn more. I think we should try making one of our big conferences single-track: SIGMOD, VLDB or ICDE. One might argue that too many excellent papers are submitted, so it’s impractical. I disagree. Here’s one way to do it: Have the same acceptance rate as in the past, and all papers are published as usual. However, the PC selects only one-track’s worth of papers for presentation slots. The other papers are presented in poster sessions that have no competing parallel sessions. The single-track enables us to learn about more topics, the density of great presentations is higher, and the poster sessions enable us to dig deep on papers of interest to us (as some of our DB conferences already enable us to do). Unless our field stops growing, we really have to do something like this. Our conferences have already gotten out of hand with as many as seven parallel sessions.
2. When describing related work, systems papers cast a wider net. Their goal is often not simply to demonstrate that the paper’s contributions are novel, but also to educate the reader about lines of work that are loosely related to the paper’s topic. To help enable this breadth, they’ve recently settled on a formula where references don’t count toward the paper’s maximum page count. We should adopt this view in the database community. One self-serving benefit is that papers would cite more references, which would increase all of our citation counts.
3. There’s a widespread belief that systems PC’s write longer, more constructive reviews. I think DB PC’s have been improving in this respect, but on average we’re not up to the systems’ standard.
4. They usually shepherd all papers, to ensure that authors incorporate recommended changes. This also gives the authors someone to ask about ambiguous recommendations. We should do this too.
5. They typically require 10pt font. As a courtesy to those of us over a certain age, whose eyes aren’t what they used to be, we should do this, and increase the page count to maintain the same paper length.
6. At SOSP and OSDI, they post all slide decks and videos of all presentations. Obviously worthwhile. We sometimes post slide decks and videos only for plenaries. Let’s do better.
There are a few aspects of systems conferences that I’m less enthusiastic about.
1. Systems conferences favor live PC meetings. They do have benefits. It’s educational for PC members to hear discussions of papers they didn’t review and for young researchers to see how decisions are reached. Since all PC members hear every paper discussion, non-reviewers can bring up points that neutralize inappropriate criticisms or raise issues that escaped reviewers’ attention, which reduces the randomness of decisions. However, there are also negatives. In borderline cases, the most articulate, quick-thinking, and extroverted PC members have an inappropriate advantage in getting their way. And inevitably, some PC members can’t attend the meeting, due to a schedule conflict or the travel expense, so the papers they reviewed get short shrift.
On balance, I don’t think the decisions produced by live PC meetings are enough better than on-line discussions to be worth what they cost. Perhaps we can get some of the benefits of a PC meeting by allowing PC members to see the reviews and discussions of all papers for which they don’t have a conflict, late in the discussion period (which we did some years ago). With large PC’s, this might constitute public disclosure, in which case patents would have to be filed before submission, which will delay the publication of some work. But if we think a broader vetting of papers is beneficial, this might be worth trying.
2. During the reviewing process of a systems conference, if a paper seems borderline, then the PC chairs solicit more reviews. These reviews do offer a different perspective on the interest-value of the work. But they are usually not as detailed as the initial ones and may be by less-expert reviewers. It’s common to receive 5 or 6 reviews of a paper submitted to a systems conference. As an author it’s nice to get this feedback. And it might reduce the randomness of decisions. But it significantly increases the reviewing load on PC’s. In database PC’s, we rarely, if ever, do this. In my opinion, it’s usually better to hold reviewers’ feet to the fire and make them decide. Still, we’d benefit from getting 4 or 5 reviews more often than never, but not as often as systems conferences.
3. Systems conferences publish very few industry papers and rarely have an industry track. In my experience, an industry paper is judged just like a research paper, but with a somewhat lower quality bar. This is unfortunate. Researchers and practitioners benefit from reading about the functionality and internals of state-of-the-art products, even if they are only modestly innovative, especially if they are widely used. The systems community would serve their audience better by publishing more such papers.
4. The reviewing load for systems PC’s is nearly double that of DB PC’s, so systems PC’s are proportionally smaller. E.g., SOSP 2013 had 160 submissions and 28 PC members. If each paper had three reviews, that’s 17 reviews per PC member. With five reviews/paper, that’s 29 reviews per PC member. If DB conferences cranked up the reviewing load, then we’d probably participate in fewer PC’s, so the workload on each of us might be unchanged. A PC member would see more of the submissions in each area of expertise, which might help reduce the randomness of decisions … maybe. Overall, I don’t see a compelling argument to change, but it’s debatable.
5. Like some DB conferences, many systems conferences have adopted double-blind reviewing. I am not a fan. For papers describing a system project, I find it hard to review and, in some cases, impossible to write a paper for a double-blind conference. I have chosen not to submit some papers to double-blind conferences. I know I am not alone in this.
In summary, the systems and DB areas have become much closer in recent years. Each community has something to learn from the other area’s technical contributions and point-of-view and from its approach to conferences and publications. We really would benefit from talking to each other a lot more than we do.
Acknowledgments: My thanks to Gustavo Alonso, Surajit Chaudhuri, Sudipto Das, Sameh Elnikety, Mike Franklin, and Sergey Melnik for contributing some of the ideas in this blog post.
| Blogger’s Profile:
Philip A. Bernstein is a Distinguished Scientist at Microsoft Research. Over the past 35 years, he has been a product architect at Microsoft and Digital Equipment Corp., a professor at Harvard University and Wang Institute of Graduate Studies, and a VP Software at Sequoia Systems. During that time, he has published papers and two books on the theory and implementation of database systems, especially on transaction processing and data integration, which are still the major focus of his research. He is an ACM Fellow, a winner of the ACM SIGMOD Innovations Award, a member of the Washington State Academy of Sciences and a member of the National Academy of Engineering. He received a B.S. degree from Cornell and M.Sc. and Ph.D. from University of Toronto.
Archive for Databases
In 2011, I was traveling on the Kolkata Metro when I made an interesting observation. Nearly every passenger around me was using a computing device of some sort. Some were playing with their smartphones, some with their tablets, some listening to music on their iPod Touches. Each was a consumer of structured data of some sort; interacting with their email, Facebook, or music catalog. Weeks later, while on a flight, this pattern came up again — every seat in the plane was equipped with a touch-driven entertainment system, some used while sifting through book pages on e-Readers. Clearly, the notion of a “computer” for the future generation looks very different from the past!
Over time, we seem to be converging to a vision where we are surrounded by computing devices of varying capacities, with users performing both simple and complex tasks, with a common attribute — these devices do not possess keyboards. In 2011, smartphones and tablets outsold workstations and portable computers by 1.5x. In 2013, this ratio is projected to reach 4x. Based on this trend, non-keyboard interaction (typically in the form of gestures) is on-track to be the dominant mode of data interaction.
Gesture-driven applications pose a very different workload to underlying databases than traditional ones. Consider the idea of scrolling through a page of text, which is typically implemented as a
Another example of an interesting gestural interaction is that of the Interactive Join. As part of the GestureDB system we are developing in my group, the GestureQuery multitouch interface lets you compose two relations into an equijoin by simply dragging them together on the specific attributes:
In the interactive join, assuming all attributes are of the same data type, there could be M x N possible
The following interactive demo lets you play around with a real multitouch interaction trace from one of our user studies, collected from interactions with our GestureQuery iPad prototype2. The subject was asked to perform an equijoin
This interaction is an great example of query intent transition. As the gesture is being articulated, the distribution changes rapidly. The “Proximity Score” is the score for
Just like the previous example, challenges arise here. Can a database provide join results at such low latency for such a diverse number of queries being fired at the same time, given such rapidly changing probabilities? It is not practical for the database to execute them all in parallel. Given that we know the domain of possible queries, is there scope for multiquery optimization? Second, given an interactive threshold of say 100ms for the results, is it feasible (or even useful) to generate full results for all queries? Should we consider partial query execution? Or can we simply sample the tables to provide interactive (albeit inaccurate) results? Third, feedback for a query is useful only for a fleeting moment during the gesture articulation. How could we show parts (i.e. rank tuples) of the result that are most useful?
Clearly, when working with gesture-driven interfaces and application layers, the gestural workloads of database queries pose a smorgasbord of challenges, that might require rethinking of the entire database stack. As these interfaces evolve, the underlying database systems will have to keep up with changes in user expectations and demands. With the explosion of new and novel gestural interfaces, natural interfaces and casual computing lately, it’s an exciting time for database research!
| Blogger’s Profile:
Arnab Nandi is an Assistant Professor of Computer Science & Engineering at The Ohio State University. Arnab’s research is in the area of database systems, focusing on challenges in big data analytics and data interaction. The goal of his group’s work is to allow end-users to explore and interact with semi-structured and structured data. Prior to joining Ohio State, Arnab did his Ph.D. at the University of Michigan, Ann Arbor, with stints at Google, Yahoo! Research, and Microsoft Research.
Figure 1 shows a portion of a relational table contained in a real, large information system. The table concerns the customers of an organization, where each row stores data about a single customer. The first column contains her code (if the code is negative, then the record refers to a special customer, called “fictitious”), columns 2 and 3 specify the time interval of validity for the record, ID_GROUP indicates the group the customer belongs to (if the value of FLAG_CP is “S”, then the customer is the leader of the group, and if FLAG_CF is “S”, then the customer is the controller of the group), FATTURATO is the annual turnover (but the value is valid only if FLAG_FATT is “S”). Obviously, each notion mentioned above (like “fictitious”, “group”, “leader”, etc.) has a specific meaning in the organization, and understanding such meaning is crucial if one wants to correctly manage the data in the table and extract information out of it. Similar rules hold for the other 47 columns that, for lack of space, are not shown in the figure.
Figure 1: A portion of the Customer table in a database of a large organization.
Those who have experience of large databases, or databases that are part of large information systems will not be surprised to see such complexity in a single data structure. Now, think of a database with many tables of this kind, and try to imagine a poor final user accessing such tables to extract useful information. The problem is even more severe if one considers that information systems in the real world use different (often many) heterogeneous data sources, both internal and external to the organization. Finally, if we add to the picture the (inevitable) need of dealing with big data, and consider in particular the two v’s of “volume” and “velocity”, we can easily understand why effectively accessing, integrating and managing data in complex organizations is still one of the main issues faced by IT industry nowadays.
Issues in governing complex information systems
The above is a simple example motivating the claim that governing the resources (data, meta-data, services, processes, etc.) of modern information systems is still an outstanding problem. I would like to go in more detail on three important aspects related to this issue.
Accessing and querying data. Although the initial design of a collection of data sources might be adequate, corrective maintenance actions tend to re-shape them into a form that often diverges from the original structure. Also, they are often subject to changes so as to adapt to specific, application-dependent needs. Analogously, applications are continuously modified for accommodating new requirements, and guaranteeing their seamless usage within the organization is costly. The result is that the data stored in different sources and the processes operating over them tend to be redundant, mutually inconsistent, and obscure for large classes of users. So, query formulation often requires interacting with IT experts who know where the data are and what they mean in the various contexts, and can therefore translate the information need expressed by the user into appropriate queries. It is not rare to see organizations where this process requires domain experts to send a request to the data management staff and wait for several days (or even weeks, at least in some Public Administrations in Italy…) before they receive a (possibly inappropriate) query in response. In summary, it is often exceedingly difficult for end users to single out exactly the data that are relevant for them, even though they are perfectly able to describe their requirement in terms of business concepts.
Data quality. It is often claimed that data quality is one of the most important factors in delivering high value information services. However, the above-mentioned scenario poses several obstacles to the goal of even checking data quality, let alone achieving a good level of quality in information delivery. How can we possibly specify data quality requirements, if we do not have a clear understanding of the semantics that data should bring? The problem is sharpened by the need of connecting to external data, originating, for example, from business partners, suppliers, clients, or even public sources. Again, judging about the quality of external data, and deciding whether to reconcile possible inconsistencies or simply adding such data as different views, cannot be done without a deep understanding of their meaning. Note that data quality is also crucial for opening data to external organizations. The demand of greater openness is irresistible nowadays. Private companies are pushed to open their resources to third parties, so as to favor collaborations and new business opportunities. In public administrations, opening up data is underpinning public service reforms in several countries, by offering people informed choices that simply have not existed before, thus driving towards improvements in the services to the citizens.
Process and service specification. Information systems are crucial artifacts for running organizations, and organizations rely not only on data, but also, for instance, on processes and services. Designing, documenting, managing, and executing processes is an important aspect of information systems. However, specifying what a process/service does, or which characteristics it is supposed to have, cannot be done correctly and comprehensively without a clear specification of which data the process will access, and how it will possibly change such data. The difficulties of doing that in a satisfactory way come from various factors, including the lack of modeling languages and tools for describing process and data holistically. However, the problems related to the semantics of data that we discussed above undoubtedly make the task even harder.
A (new?) paradigm
In the last five years, I (and my group in Roma) have been working on a new paradigm addressing these issues, based on the use of knowledge representation and reasoning techniques, and I want to share my excitement about it with the readers of this blog. The paradigm is called “Ontology-based Data Management” (OBDM), and requires structuring the information system into four layers.
The distinguishing feature of the whole approach is that users of the system will be freed from all the details of how to use the resources, as they will express their needs in the terms of the DKB. The system will reason about the DKB and the mappings, and will reformulate the needs in terms of appropriate calls to services provided by resources. Thus, for instance, a user query will be formulated over the domain ontology, and the system will reason upon the ontology and the mappings to call suitable queries over data sources that will compute the answers to the original user query.
As you can see, the heart of the approach is the DKB, and the core of the DKB is the ontology. So, what is new? Indeed, I can almost hear many of you saying: what is the difference with data integration (where the global schema plays the role of the ontology)? And what is the difference with conceptual modeling (where the conceptual schema plays the role of the ontology)? And what about Knowledge Representation in AI (where the axioms of the knowledge base play the role of the DKB)? The answer is simple: almost none. Indeed, OBDA builds on all the above disciplines (and others), but with the goal of going beyond what they currently provide for solving the problems that people encounter in the governance of complex information systems. At the same time, there are a few (crucial, at least for me) facts that make OBDA a novel paradigm to experiment and study. Here is a list of the most important ones:
A lot more to do
A few research groups are experimenting OBDM in practice (see, for example, the Optique IP project, financed by the Seventh Framework Program (FP7) of the European Commission). In Rome, we are involved in applied projects both with Public Administrations, and with private companies. One of the experiences we are carrying out is with the Department of Treasury of the Italian Ministry of Economy and Finance. In this project, three ontology experts from our department worked with three domain experts for six months, and built an ontology of 800 elements, with 3000 DL-Lite axioms, and 800 mapping assertions to about 80 relational tables. The ontology is now used as a common framework for all the applications, and will constitute the main document specifying the requirement for the restructuring of the information system that will be carried out in the next future. We are actually lucky to live in Rome, not only because it is a magnificent city, but also because Italian Public Administrations, many of which are located in the Eternal City, provide perfect examples of all the problems that make OBDM interesting and potentially useful…
The first experiences we have conducted are very promising, but OBDM is a young paradigm, and therefore it needs attention and care. This means that there are many issues to be addressed to make it really effectively work in practice. Let me briefly illustrate some of them. One big issue is how to build and maintain the ontology (and, more generally, the DKB). I know that this is one of the most important criticisms to all the approaches requiring a considerable modeling effort. My answer to these is that all modeling efforts are investments, and when we judge about investments we should talk not only about costs, but also about benefits. Also, take into account that OBDM works in a “pay-as-you-go” fashion: users have interesting advantages even with a very incomplete domain description, as the system can reason about an incomplete specification, and try to get the best out of it. Another important issue is evolution. Evolution in OBDM concerns not only the data at the sources (updates), but also the ontology, and the mappings. Indeed, both the domain description, and the resources continue to evolve, and all the components of the system should keep up with these modifications. Not surprisingly, this is one issue where more research is still desperately needed. Overall, the DKB and the mappings constitute the meta-data of the OBDM system, and in complex organizations, such meta-data can be huge and difficult to control and organize. Talking in terms of a fashionable terminology, with OBDM we face not only the problem of Big Data, but also the problem of Big Meta-Data. Another issue that needs to be further studied and explored is the relationship between the static aspects and the dynamic aspects of the DKB, together with the problem of mapping processes and services specified at the conceptual level to computational resources in applications.
I really hope that this blog have somehow triggered your attention to OBDM, and that you will consider looking at it more closely, for example for carrying out some experiments, or for doing research on at least some of its many open problems that still remain to be studied.
| Blogger’s Profile:
Maurizio Lenzerini is a full professor in Computer Science and Engineering at the University of Rome La Sapienza, where he is leading a research group on Databases and Artificial Intelligence. His main research interests are in database theory, data and service and integration, ontology languages, knowledge representation and reasoning, and component-based software development. He is a former Chair and a current member of the Executive Committee of ACM PODS (Principles of Database Systems). He is an ACM Fellow, an ECAI (European Coordinating Committee for Artificial Intelligence) fellow, and a member of the Academia Europaea – The Academy of Europe.
Big Data should be Interesting Data!
There are various definitions of Big Data; most center around a number of V’s like volume, velocity, variety, veracity – in short: interesting data (interesting in at least one aspect). However, when you look into research papers on Big Data, in SIGMOD, VLDB, or ICDE, the data that you see here in experimental studies is utterly boring. Performance and scalability experiments are often based on the TPC-H benchmark: completely synthetic data with a synthetic workload that has been beaten to death for the last twenty years. Data quality, data cleaning, and data integration studies are often based on bibliographic data from DBLP, usually old versions with less than a million publications, prolific authors, and curated records. I doubt that this is a real challenge for tasks like entity linkage or data cleaning. So where’s the – interesting – data in Big Data research?
Surely, companies have their own interesting data, and industrial labs have access to such data and real-life workloads. However, all this is proprietary and out of reach for academic research. Therefore, many researchers resort to the good old TPC-H benchmark and DBLP records and call it Big Data. Insights from TPC-H and DBLP are, however, usually not generalizable to interesting and truly challenging data and workloads. Yes, there are positive exceptions; I just refer to a general trend.
Looking Across the Fence: Experimental Data in other Research Communities
Now that I got you alerted, let me be constructive. I have also worked in research communities other than database systems: information retrieval, Web and Semantic Web, knowledge management (yes, a bit of AI), and recently also computational linguistics (aka. NLP). These communities have a different mindset towards data resources and their use in experimental work. To them, data resources like Web corpora, annotated texts, or inter-linked knowledge bases are vital assets for conducting experiments and measuring the progress in the field. These are not static benchmarks that are defined once every ten years; rather, relevant resources are continuously crafted and their role in experiments is continuously re-thought. For example, the IR community has new experimental tasks and competitions in the TREC, INEX, and CLEF conferences each year. Computational linguistics has an established culture of including the availability of data resources and experimental data (such as detailed ground-truth annotations) in the evaluation of submissions to their top conferences like ACL, EMNLP, CoNLL, and LREC. Review forms capture this aspect as an important dimension for all papers, not just for a handful of specific papers tagged Experiments & Analyses.
Even the Semantic Web community has successfully created a huge dataset for experiments: the Web of Linked Data consisting of more than 30 Billion RDF triples from hundreds of data sources with entity-level sameAs linkage across sources. What an irony: ten years ago we database folks thought of Semantic Web people as living in the ivory tower, and now they have more data to play with than we (academic database folks) can dream of.
Towards a Culture Shift in Our Community
Does our community lack the creativity and agility that other communities exhibit? I don’t think so. Rather I believe the problem lies in our publication and experimental culture. Aspects of this topic were discussed in earlier posts on the SIGMOD blog, but I want to address a new angle. We have over-emphasized publications as an achievement by itself: our community’s currency is the paper count rather than the intellectual insight and re-usable contribution. Making re-usable software available is appreciated, but it’s a small point in the academic value system when it comes to hiring, tenure, or promotion decisions. Contributing data resources plays an even smaller role. We need to change this situation by rewarding work on interesting data resources (and equally on open-source software): compiling the data, making it available to the community, and using it in experiments.
There are plenty of good starting points. The Web of Linked Data, with general-purpose knowledge bases (DBpedia, Freebase, Yago) and a wealth of thematically focused high-quality sources (e.g., musicbrainz, geospecies, openstreetmap, etc.), is a great opportunity. This data is huge, structured but highly heterogeneous, and includes substantial parts of uncertain or incomplete nature. Internet archives and Web tables (embedded in HTML pages) are further examples; enormous amounts of interesting data are easily and legally available by crawling or download. Finally, in times when energy, traffic, environment, health, and general sustainability are key challenges on our planet, more and more data by public stakeholders is freely available. Large amounts of structured and statistical data can be accessed at organizations like OECD, WHO, Eurostat, and many others.
Merely pointing to these opportunities is not enough. We must give more incentives that papers do indeed provide new interesting data resources and open-source software. The least thing to do is to extend review reports to include the contribution of novel data and software. A more far-reaching step is to make data and experiments an essential part of the academic currency: how many of your papers contributed data resources, how many contributed open-source software? This should matter in hiring, tenure, and promotion decisions. Needless to say, all this applies to non-trivial, value-adding data resource contributions. Merely converting a relational database into another format is not a big deal.
I believe that computational linguistics is a great role model for experimental culture and the value of data. Papers in premier conferences earn extra credit when accompanied with data resources, and there are highly reputed conferences like LREC which are dedicated to this theme. Moreover, papers of this kind or even the data resources themselves are frequently cited. Why don’t we, the database community, adopt this kind of culture and give data and data-driven experiments the role that they deserve in the era of Big Data?
Is the Grass Always Greener on the Other Side of the Fence?
Some people may argue that rapidly changing setups for data-driven experiments are not viable in our community. In the extreme, every paper could come with its own data resources, making it harder to ensure the reproducibility of experimental results. So we should better stick to established benchmarks like TPC-H and DBLP author cleaning. This is the opponent’s argument. I think the argument that more data resources hinder repeatability is flawed and merely a cheap excuse. Rather, a higher rate of new data resources and experimental setups goes very well with calling upon the authors’ obligation to ensure reproducible results. The key is to make the publication of data and full details of experiments mandatory. This could be easily implemented in the form of supplementary material that accompanies paper submissions and, for accepted papers, would also be archived in the publication repository.
Another argument could be that Big Data is too big to effectively share. However, volume is only one of the criteria for making a dataset Big Data, that is, interesting for research. We can certainly make 100 Gigabytes available for download, and organizations like NIST (running TREC), LDC (hosting NLP data), and the Internet Archive prove that even Terabytes can be shared by asking interested teams to pay a few hundred dollars for shipping disks.
A caveat that is harder to counter is that real-life workloads are so business-critical that they can impossibly be shared. Yes, there were small scandals about query-and-click logs from search engines as they were not properly anonymized. However, the fact that engineers did not do a good job in these cases does not mean that releasing logs and workloads is out of the question. Why would it be impossible to publish a small representative sample of analytic queries over Internet traffic data or advertisement data? Moreover, if we focus on public data hosted by public services, wouldn’t it be easy to share frequently posed queries?
Finally, a critical issue to ponder on is the position of industrial research labs. In the SIGMOD repeatability discussion a few years ago, they made it a point that software cannot be disclosed. Making experimental data available is a totally different issue, and would actually avoid the problem with proprietary software. Unfortunately, we sometimes see papers from industrial labs that show impressive experiments, but don’t give details nor any data and leave zero chance for others to validate the papers’ findings. Such publications that crucially hinge on non-disclosed experiments violate a major principle of good science: the falsifiability of hypotheses, as formulated by the Austrian-British philosopher Karl Popper. So what should industrial research groups do (in my humble opinion)? They should use public data in experiments and/or make their data public (e.g., in anonymized or truncated form, but in the same form that is used in the experiments). Good examples in the past include the N-gram corpora that Microsoft and Google released. Papers may use proprietary data in addition, but when a paper’s contribution lives or dies with a large non-disclosed experiment, the paper cannot be properly reviewed by the open research community. For such papers, which can still be insightful, conferences have industrial tracks.
Last but not least, who could possibly act on this? Or is all this merely public whining, without addressing any stakeholders? An obvious answer is that the steering boards and program chairs of our conferences should reflect and discuss these points. It should not be a complex maneuver to extend the reviewing criteria for the research tracks of SIGMOD, VLDB, ICDE, etc. This would be a step in the right direction. Of course, truly changing the experimental culture in our community and influencing the scholarly currency in the academic world is a long-term process. It is a process that affects all of us, and should be driven by each of you. Give this some thought when writing your next paper with data-intensive experiments.
The above considerations are food for thought, not a recipe. If you prefer a concise set of tenets and recommendations at the risk of oversimplification, here is my bottom line:
Overall, we need a culture shift to encourage more work on interesting data for experimental research in the Big Data wave.
| Blogger’s Profile:
Gerhard Weikum Gerhard Weikum is a Research Director at the Max-Planck Institute for Informatics (MPII) in Saarbruecken, Germany, where he is leading the department on databases and information systems. He is also an adjunct professor in the Department of Computer Science of Saarland University in Saarbruecken, Germany, and he is a principal investigator of the Cluster of Excellence on Multimodal Computing and Interaction. Earlier he held positions at Saarland University in Saarbruecken, Germany, at ETH Zurich, Switzerland, at MCC in Austin, Texas, and he was a visiting senior researcher at Microsoft Research in Redmond, Washington. He received his diploma and doctoral degrees from the University of Darmstadt, Germany.
Fifty years ago a small team working to automate the business processes of the General Electric Low Voltage Switch Gear Department in Philadelphia built the first functioning prototype of a database management system. The Integrated Data Store was designed by Charles W. Bachman, who later won the ACM’s Turing Award for the accomplishment. He was the first Turing Award winner without a Ph.D., the first with a background in engineering rather than science, and the first to spend his entire career in industry rather than academia.
The exact anniversary of IDS is hard to pin down. Detailed functional specifications for the system were complete by January 1962, and Bachman was presenting details of the planned system to GE customers by May of that year. It is less clear from archival materials when the system first ran, but Bachman’s own recent history of IDS suggests that a prototype was operational by the end of that year.
According to this May 1962 presentation, initial implementation of IDS was expected to be finished by December 1962, with several months of field testing and debugging to follow. Image courtesy of Charles Babbage Institute.
The technical details of IDS, Bachman’s life story, and the context in which it arose have all been explored elsewhere in some detail. He also founded a public company, played a leading role in formulating the OSI seven layer model for data communications, pioneered online transaction processing, and devised the first data modeling notation. Here I am focused on two specific questions:
(1) why do we view IDS as the first database management system, and
(2) what were its similarities and differences with later systems?
There will always be an element of subjectivity in such judgments about “firsts,” particularly as IDS predated the concept of a database management system and so cannot be compared against definitions from the time period. I have elsewhere explored the issue in more depth, stressing the way in which IDS built on early file management and report generation systems and the further evolution of database ideas over the next decade. As a fusty historian I value nuance and am skeptical of the idea that any important innovation can be fully understood by focusing on a single moment of invention.
However, if any system deserves the title of “first database management system” then it is clearly IDS. It served as a model for the earliest definitions of “data base management system” and included most of the core capabilities later associated with the concept.
What was IDS For?
Bachman was not, of course, inspired to create IDS as a contribution to the database research literature. For one thing there was no database research community. At the start of the 1960s computer science was beginning to emerge as an academic field, but its early stars focused on programming language design, theory of computation, numerical analysis, and operating system design. The phrase “data base” was just entering use but was not particularly well established and Bachman’s choice of “data store” would not have seemed any more or less familiar at the time. In contrast to this academic neglect, the efficient and flexible handling of large collections of structured data was the central challenge for what we would now call corporate information systems departments, and was then called business data processing.
During the early 1960s the hype and reality of business computing diverged dramatically. Consultants, visionaries, business school professors, and computer salespeople had all agreed that the best way to achieve real economic payback from computerization was to establish a “totally integrated management information system.” This would integrate and automate all the core operations of a business, ideally with advanced management reporting and simulation capabilities built right in. The latest and most expensive computers of the 1960s had new capabilities that seemed to open the door to a more aggressive approach. Compared to the machines of the 1950s they had relatively large memories, featured disk storage as well as tape drives, could process data more rapidly, and some had even been used to drive interactive terminals in specialized applications. Unfortunately the reality of data processing changed much more slowly, and remained focused on simple administrative applications that batch processed large files of records to accomplish discrete tasks such as weekly payroll processing, customer statement generation, or accounts payable reports.
Many companies announced their intention to build totally integrated management information systems, but few ever claimed significant success. A modern reader would not be shocked to learn that firms were unable to create systems of comparable scope to today’s Enterprise Resources Planning and data warehouse projects using computers with perhaps the equivalent of 64KB of memory, no real operating system, and a few megabytes of disk storage. Still, even partially integrated systems covering significant portions of a business with flexible reporting capabilities would have real value. The biggest challenges to even modest progress towards this goal were the sharing of data between applications and the effective use of random access disk storage by application programmers.
Data processing techniques had evolved directly from those used with pre-computer mechanical punched card machines. The concepts of files, fields, keys, grouping, merging data from two files, and the hierarchical combination of master and detail records within a single file all predated electronic computers. These worked with magnetic tape much as they had done with punched cards, except that sorting was actually much harder with tape. Getting a complex job done might involve dozens of small programs and the generation of many working tapes full of intermediate data. These banks of whirring tape drives provided computer centers with their main source of visual interest in the movies of the era. However the formats of tape files were very inflexible and were usually fixed by the code of the application programs working with the data. Every time a field was added or changed all the programs working with the file would need to be rewritten. But if applications were integrated, for example by having order records from the sales accounting system automatically exported as input for the production scheduling application, then the resulting web of dependencies would make it ever harder to carry out even minor changes in response to shifting business needs.
This 1962 diagram, drawn by Stanley Williams, sketched the complex dependencies between different records involved in the production planning process. Courtesy Charles Babbage Institute.
The other key challenge was making effective use of random access storage in business application programs. Sequential tape storage was conceptually simple, and the tape drives themselves provided some intelligence to aid programmers in reading or writing records. But the only really practical computer applications were batch-oriented because searching a tape to find or update a particular record was too slow to be practical. Instead, master files were periodically updated with accumulated data or read through to produce reports. The arrival in the early 1960s of disk storage a programmer theoretically made it possible to apply updates one at a time as new data came in, or to create interactive systems that could respond to requests immediately. A programmer could easily instruct the drive to pull data from any particular platter or track, but the hard part was figuring out where on the disk the desired record could be found. Harnessing the power of the new technology meant finding ways to order, insert, delete, or search for records that did not simply replicate the sequential techniques used with tape. Solutions such as indexing, inverted files, hashing, linked lists, chains and so on were quickly devised but these were relatively complex to implement and demanded expert judgment to select the best method for a particular task. In addition, application programmers were beginning to shift from assembly language to high level languages such as COBOL. Business oriented languages included high level support for working with structuring data in tape files but lacked comparable support for random access storage. Without significant disk file management support from the rudimentary operating systems of the era only elite programmers could hope to create of an efficient random access application.
This image, from a 1962 internal General Electric document, conveyed the idea of random access storage using a set of “pigeon holes” in which data could be placed. Courtesy of Charles W. Bachman.
IDS was intended to substantially solve these two problems, so that applications could be integrated to share data files and ordinary programmers could effectively develop random access applications using high level languages. Bachman designed it to meet the needs of an integrated systems project run as an experimental prototype within General Electric by the group of systems-minded specialists he was working for at its corporate headquarters. General Electric had many factories spread over its various divisions, and could not produce a different integrated system for each one. Furthermore it was entering the computer business, and recognized that a flexible and generic integrated system based on disk storage would be a powerful tool in selling its machines to other companies.
Was IDS a Data Base Management System?
IDS carried out what we still consider the core task of a database management system by interposing itself itself between application programs and the files in which they stored data. Programs could not manipulate data files directly, instead making calls to IDS so that it would perform the rested operation on their behalf.
Like modern database management systems, IDS explicitly stored and manipulated metadata about the records and their relationships, rather than expecting each application program to understand and respect the format of every data file it worked with. It could enforce relationships between different record types, and would protect database integrity. Database designers would specify indexes and other details of record organization to boost performance based on expected usage patterns. However the first versions it did not include a formal data manipulation language. Instead of being defined through textual commands the metadata was punched onto specially formatted input cards. A special command told IDS to read and apply this information.
IDS was designed to be used with high level programming languages. In the initial prototype version, operational in 1962, this was General Electric’s own GECOM language, though performance and memory concerns drove Bachman’s team to shift to assembly language for the application programming in a higher performance version completed in 1964. Part of IDS remained resident in memory while application programs were executed. Calls to IDS operations such as store, retrieve, modify, and delete were interpreted at runtime against the latest metadata and then executed. As high level languages matured and memory grew less scarce, later version of IDS were oriented towards application programs written in COBOL.
This provided a measure of what is now called data independence for programs. If a file was restructured to add fields or modify their length then the programs using it would continue to work properly. Files could be moved around and reorganized without rewriting application programs. That made running different application programs against the same database much more feasible. IDS also included its own system of paging data in and out of memory, to create a virtual memory capability transparent to the application programmer.
The concept of transactions is fundamental to modern database management systems. Programmers specify that a series of interconnected updates must take place together, so that if one fails or is undone they all are. IDS was also transaction oriented, though not in exactly the same sense. It took over the entire computer, which had only 8,000 words of memory. Bachman devised an innovative transaction processing system, which he called the Problem Controller. Bachman’s original version of IDS was not loaded by application programs when needed to handle a data operation. Instead IDS reversed the relationship: it ran when the computer booted and loaded application programs as needed. Only one application program ran at a time. Requests from users to run particular programs were read from “problem control cards” and buffered as IDS records. The computer worked its way through the queue of requests, updating it after each job was finished. By 1965 an improved version of this system was in use at Weyerhauser, on a computer hooked up to a national teletype network. Requests for reports and data submissions were inserted directly into the queue by remote users.
Bachman’s original prototypes lacked strong backup and recovery systems, key features of later database management systems, but this was added as early as 1964 when IDS was first being prepared as a package for distribution to General Electric’s customers. A recovery tape logged memory pages modified by each transaction, so that the database could be restored to a consistent state if something went wrong before the transaction was completed. The same tape served as an incremental backup of changes since the last full backup.
This first packaged version of IDS did lack some features later viewed as essential for database management systems. One was the idea that specific users could be granted or denied access to particular parts of the database. This was related to another limitation: IDS databases could be queried or modified only by writing and executing programs in which IDS calls were included. There was no interactive capability to request “ad hoc” reports or run one-off queries without having to write a program. Easy to use report generator systems (such as 9PAC and MARK IV) and online interactive data management systems (such as TDMS) were created during the 1960s but they were generally seen as a separate class of software from data base management systems. (By the 1970s these packages were still popular, but included optional modules to interface with data stored in database management systems).
IDS and CODASYL
After Bachman handed IDS over to a different team within General Electric in 1964 it was made available as a documented and supported software package for the company’s 200 series computers. Later versions supported its 400 and 600 series systems. New versions followed in the 1970s after Honeywell brought out General Electric’s computer business. IDS was a strong product, in many respects more advanced than IBM’s IMS which appeared several years later. However IBM machines dominated the industry so software from other manufacturers was doomed to relative obscurity whatever its merits. In those days software packages from computer manufactures were paid for by hardware sales and given to customers without an additional charge.
During the late 1960s the ideas Bachman created for IDS were taken up by the Database Task Group of CODASYL, a standards body for the data processing industry best known for its creation and promotion of the COBOL language. Its 1969 report drew heavily on IDS in defining a proposed standard for database management systems, in part thanks to Bachman’s own service on the committee. In retrospect, the committee’s work, and a related effort by CODASYL’s Systems Committee to evaluate existing systems within the new framework, were significant primarily for formulating and spreading the concept of a “data base management system.”
CODASYL’s definition of the architecture of a database management system and its core capabilities was quite close to that included in textbooks to this day. In particular, it suggested that a data base management system should support on-line, interactive applications as well as batch driven applications and have separate interfaces. COSASYL’s initial report, published in 1969, documented foundational concepts and vocabulary such as data definition language, data manipulation language, schemas, data independence, and program independence. It went beyond early versions of IDS by adding security features, including the idea of “privacy locks” and included “sub-schemas,” roughly equivalent to views in relational systems, so that different programs could work with specially presented subsets of the overall content of the database.
Although IBM itself refused to support the CODASYL approach, continuing to favor its own IMS with its simple hierarchical data model, many other computer vendors supported its recommendations and eventually produced systems incorporating these features. The most successful CODASYL system, IDMS, came from an independent software company and began as a port of IDS to IBM’s dominant System/360 mainframe platform.
The Legacy of IDS
IDS and CODASYL systems did not use the relational data model, formulated years later by Ted Codd, which underlies today’s dominant SQL database management systems. Instead it introduced what would later be called the “network data model.” This encoded relationships between different kinds of records as a graph, rather than the strict hierarchy enforced by tape systems and some other software packages of the 1960s such as IBM’s later and widely used IMS. The network data model was widely used during the 1970s and 1980s, and commercial database management systems based on this approach were among the most successful products of the mushrooming packaged software industry.
Bachman spoke memorably in his Turing Award lecture of the “Programmer as Navigator,” charting a path through the database from one record to another. The IDS approach required programmers to work with records one at a time. Performing the same operation on multiple records mean retrieving a retrieving a record, processing and if necessary updating it, and then moving on to the next record of interest to repeat the process. For some tasks this made programs longer and more cumbersome than the equivalent in a relational system, where a task such as deleting all records more than a year old or adding 10% to the sales price of every item could be performed with a single command. In addition, IDS and other network systems encoded what we now think of as the “joins” between different kinds of records as part of the database structure rather than specifying them in each query. This made IDS much less flexible than later relational systems, but also much simpler to implement and more efficient for routine operations.
This drawing, from the 1962 presentation “IDS: The Information Processing Machine We Need,” shows the use of chains to connect record. The programmer used GET commands to navigate between related records.
IDS was a useful and practical tool for business use in the early 1960s, while relational systems were not commercially available until the 1980s. Relational systems did not become feasible until computers were orders of magnitude more powerful than they had been in 1962 and some extremely challenging implementation issues had been overcome. Even after relational systems were commercialized the two approaches were seen for some time as complementary, with network systems used for high performance transaction processing systems handling routine operations on large numbers of records (for example credit card processing) and relational systems best suited for flexible “decision support” data crunching. Although IDMS is still in use for some few very large applications it, and other database management systems based on Bachman’s network data model, have long since been superseded for new applications and for mainstream computing needs.
Still, without IDS and Bachman’s tireless championing of the ideas it contained the very concept of a “database management system” might never have taken root in the first place. When database specialists look at IDS today it is easy to see its limitations compared to modern systems. Its strengths are easy to miss because its huge influence on the software industry meant that much of what was revolutionary about it in 1962 was soon taken for granted. IDS did more than any other single piece of software to broaden the range of business problems to which computers could usefully be applied and so to usher in today’s world where every administrative transaction involves a flurry of database queries and updates rather than the filing of forms completed in triplicate.
| Blogger’s Profile:
Thomas Haigh is an Associate Professor of Information Studies at the University of Wisconsin–Milwaukee. He chairs SIGCIS, the group for historians of information technology, and has published widely on different aspects of the history of computing. Learn more at www.tomandmaria.com/tom