Big Data should be Interesting Data!
There are various definitions of Big Data; most center around a number of V’s like volume, velocity, variety, veracity – in short: interesting data (interesting in at least one aspect). However, when you look into research papers on Big Data, in SIGMOD, VLDB, or ICDE, the data that you see here in experimental studies is utterly boring. Performance and scalability experiments are often based on the TPC-H benchmark: completely synthetic data with a synthetic workload that has been beaten to death for the last twenty years. Data quality, data cleaning, and data integration studies are often based on bibliographic data from DBLP, usually old versions with less than a million publications, prolific authors, and curated records. I doubt that this is a real challenge for tasks like entity linkage or data cleaning. So where’s the – interesting – data in Big Data research?
Surely, companies have their own interesting data, and industrial labs have access to such data and real-life workloads. However, all this is proprietary and out of reach for academic research. Therefore, many researchers resort to the good old TPC-H benchmark and DBLP records and call it Big Data. Insights from TPC-H and DBLP are, however, usually not generalizable to interesting and truly challenging data and workloads. Yes, there are positive exceptions; I just refer to a general trend.
Looking Across the Fence: Experimental Data in other Research Communities
Now that I got you alerted, let me be constructive. I have also worked in research communities other than database systems: information retrieval, Web and Semantic Web, knowledge management (yes, a bit of AI), and recently also computational linguistics (aka. NLP). These communities have a different mindset towards data resources and their use in experimental work. To them, data resources like Web corpora, annotated texts, or inter-linked knowledge bases are vital assets for conducting experiments and measuring the progress in the field. These are not static benchmarks that are defined once every ten years; rather, relevant resources are continuously crafted and their role in experiments is continuously re-thought. For example, the IR community has new experimental tasks and competitions in the TREC, INEX, and CLEF conferences each year. Computational linguistics has an established culture of including the availability of data resources and experimental data (such as detailed ground-truth annotations) in the evaluation of submissions to their top conferences like ACL, EMNLP, CoNLL, and LREC. Review forms capture this aspect as an important dimension for all papers, not just for a handful of specific papers tagged Experiments & Analyses.
Even the Semantic Web community has successfully created a huge dataset for experiments: the Web of Linked Data consisting of more than 30 Billion RDF triples from hundreds of data sources with entity-level sameAs linkage across sources. What an irony: ten years ago we database folks thought of Semantic Web people as living in the ivory tower, and now they have more data to play with than we (academic database folks) can dream of.
Towards a Culture Shift in Our Community
Does our community lack the creativity and agility that other communities exhibit? I don’t think so. Rather I believe the problem lies in our publication and experimental culture. Aspects of this topic were discussed in earlier posts on the SIGMOD blog, but I want to address a new angle. We have over-emphasized publications as an achievement by itself: our community’s currency is the paper count rather than the intellectual insight and re-usable contribution. Making re-usable software available is appreciated, but it’s a small point in the academic value system when it comes to hiring, tenure, or promotion decisions. Contributing data resources plays an even smaller role. We need to change this situation by rewarding work on interesting data resources (and equally on open-source software): compiling the data, making it available to the community, and using it in experiments.
There are plenty of good starting points. The Web of Linked Data, with general-purpose knowledge bases (DBpedia, Freebase, Yago) and a wealth of thematically focused high-quality sources (e.g., musicbrainz, geospecies, openstreetmap, etc.), is a great opportunity. This data is huge, structured but highly heterogeneous, and includes substantial parts of uncertain or incomplete nature. Internet archives and Web tables (embedded in HTML pages) are further examples; enormous amounts of interesting data are easily and legally available by crawling or download. Finally, in times when energy, traffic, environment, health, and general sustainability are key challenges on our planet, more and more data by public stakeholders is freely available. Large amounts of structured and statistical data can be accessed at organizations like OECD, WHO, Eurostat, and many others.
Merely pointing to these opportunities is not enough. We must give more incentives that papers do indeed provide new interesting data resources and open-source software. The least thing to do is to extend review reports to include the contribution of novel data and software. A more far-reaching step is to make data and experiments an essential part of the academic currency: how many of your papers contributed data resources, how many contributed open-source software? This should matter in hiring, tenure, and promotion decisions. Needless to say, all this applies to non-trivial, value-adding data resource contributions. Merely converting a relational database into another format is not a big deal.
I believe that computational linguistics is a great role model for experimental culture and the value of data. Papers in premier conferences earn extra credit when accompanied with data resources, and there are highly reputed conferences like LREC which are dedicated to this theme. Moreover, papers of this kind or even the data resources themselves are frequently cited. Why don’t we, the database community, adopt this kind of culture and give data and data-driven experiments the role that they deserve in the era of Big Data?
Is the Grass Always Greener on the Other Side of the Fence?
Some people may argue that rapidly changing setups for data-driven experiments are not viable in our community. In the extreme, every paper could come with its own data resources, making it harder to ensure the reproducibility of experimental results. So we should better stick to established benchmarks like TPC-H and DBLP author cleaning. This is the opponent’s argument. I think the argument that more data resources hinder repeatability is flawed and merely a cheap excuse. Rather, a higher rate of new data resources and experimental setups goes very well with calling upon the authors’ obligation to ensure reproducible results. The key is to make the publication of data and full details of experiments mandatory. This could be easily implemented in the form of supplementary material that accompanies paper submissions and, for accepted papers, would also be archived in the publication repository.
Another argument could be that Big Data is too big to effectively share. However, volume is only one of the criteria for making a dataset Big Data, that is, interesting for research. We can certainly make 100 Gigabytes available for download, and organizations like NIST (running TREC), LDC (hosting NLP data), and the Internet Archive prove that even Terabytes can be shared by asking interested teams to pay a few hundred dollars for shipping disks.
A caveat that is harder to counter is that real-life workloads are so business-critical that they can impossibly be shared. Yes, there were small scandals about query-and-click logs from search engines as they were not properly anonymized. However, the fact that engineers did not do a good job in these cases does not mean that releasing logs and workloads is out of the question. Why would it be impossible to publish a small representative sample of analytic queries over Internet traffic data or advertisement data? Moreover, if we focus on public data hosted by public services, wouldn’t it be easy to share frequently posed queries?
Finally, a critical issue to ponder on is the position of industrial research labs. In the SIGMOD repeatability discussion a few years ago, they made it a point that software cannot be disclosed. Making experimental data available is a totally different issue, and would actually avoid the problem with proprietary software. Unfortunately, we sometimes see papers from industrial labs that show impressive experiments, but don’t give details nor any data and leave zero chance for others to validate the papers’ findings. Such publications that crucially hinge on non-disclosed experiments violate a major principle of good science: the falsifiability of hypotheses, as formulated by the Austrian-British philosopher Karl Popper. So what should industrial research groups do (in my humble opinion)? They should use public data in experiments and/or make their data public (e.g., in anonymized or truncated form, but in the same form that is used in the experiments). Good examples in the past include the N-gram corpora that Microsoft and Google released. Papers may use proprietary data in addition, but when a paper’s contribution lives or dies with a large non-disclosed experiment, the paper cannot be properly reviewed by the open research community. For such papers, which can still be insightful, conferences have industrial tracks.
Last but not least, who could possibly act on this? Or is all this merely public whining, without addressing any stakeholders? An obvious answer is that the steering boards and program chairs of our conferences should reflect and discuss these points. It should not be a complex maneuver to extend the reviewing criteria for the research tracks of SIGMOD, VLDB, ICDE, etc. This would be a step in the right direction. Of course, truly changing the experimental culture in our community and influencing the scholarly currency in the academic world is a long-term process. It is a process that affects all of us, and should be driven by each of you. Give this some thought when writing your next paper with data-intensive experiments.
The above considerations are food for thought, not a recipe. If you prefer a concise set of tenets and recommendations at the risk of oversimplification, here is my bottom line:
Overall, we need a culture shift to encourage more work on interesting data for experimental research in the Big Data wave.
| Blogger’s Profile:
Gerhard Weikum Gerhard Weikum is a Research Director at the Max-Planck Institute for Informatics (MPII) in Saarbruecken, Germany, where he is leading the department on databases and information systems. He is also an adjunct professor in the Department of Computer Science of Saarland University in Saarbruecken, Germany, and he is a principal investigator of the Cluster of Excellence on Multimodal Computing and Interaction. Earlier he held positions at Saarland University in Saarbruecken, Germany, at ETH Zurich, Switzerland, at MCC in Austin, Texas, and he was a visiting senior researcher at Microsoft Research in Redmond, Washington. He received his diploma and doctoral degrees from the University of Darmstadt, Germany.
Computer science publication culture and practices has become an active discussion topic. Moshe Vardi has written a number of editorials in Communications of ACM on the topic that can be found here and here, and these have generated considerable discussion. The conversation on this issue has been expanding and Jagadish has collected the writings on Scholarly Publications for CRA, which is a valuable resource.
The database community has pioneered discussions on publication issues. We have had panels at conferences, discussions during business meetings, informal conversations during conferences, discussions within SIGMOD Executive and the VLDB Endowment Board – we have been at this since about 2000. I wrote about one aspect of this back in 2002 in my SIGMOD Chair‘s message.
The initial conversation in the database community was due to the significant increase in the number of submitted papers to our conferences that we were experiencing year-after-year. The increasing number of submissions had started to severely stress our ability to meaningfully manage the conference reviewing process. It became quite clear, quite quickly, to a number of us that the overriding problem was our over-reliance on conferences that were not designed to fulfill the role that we were pushing them to play: being the final archival publication venues. I argued this point in my 2002 SIGMOD Chair’s message that I mentioned above. I ended that message by stating that we “have been very successful over the years in convincing tenure and promotion committees and university bodies about the value of the conferences (rightfully so), we now have to convince ourselves that journals are equally valuable and important venues to publish fuller research results.” The same topic was the focus of my presentation on the panel on “Paper and Proposal Reviews: Is the Process Flawed?” that Hank Korth organized at the 2008 CRA Snowbird Conference (the report of the panel appeared in SIGMOD Record and can be accessed here).
This discussion needs to start with our objectives. In an ideal world, what we want are:
The conventional wisdom is that conferences are superior on the first two points and the third point is something we can tinker with (and we have been tinkering with for quite a while with mixed results) while the fourth objective is addressed by a combination of increasing conference paper page limits, decreasing font sizes so we can pack more material per page, and the practice of submitting fuller versions of conference papers to journals. Data suggest that the first issue does not hold – our top journals now have first round review times that are competitive with “traditional” conferences (e.g., SIGMOD and ICDE). The second issue can be addressed by adopting a publication business model that relies primarily on on-line dissemination with print copies released once per volume – this way you don’t wait for print processing, nor do you have to worry about page budgets and the like. Note that I am not talking about “online-first” models, but actually publishing the final version of the paper online as soon as the final version can be produced after acceptance. Journals perform much better on the last two points.
In my view, in the long run, we will follow other science and engineering disciplines and start treating journals as the main outlet for disseminating our research results. However, the road from here to there is not straightforward and there are a number of alternatives that we can follow. Accepting the fact that we, as a community, are not yet willing to give up on the conference model of publication, what are some of the measures we can take? Here are some suggestions:
These are things that we currently do – Proceedings of VLDB (PVLDB) incorporates these suggestions. It represents the current thinking of the VLDB Endowment Board after many years of discussions. Although I had some reservations at the beginning, I have become convinced that it is better than our traditional conferences. However, I am suggesting going further:
As I said earlier, my personal belief is that we will eventually shift our focus to journal publications. What I outlined above is a set of policies we can adopt to move in that direction. For an open membership organization such as SIGMOD, making major changes such as these requires full engagement of the membership. I hope we start discussing.
| Blogger’s Profile:
M. Tamer Özsu is Professor of Computer Science at the David R. Cheriton School of Computer Science of the University of Waterloo. He was the Director of the Cheriton School of Computer Science from January 2007 to June 2010. His research is in data management focusing on large-scale data distribution and management of non-traditional data. His publications include the book Principles of Distributed Database Systems (with Patrick Valduriez), which is now in its third edition. He has also edited, with Ling Liu, the Encyclopedia of Database Systems. He serves as the Series Editor of Synthesis Lectures on Data Management (Morgan & Claypool) and on the editorial boards of three journals, and two book Series. He is a Fellow of the Association for Computing Machinery (ACM), and of the Institute of Electrical and Electronics Engineers (IEEE), and a member of Sigma Xi.