March 6, 2013
There are various definitions of Big Data; most center around a number of V’s like volume, velocity, variety, and veracity – in short: interesting data (interesting in at least one respect). However, when you look into research papers on Big Data, in SIGMOD, VLDB, or ICDE, the data that you see in experimental studies is utterly boring. Performance and scalability experiments are often based on the TPC-H benchmark: completely synthetic data with a synthetic workload that has been beaten to death for the last twenty years. Data quality, data cleaning, and data integration studies are often based on bibliographic data from DBLP, usually old versions with fewer than a million publications, prolific authors, and curated records. I doubt that this poses a real challenge for tasks like entity linkage or data cleaning. So where’s the – interesting – data in Big Data research?
Surely, companies have their own interesting data, and industrial labs have access to such data and real-life workloads. However, all this is proprietary and out of reach for academic research. Therefore, many researchers resort to the good old TPC-H benchmark and DBLP records and call it Big Data. Insights from TPC-H and DBLP are, however, usually not generalizable to interesting and truly challenging data and workloads. Yes, there are positive exceptions; I just refer to a general trend.
Now that I have your attention, let me be constructive. I have also worked in research communities other than database systems: information retrieval, Web and Semantic Web, knowledge management (yes, a bit of AI), and recently also computational linguistics (a.k.a. NLP). These communities have a different mindset towards data resources and their use in experimental work. To them, data resources like Web corpora, annotated texts, or inter-linked knowledge bases are vital assets for conducting experiments and measuring progress in the field. These are not static benchmarks that are defined once every ten years; rather, relevant resources are continuously crafted, and their role in experiments is continuously re-thought. For example, the IR community runs new experimental tasks and competitions in the TREC, INEX, and CLEF conferences each year. Computational linguistics has an established culture of including the availability of data resources and experimental data (such as detailed ground-truth annotations) in the evaluation of submissions to its top conferences like ACL, EMNLP, CoNLL, and LREC. Review forms capture this aspect as an important dimension for all papers, not just for a handful of specific papers tagged Experiments & Analyses.
Even the Semantic Web community has successfully created a huge dataset for experiments: the Web of Linked Data, consisting of more than 30 billion RDF triples from hundreds of data sources with entity-level sameAs linkage across sources. What an irony: ten years ago we database folks thought of Semantic Web people as living in the ivory tower, and now they have more data to play with than we (academic database folks) can dream of.
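To make the linkage aspect concrete, here is a minimal sketch, in plain Python with made-up URIs, of how entity-level owl:sameAs links allow facts about the same entity to be merged across Linked Data sources. Real systems would of course use an RDF store and SPARQL rather than tuples; this only illustrates the idea:

```python
from collections import defaultdict

# Toy RDF triples from two hypothetical Linked Data sources.
# The URIs are illustrative, not real dereferenceable identifiers.
TRIPLES = [
    ("dbp:Bob_Dylan", "dbp:birthPlace", "dbp:Duluth"),
    ("mb:artist/72c5", "mb:recorded", "mb:release/9a1"),
    # Entity-level sameAs link connecting the two sources:
    ("dbp:Bob_Dylan", "owl:sameAs", "mb:artist/72c5"),
]

def sameas_classes(triples):
    """Compute equivalence classes of URIs under owl:sameAs (undirected closure)."""
    adj = defaultdict(set)
    for s, p, o in triples:
        if p == "owl:sameAs":
            adj[s].add(o)
            adj[o].add(s)
    seen, classes = set(), {}
    for node in adj:
        if node in seen:
            continue
        # Traverse the sameAs graph to collect one equivalence class.
        stack, cls = [node], set()
        while stack:
            n = stack.pop()
            if n in cls:
                continue
            cls.add(n)
            stack.extend(adj[n])
        rep = min(cls)  # pick a canonical representative
        for n in cls:
            classes[n] = rep
        seen |= cls
    return classes

def merged_facts(triples):
    """Rewrite subjects to canonical representatives, merging facts across sources."""
    classes = sameas_classes(triples)
    facts = defaultdict(set)
    for s, p, o in triples:
        if p != "owl:sameAs":
            facts[classes.get(s, s)].add((p, o))
    return facts

facts = merged_facts(TRIPLES)
# Both sources' facts now attach to one canonical entity.
```

At Web scale, the uncertain and incomplete nature of these links is exactly what makes the Web of Linked Data an interesting experimental dataset.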
Does our community lack the creativity and agility that other communities exhibit? I don’t think so. Rather, I believe the problem lies in our publication and experimental culture. Aspects of this topic were discussed in earlier posts on the SIGMOD blog, but I want to address a new angle. We have over-emphasized publications as achievements in themselves: our community’s currency is the paper count rather than intellectual insight and re-usable contributions. Making re-usable software available is appreciated, but it carries little weight in the academic value system when it comes to hiring, tenure, or promotion decisions. Contributing data resources plays an even smaller role. We need to change this situation by rewarding work on interesting data resources (and equally on open-source software): compiling the data, making it available to the community, and using it in experiments.
There are plenty of good starting points. The Web of Linked Data, with general-purpose knowledge bases (DBpedia, Freebase, Yago) and a wealth of thematically focused high-quality sources (e.g., musicbrainz, geospecies, openstreetmap, etc.), is a great opportunity. This data is huge, structured but highly heterogeneous, and includes substantial parts of uncertain or incomplete nature. Internet archives and Web tables (embedded in HTML pages) are further examples; enormous amounts of interesting data are easily and legally available by crawling or download. Finally, in times when energy, traffic, environment, health, and general sustainability are key challenges on our planet, more and more data by public stakeholders is freely available. Large amounts of structured and statistical data can be accessed at organizations like OECD, WHO, Eurostat, and many others.
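As a toy illustration of the Web-tables idea mentioned above, the sketch below uses Python’s standard html.parser to pull the rows out of a table embedded in an HTML page. The page content is invented, and a real extractor would have to cope with nested tables, colspans, and boilerplate; this is only the skeleton of the technique:

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the rows of <table> markup in an HTML page (a minimal sketch;
    real Web-table extraction must handle nesting, colspans, and noise)."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside an open cell belongs to the table.
        if self._in_cell and self._row:
            self._row[-1] += data.strip()

# Invented page fragment standing in for a crawled HTML document.
HTML = """<html><body><p>GDP figures</p>
<table><tr><th>Country</th><th>GDP</th></tr>
<tr><td>A</td><td>100</td></tr></table></body></html>"""

parser = TableExtractor()
parser.feed(HTML)
# parser.rows now holds the header row and the data rows.
```

Applied to a crawl, millions of such extracted tables yield exactly the kind of large, heterogeneous, legally obtainable relational data the post argues for.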
Merely pointing to these opportunities is not enough. We must create stronger incentives for papers to provide new, interesting data resources and open-source software. The least we can do is extend review reports to cover the contribution of novel data and software. A more far-reaching step is to make data and experiments an essential part of the academic currency: how many of your papers contributed data resources, and how many contributed open-source software? This should matter in hiring, tenure, and promotion decisions. Needless to say, all this applies to non-trivial, value-adding data resource contributions; merely converting a relational database into another format is not a big deal.
I believe that computational linguistics is a great role model for experimental culture and the value of data. Papers in premier conferences earn extra credit when accompanied with data resources, and there are highly reputed conferences like LREC which are dedicated to this theme. Moreover, papers of this kind or even the data resources themselves are frequently cited. Why don’t we, the database community, adopt this kind of culture and give data and data-driven experiments the role that they deserve in the era of Big Data?
Some people may argue that rapidly changing setups for data-driven experiments are not viable in our community. In the extreme, every paper could come with its own data resources, making it harder to ensure the reproducibility of experimental results; so we had better stick to established benchmarks like TPC-H and DBLP author cleaning. This is the opponent’s argument. I think the argument that more data resources hinder repeatability is flawed, merely a cheap excuse. Rather, a higher rate of new data resources and experimental setups goes very well with calling upon the authors’ obligation to ensure reproducible results. The key is to make the publication of data and full details of experiments mandatory. This could easily be implemented in the form of supplementary material that accompanies paper submissions and, for accepted papers, would also be archived in the publication repository.
Another argument could be that Big Data is too big to share effectively. However, volume is only one of the criteria that make a dataset Big Data, that is, interesting for research. We can certainly make 100 gigabytes available for download, and organizations like NIST (running TREC), LDC (hosting NLP data), and the Internet Archive prove that even terabytes can be shared by asking interested teams to pay a few hundred dollars for shipping disks.
A caveat that is harder to counter is that real-life workloads are so business-critical that they cannot possibly be shared. Yes, there were small scandals about query-and-click logs from search engines that were not properly anonymized. However, the fact that engineers did not do a good job in those cases does not mean that releasing logs and workloads is out of the question. Why would it be impossible to publish a small representative sample of analytic queries over Internet traffic data or advertisement data? Moreover, if we focus on public data hosted by public services, wouldn’t it be easy to share frequently posed queries?
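A hedged sketch of what “properly anonymized” could mean for a query log: pseudonymize user IDs with a keyed hash and suppress rare, potentially identifying queries. The key, the frequency threshold, and the log entries below are purely illustrative, and a production release would need a much more careful privacy analysis:

```python
import hashlib
import hmac
from collections import Counter

# Illustrative secret key: kept private by the publisher and destroyed
# after the release, so pseudonyms cannot be reversed or re-linked.
SECRET = b"rotate-and-discard-this-key"

def pseudonymize(user_id: str) -> str:
    """Replace a user ID by a keyed hash (HMAC-SHA256, truncated)."""
    return hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()[:12]

def release_log(log, min_freq=2):
    """Pseudonymize users and drop queries rarer than min_freq, since unique
    queries are themselves identifying (the lesson of past log scandals)."""
    freq = Counter(q for _, q in log)
    return [(pseudonymize(u), q) for u, q in log if freq[q] >= min_freq]

# Invented log entries standing in for a real query-and-click log.
LOG = [
    ("alice", "weather berlin"),
    ("bob", "weather berlin"),
    ("alice", "my-rare-disease clinic 12345"),
]
safe = release_log(LOG)
# The unique, potentially identifying query is dropped; user IDs are opaque.
```

Even this crude recipe shows that “cannot be shared at all” and “was once shared badly” are different claims.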
Finally, a critical issue to ponder is the position of industrial research labs. In the SIGMOD repeatability discussion a few years ago, they made it a point that software cannot be disclosed. Making experimental data available is a totally different issue, and would actually avoid the problem with proprietary software. Unfortunately, we sometimes see papers from industrial labs that show impressive experiments but give neither details nor data, leaving zero chance for others to validate the papers’ findings. Such publications that crucially hinge on non-disclosed experiments violate a major principle of good science: the falsifiability of hypotheses, as formulated by the Austrian-British philosopher Karl Popper. So what should industrial research groups do (in my humble opinion)? They should use public data in experiments and/or make their data public (e.g., in anonymized or truncated form, but in the same form that is used in the experiments). Good examples in the past include the N-gram corpora that Microsoft and Google released. Papers may use proprietary data in addition, but when a paper’s contribution lives or dies with a large non-disclosed experiment, the paper cannot be properly reviewed by the open research community. For such papers, which can still be insightful, conferences have industrial tracks.
Last but not least, who could possibly act on this? Or is all this merely public whining, without addressing any stakeholders? An obvious answer is that the steering boards and program chairs of our conferences should reflect and discuss these points. It should not be a complex maneuver to extend the reviewing criteria for the research tracks of SIGMOD, VLDB, ICDE, etc. This would be a step in the right direction. Of course, truly changing the experimental culture in our community and influencing the scholarly currency in the academic world is a long-term process. It is a process that affects all of us, and should be driven by each of you. Give this some thought when writing your next paper with data-intensive experiments.
The above considerations are food for thought, not a recipe. If you prefer a concise set of tenets and recommendations at the risk of oversimplification, here is my bottom line:
1. Academic research on Big Data is excessively based on boring data and nearly trivial workloads. On the other hand, Big Data research aims to obtain insights from interesting data and cope with demanding workloads. This is a striking mismatch.
2. Other communities, like IR, NLP, or Web research, have a much richer and more agile culture of creating, disseminating, and re-using interesting new data resources for scientific experimentation. Research that provides new data resources (and software) earns extra credit.
3. We should follow that path and give more weight to data resources, open-source software, and experimental details that follow-up research can build on. Supplementary material that accompanies publications is one way to pursue this.
4. In addition, we need to incentivize and reward the creation and contribution of data resources as an asset for the research community. This should affect the value system used for paper refereeing and also for hiring, tenure, and promotion decisions.
Overall, we need a culture shift to encourage more work on interesting data for experimental research in the Big Data wave.
Gerhard Weikum Gerhard Weikum is a Research Director at the Max-Planck Institute for Informatics (MPII) in Saarbruecken, Germany, where he is leading the department on databases and information systems. He is also an adjunct professor in the Department of Computer Science of Saarland University in Saarbruecken, Germany, and he is a principal investigator of the Cluster of Excellence on Multimodal Computing and Interaction. Earlier he held positions at Saarland University in Saarbruecken, Germany, at ETH Zurich, Switzerland, at MCC in Austin, Texas, and he was a visiting senior researcher at Microsoft Research in Redmond, Washington. He received his diploma and doctoral degrees from the University of Darmstadt, Germany.
Copyright © 2013, Gerhard Weikum. All rights reserved.
I couldn’t agree more. In the database group at UW, we are building repositories of use-cases for the purpose of facilitating experimental evaluations of Big Data systems and techniques. We have a repository of various Hadoop workloads here (jointly with Duke):
http://nuage.cs.washington.edu/repository.php
We are now putting together a larger-scale repository (with larger-scale use-cases). It will be available through here:
http://myria.cs.washington.edu/
We plan to post both real use-cases and various benchmarks since the latter also serve a purpose.
Very interesting blog – thanks for sharing!
I have one comment regarding the “ivory tower”: the Semantic Web is about transforming the Web by creating the foundations for linked data and networked knowledge – and getting them used. Like many new ideas, the Semantic Web effort was often not taken seriously initially. Explaining the Semantic Web in 2001 was equivalent to traveling back in time and explaining how the Web works to the world of 1989. Now it has reached critical mass, fueling self-propelled growth. So the Semantic Web was not in an ivory tower: it had good ideas which needed to transition into reality, since it is, like the Web, a social technology exploiting Metcalfe’s law.
Thanks, Stefan. This remark was meant ironically: “we database folks thought …”. I hope it didn’t come across as offensive. I would equally say that in the early seventies, DB researchers thought of the relational model as being in the ivory tower forever. Hope this is clearer now.
Excellent post! I’ll comment on one narrow misconception: “Surely, companies have their own interesting data, and industrial labs have access to such data and real-life workloads”
Alas, this access has been far from sure in my experience. At MITRE, the data often belongs to our government customers, who, because of legal constraints, security policy, or simple risk aversion, are reluctant to share. For example, it can be illegal for the government to share data for research purposes with an outsider, or doing so can require great legal contortions.
Even within a company, the operating divisions may own the data, and may be reluctant to share it with the research lab.
Making the data available publicly is an order of magnitude harder, of course.
Thanks, Arnie, for pointing this out. So even for some branches of industry, it would be important to have more interesting data resources publicly available. This would not be proprietary company data made public (which I agree is awfully hard, if not impossible), but other kinds of interesting data that are already public, like data on energy, health, etc.
I also agree with Gerhard very much. Yet, an obvious difference between creating the TPC benchmark and creating a gold standard for DBLP author-cleansing is that the latter cannot be automated.
But there might be a solution: we should have a look at other communities, such as psychology and many natural sciences, where it is often a requirement for a bachelor’s or master’s degree that students participate as subjects in behavioral experiments, gather data on field trips (geography), or assist in lab experiments. Essentially, the students are gathering data points for research as part of their curriculum. The same could be done in CS, especially in fields where only experts can judge the quality of an experimental result.
Gerhard, a hearty “Hear, hear” – many good points. I think that the emerging web of data from governments and the like (we have collected metadata for going on 1.2M web-accessible datasets) is becoming a great playground for research. But these tend not to be social data; some good datasets representative of what is seen in real social web spaces would also be a great asset. Sharing of existing assets would also be a plus: Wendy Hall has been leading an effort for a “Web Science Observatory” to make it possible to collect and share datasets about Web use. Interestingly, the most resistance has come from the traditional database community… (see http://webscience.org re: observatory)
If I think back to 1995, when we (the organization where I worked) ran joins over large numbers of records on a machine with a fraction of the power of a modern desktop: we recoded about half the columns as integers, drastically reduced the volume of the data, and did our comparisons and filtering on integers of one, two, or four bytes and on bit fields; a process projected to take three hundred years was reduced to six hours. I later moved up in the world to look at hundreds of gigabytes of data in a data warehouse. At the time, I believe, the FBI was working with petabytes, having been out of the “ivory towers” for years. I saw one little piece of information in KDD 2010 about an algorithm faster than page shrinkage; yes, that was cool. But business is playing Chicken Little again, and it shows up in the ACM: instead of statisticians, you now have data scientists. Rhetorically that was a cheap shot, but none of this is so astonishingly new; Argonne National Laboratory has long dealt with large amounts of data. If I went further, I would need to supply footnotes. I hope members of the ACM, unlike many people I have encountered in business, can tolerate a negative point of view instead of always rooting for the team. If somebody pays attention to this and wants verification, holler.