“Both theoretical and empirical research may be unnecessarily complicated by failure to recognize the effects of heterogeneity” – Vaupel & Yashin
Big Data is a daily topic of conversation among data analysts, with much said and written about its promises and pitfalls. The issue of heterogeneity, however, has received scant attention. This is unfortunate, since failing to take heterogeneity into account can easily derail the discoveries one makes using these data.
This issue, which some may recognize as an example of the ecological fallacy, first came to my attention via a paper elegantly titled “Heterogeneity’s Ruses: Some Surprising Effects of Selection on Population Dynamics” (Vaupel and Yashin, 1985). The authors discuss a variety of examples in which the aggregated behavior of a heterogeneous population, composed of two homogeneous but differently behaving subpopulations, differs from the behavior of any single individual. Consider the following example. It has been observed that the recidivism rate of convicts released from prison declines with time. A natural conclusion one may reach from this observation is that former convicts are less likely to commit crime as they age. However, this conclusion need not be true. In reality, there may be two groups of individuals, “reformed” and “incorrigible”, with constant – but different – recidivism rates. With time, more “reformed” individuals are left in the population, as the “incorrigibles” are sent back to prison, resulting in a decreasing recidivism rate for the population as a whole. This simple example shows that “the patterns observed [at the population level] may be surprisingly different from the underlying patterns on the individual level. Researchers interested in uncovering these individual patterns, perhaps to help develop or test theories or to make predictions, might benefit from an understanding of heterogeneity’s ruses” (Vaupel & Yashin).
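To see the ruse in numbers, here is a minimal sketch (my own illustration; the rates and population mix are invented, not taken from Vaupel & Yashin) in which each subpopulation has a constant recidivism rate, yet the rate observed for the population as a whole declines:

```python
# Two subpopulations with *constant* recidivism rates produce a *declining*
# aggregate rate, because the high-rate group is removed from the population first.
import numpy as np

rate_reformed, rate_incorrigible = 0.01, 0.20   # hypothetical monthly rates
share_reformed, share_incorrigible = 0.5, 0.5   # hypothetical initial mix

months = np.arange(0, 61)
# Fraction of each group still out of prison at each month (constant hazard).
out_reformed = share_reformed * np.exp(-rate_reformed * months)
out_incorrigible = share_incorrigible * np.exp(-rate_incorrigible * months)

# Recidivism rate among those still out of prison, aggregated over both groups.
aggregate_rate = (rate_reformed * out_reformed + rate_incorrigible * out_incorrigible) / (
    out_reformed + out_incorrigible
)

for t in (0, 12, 24, 60):
    print(f"month {t:2d}: population-level rate = {aggregate_rate[t]:.3f}")
# The printed rate falls from about 0.105 toward 0.010, even though no
# individual's rate ever changes.
```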
My colleagues and I have been tricked by heterogeneity time and again. As one example, our study of information spread on the follower graphs of Twitter and Digg revealed that it was surprisingly different from the simple epidemics that are often used to model information spread. In a simple epidemic, described, for example, by the independent cascade model, the probability of infection increases monotonically with the number of exposures to infected friends. This probability is measured by the exposure response function. Figure 1 (below) shows the exposure response function we measured on Twitter: the probability of becoming infected with (i.e., retweeting) a piece of information (a URL) as a function of how many friends had previously tweeted it. In contrast to epidemics, it appears as though repeated exposure to information suppresses infection probability. We measured an even more pronounced suppression of infection on Digg [Ver Steeg et al, 2011], and a similar exposure response was observed for the adoption of hashtags following friends’ use of them [Romero et al, 2011].
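For contrast, here is a tiny sketch (my own illustration, not code from any of the cited papers) of the exposure response that a simple independent cascade predicts: each exposure is an independent infection attempt with some probability p, so the curve can only rise with the number of exposures.

```python
# Exposure response implied by the independent cascade model: after n independent
# exposures, each succeeding with probability p, the infection probability is
# 1 - (1 - p)^n, which increases monotonically in n.
def cascade_exposure_response(p: float, n_exposures: int) -> float:
    return 1.0 - (1.0 - p) ** n_exposures

for n in range(1, 11):
    print(n, round(cascade_exposure_response(0.05, n), 3))
# A monotone increase; the flattened (or declining) curves measured on Twitter and
# Digg look nothing like this, which is what invited the "inoculation" misreading.
```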
It is easy to draw wrong conclusions from this finding. In “What stops social epidemics?” [Ver Steeg et al, 2011], we reported that information spread on Digg is quickly extinguished, and attributed this to the exposure response function. We speculated that initial exposures “inoculate” users to information, so that they will not become infected (i.e., propagate it) despite multiple exposures. Now we know this explanation was completely wrong.
The exposure response function, because it aggregates over all users, does not describe the behavior of any individual Digg or Twitter user – not even a hypothetical “typical” user. In fact, there is no “typical” Twitter (or Digg) user. Twitter users are extremely heterogeneous. Separating them into more homogeneous sub-populations reveals a more regular pattern. Figure 2 shows the exposure response function for different populations of Twitter users, separated according to the number of friends they follow (large fluctuations are the result of small sample sizes). Why the number of friends? This is explained in more detail in our papers [Hodas & Lerman 2012, 2013], but in short, we found it useful to separate users according to their cognitive load, i.e., the volume of information they receive, which is (on average) proportional to the number of friends they follow [Hodas et al, 2013]. Now the probability that a user within each population will become infected increases monotonically with the number of infected friends, very much like the predictions of the independent cascade model. A sketch of this stratification appears below.
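The stratification itself is mechanically simple. Below is a hedged sketch assuming a hypothetical DataFrame of exposure records; the column names and friend-count bins are mine, not the ones used in our papers.

```python
# Sketch: compute the exposure response function separately for user populations
# stratified by the number of friends they follow.
import pandas as pd

def stratified_exposure_response(exposures: pd.DataFrame) -> pd.DataFrame:
    """`exposures` is assumed to have one row per (user, URL) pair with columns:
    n_friends (friends the user follows), n_exposures (friends who tweeted the URL
    before the user), retweeted (0/1)."""
    bins = [0, 100, 500, 2000, float("inf")]            # hypothetical strata
    labels = ["1-100", "101-500", "501-2000", ">2000"]
    exposures = exposures.assign(
        friend_bin=pd.cut(exposures["n_friends"], bins=bins, labels=labels)
    )
    # Within each (stratum, exposure count) cell, the empirical infection
    # probability is the fraction of exposed users who retweeted.
    return (
        exposures.groupby(["friend_bin", "n_exposures"], observed=True)["retweeted"]
        .mean()
        .rename("p_retweet")
        .reset_index()
    )
# Aggregating over all users mixes the strata and reproduces the misleading curve of
# Figure 1; plotting each stratum separately recovers the monotone curves of Figure 2.
```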
Figure 2 has a different, more significant interpretation, with consequences for information diffusion. It suggests that highly connected users, i.e., those who follow many others, are less susceptible to becoming infected. Their decreased susceptibility in fact explains Figure 1: as one moves to the right of the exposure response curve, only the better-connected, and less susceptible, users contribute to that portion of the response. However, despite their reduced susceptibility, highly connected users respond positively to repeated exposures, like all other users. You do not inhibit response by repeatedly exposing people to information. Instead, the reason these users are less susceptible hinges on the human brain’s limited bandwidth. There are only so many tweets anyone can read: the more tweets you receive (on average, proportional to the number of friends you follow), the less likely you are to see – and retweet – any specific tweet. Had we not recognized heterogeneity, we would not have found this far more interesting explanation.
Blogger’s Profile:
Kristina Lerman is a Project Leader at the University of Southern California Information Sciences Institute and holds a joint appointment as a Research Associate Professor in the USC Computer Science Department. After a brief stint as a theoretical roboticist, she found her calling in blending together methods from physics, computer science and social science to address problems in social computing and social media analysis. She writes many papers that are greatly enjoyed by all of their twenty readers.
Big Data should be Interesting Data!
There are various definitions of Big Data; most center around a number of V’s like volume, velocity, variety, veracity – in short: interesting data (interesting in at least one aspect). However, when you look into research papers on Big Data, in SIGMOD, VLDB, or ICDE, the data you see there in experimental studies is utterly boring. Performance and scalability experiments are often based on the TPC-H benchmark: completely synthetic data with a synthetic workload that has been beaten to death for the last twenty years. Data quality, data cleaning, and data integration studies are often based on bibliographic data from DBLP, usually old versions with fewer than a million publications, prolific authors, and curated records. I doubt that this poses a real challenge for tasks like entity linkage or data cleaning. So where’s the – interesting – data in Big Data research?
Surely, companies have their own interesting data, and industrial labs have access to such data and real-life workloads. However, all this is proprietary and out of reach for academic research. Therefore, many researchers resort to the good old TPC-H benchmark and DBLP records and call it Big Data. Insights from TPC-H and DBLP are, however, usually not generalizable to interesting and truly challenging data and workloads. Yes, there are positive exceptions; I just refer to a general trend.
Looking Across the Fence: Experimental Data in other Research Communities
Now that I have your attention, let me be constructive. I have also worked in research communities other than database systems: information retrieval, Web and Semantic Web, knowledge management (yes, a bit of AI), and recently also computational linguistics (a.k.a. NLP). These communities have a different mindset towards data resources and their use in experimental work. To them, data resources like Web corpora, annotated texts, or inter-linked knowledge bases are vital assets for conducting experiments and measuring progress in the field. These are not static benchmarks that are defined once every ten years; rather, relevant resources are continuously crafted and their role in experiments is continuously re-thought. For example, the IR community introduces new experimental tasks and competitions in the TREC, INEX, and CLEF conferences each year. Computational linguistics has an established culture of including the availability of data resources and experimental data (such as detailed ground-truth annotations) in the evaluation of submissions to its top conferences like ACL, EMNLP, CoNLL, and LREC. Review forms capture this aspect as an important dimension for all papers, not just for a handful of specific papers tagged “Experiments & Analyses”.
Even the Semantic Web community has successfully created a huge dataset for experiments: the Web of Linked Data, consisting of more than 30 billion RDF triples from hundreds of data sources with entity-level sameAs linkage across sources. What an irony: ten years ago we database folks thought of Semantic Web people as living in the ivory tower, and now they have more data to play with than we (academic database folks) can dream of.
Towards a Culture Shift in Our Community
Does our community lack the creativity and agility that other communities exhibit? I don’t think so. Rather, I believe the problem lies in our publication and experimental culture. Aspects of this topic were discussed in earlier posts on the SIGMOD blog, but I want to address a new angle. We have over-emphasized publications as achievements in themselves: our community’s currency is the paper count rather than intellectual insight and re-usable contributions. Making re-usable software available is appreciated, but it counts for little in the academic value system when it comes to hiring, tenure, or promotion decisions. Contributing data resources plays an even smaller role. We need to change this situation by rewarding work on interesting data resources (and equally on open-source software): compiling the data, making it available to the community, and using it in experiments.
There are plenty of good starting points. The Web of Linked Data, with general-purpose knowledge bases (DBpedia, Freebase, Yago) and a wealth of thematically focused high-quality sources (e.g., musicbrainz, geospecies, openstreetmap, etc.), is a great opportunity. This data is huge, structured but highly heterogeneous, and includes substantial parts of uncertain or incomplete nature. Internet archives and Web tables (embedded in HTML pages) are further examples; enormous amounts of interesting data are easily and legally available by crawling or download. Finally, in times when energy, traffic, environment, health, and general sustainability are key challenges on our planet, more and more data by public stakeholders is freely available. Large amounts of structured and statistical data can be accessed at organizations like OECD, WHO, Eurostat, and many others.
Merely pointing to these opportunities is not enough. We must create stronger incentives for papers to provide new, interesting data resources and open-source software. The least we can do is extend review reports to cover the contribution of novel data and software. A more far-reaching step is to make data and experiments an essential part of the academic currency: how many of your papers contributed data resources, how many contributed open-source software? This should matter in hiring, tenure, and promotion decisions. Needless to say, all this applies to non-trivial, value-adding data resource contributions. Merely converting a relational database into another format is not a big deal.
I believe that computational linguistics is a great role model for experimental culture and the value of data. Papers in premier conferences earn extra credit when accompanied with data resources, and there are highly reputed conferences like LREC which are dedicated to this theme. Moreover, papers of this kind or even the data resources themselves are frequently cited. Why don’t we, the database community, adopt this kind of culture and give data and data-driven experiments the role that they deserve in the era of Big Data?
Is the Grass Always Greener on the Other Side of the Fence?
Some people may argue that rapidly changing setups for data-driven experiments are not viable in our community. In the extreme, every paper could come with its own data resources, making it harder to ensure the reproducibility of experimental results. We had therefore better stick to established benchmarks like TPC-H and DBLP author cleaning; so goes the opposing argument. I think the argument that more data resources hinder repeatability is flawed and merely a cheap excuse. On the contrary, a higher rate of new data resources and experimental setups goes very well with holding authors to their obligation to ensure reproducible results. The key is to make the publication of data and full details of experiments mandatory. This could easily be implemented in the form of supplementary material that accompanies paper submissions and, for accepted papers, would also be archived in the publication repository.
Another argument could be that Big Data is too big to effectively share. However, volume is only one of the criteria for making a dataset Big Data, that is, interesting for research. We can certainly make 100 Gigabytes available for download, and organizations like NIST (running TREC), LDC (hosting NLP data), and the Internet Archive prove that even Terabytes can be shared by asking interested teams to pay a few hundred dollars for shipping disks.
A caveat that is harder to counter is that real-life workloads are so business-critical that they cannot possibly be shared. Yes, there have been scandals about query-and-click logs from search engines that were not properly anonymized. However, the fact that engineers did not do a good job in these cases does not mean that releasing logs and workloads is out of the question. Why would it be impossible to publish a small representative sample of analytic queries over Internet traffic data or advertisement data? Moreover, if we focus on public data hosted by public services, wouldn’t it be easy to share frequently posed queries?
Finally, a critical issue to ponder is the position of industrial research labs. In the SIGMOD repeatability discussion a few years ago, they made the point that software cannot be disclosed. Making experimental data available is a totally different issue, and would actually avoid the problem with proprietary software. Unfortunately, we sometimes see papers from industrial labs that show impressive experiments, but give neither details nor data, leaving others no chance to validate the papers’ findings. Such publications, which crucially hinge on non-disclosed experiments, violate a major principle of good science: the falsifiability of hypotheses, as formulated by the Austrian-British philosopher Karl Popper. So what should industrial research groups do (in my humble opinion)? They should use public data in experiments and/or make their data public (e.g., in anonymized or truncated form, but in the same form that is used in the experiments). Good examples in the past include the N-gram corpora that Microsoft and Google released. Papers may use proprietary data in addition, but when a paper’s contribution lives or dies with a large non-disclosed experiment, the paper cannot be properly reviewed by the open research community. For such papers, which can still be insightful, conferences have industrial tracks.
Last but not least, who could possibly act on this? Or is all this merely public whining, without addressing any stakeholders? An obvious answer is that the steering boards and program chairs of our conferences should reflect and discuss these points. It should not be a complex maneuver to extend the reviewing criteria for the research tracks of SIGMOD, VLDB, ICDE, etc. This would be a step in the right direction. Of course, truly changing the experimental culture in our community and influencing the scholarly currency in the academic world is a long-term process. It is a process that affects all of us, and should be driven by each of you. Give this some thought when writing your next paper with data-intensive experiments.
The above considerations are food for thought, not a recipe. If you prefer a concise set of tenets and recommendations at the risk of oversimplification, here is my bottom line:
Overall, we need a culture shift to encourage more work on interesting data for experimental research in the Big Data wave.
Blogger’s Profile:
Gerhard Weikum is a Research Director at the Max-Planck Institute for Informatics (MPII) in Saarbruecken, Germany, where he is leading the department on databases and information systems. He is also an adjunct professor in the Department of Computer Science of Saarland University in Saarbruecken, Germany, and he is a principal investigator of the Cluster of Excellence on Multimodal Computing and Interaction. Earlier he held positions at Saarland University in Saarbruecken, Germany, at ETH Zurich, Switzerland, at MCC in Austin, Texas, and he was a visiting senior researcher at Microsoft Research in Redmond, Washington. He received his diploma and doctoral degrees from the University of Darmstadt, Germany.
Big Data is the buzzword in the database community these days. Two of the first three blog entries of the SIGMOD blog are on Big Data. There was a plenary research session with invited talks at the 2012 SIGMOD Conference and there will be a panel at the 2012 VLDB Conference. Probably, everything has already been said that can be said. So, let me just add my own personal data point to the sea of existing opinions and leave it to the reader whether I am adding to the “signal” or adding to the “noise”. This blog entry is based on the talk that I gave at SIGMOD 2012; the slides of that talk can be found at http://www.systems.ethz.ch/Talks.
Upfront, I would like to make clear that I am a believer. Stepping back, I ask myself why I work on Big Data technologies, and I came up with two potential reasons: to make the world a better place, and because we can.
In the following, I would like to explain my personal view on these two reasons.
Making the World a Better Place
The real question to ask is whether bigger = smarter. The simple answer is “yes”. The success of services like Google and Bing is evidence for the “bigger = smarter” principle. The more data you have and can process, the higher the statistical relevance of your analysis and the better the answers you get. Furthermore, Big Data allows you to make statements about corner cases and the famous “long tail”. Putting it differently, “experience” is more valuable than “thinking”.
The more complicated answer to the question whether bigger is smarter is “I do not know”. My concern is that the bigger Big Data gets, the more difficult we make it for humans to get involved. Who wants to argue with Google or Bing? In the end, all we can do is trust the machine learning. However, Big Data analytics needs as much debugging as any other software we produce, and how can we help people debug a data-driven experiment with 5 PB of data? Putting it differently, what do you make of an experiment that validates your hypothesis with 5 PB of data but does not validate it with, say, 1 KB of data using the same piece of code? Should we just trust the “bigger = smarter” principle and use the results of the 5 PB experiment to claim victory?
The more fundamental problem is that Big Data technologies tempt us into doing experiments for which we have no ground truth. Often, the absence of a ground truth is the reason for using Big Data: if we knew the answer already, we would not need Big Data. Despite all the mathematical and statistical tools that are available today, however, debugging a program without knowing what the program should be doing is difficult. To give an example: let us assume that a Big Data study revealed that the leftmost lane is the fastest lane in a traffic jam. What does this result mean? Does it mean that we should all be driving in the left lane? Does it mean that people in the left lane are more aggressive? Or does it mean that people in the left lane just believe that they are faster? This example combines all the problems of discovering facts without a ground truth: by asking the question, you are biasing the result. And by getting a result, you might be biasing the future result, too. (And, of course, if you had done the same study only looking at data from Great Britain, you might have come to the opposite conclusion that the rightmost lane is the fastest.)
Google Translate is a counterexample and clearly a Big Data success story: here, we do know the ground truth, and Google developers are able to debug and improve Google Translate based on that ground truth – at least as long as we trust our own language skills more than we trust Google. (When it comes to spelling, I actually already trust Google and Bing more than I trust myself.)
Maybe all I am trying to say is that we need to be more careful about what we promise and not forget to keep the human in the loop. I trust the statisticians that “bigger is smarter”, but I also believe that humans are smarter still, and the combination is what is needed, with each party doing what it is best at.
Because We Can
Unfortunately, we cannot make humans become smarter (and we should not even try), but we can try to make Big Data bigger. Even though I argued in the previous section that it is not always clear that bigger Big Data makes the world a better or smarter place, we as a data management community should be constantly pushing to make Big Data bigger. That is, we should build data management tools that scale, perform well, and are cost effective and get continuously better in all regards. Honestly, I do not know how that will make the world a better place, but I am optimistic that it will: History teaches that good things will happen if you do good work. Also, we should not be shy to make big promises such as processing 100 PB of heterogeneous data in real-time – if that is what our customers want and are willing to pay for. We should also continue to encourage people to collect all the data and then later think about what to do with it. If there are risks in doing all that (e.g., privacy risks), we need to look at those, too, and find ways to reduce those risks and still become better at our core business of becoming bigger, faster, and cheaper. We might not be
There are two things that we need to change, however. First, we need to build systems that are explicit about the utility/cost trade-off of Big Data. Mariposa pioneered this idea in the Nineties; in Mariposa, utility was defined as response time (the faster, the higher the utility), but now things get more complicated: with Big Data, utility may include data quality, data diversity, and other statistical metrics of the data. We need tools and abstractions that allow users to explicitly specify and control these metrics.
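To make this tangible, here is a purely hypothetical sketch of such an abstraction; none of these names come from Mariposa or any existing system, and a real interface would need far more care.

```python
# Hypothetical declarative "contract" in which a user states the utility/cost
# trade-off and leaves it to the system to pick an execution strategy.
from dataclasses import dataclass

@dataclass
class QueryContract:
    max_latency_s: float     # response time, the utility notion pioneered by Mariposa
    min_completeness: float  # fraction of relevant data that must be covered
    max_staleness_h: float   # how old the data may be
    max_cost_usd: float      # hard budget for the resources used

contract = QueryContract(max_latency_s=60, min_completeness=0.9,
                         max_staleness_h=24, max_cost_usd=5.0)
# A planner could then, for example, trade a 10% sample (lower completeness) for
# meeting the latency and cost bounds, and report which bounds it actually met.
```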
Second, we need to package our tools in the right way so that users can use them. There is a reason why Hadoop is so successful even though it has so many performance problems. In my opinion, one of the reasons is that it is not a database system. Yet, it can be a database system if combined with other tools of the Hadoop eco-system. For instance, it can be a transactional database system if combined with HDFS, Zookeeper, and HBase. However, it can also become a logging system to help customer support if combined with HDFS and SOLR. And, of course, it can
Blogger’s Profile:
Donald Kossmann is a professor in the Systems Group of the Department of Computer Science at ETH Zurich (Switzerland). He received his MS in 1991 from the University of Karlsruhe and completed his PhD in 1995 at the Technical University of Aachen. After that, he held positions at the University of Maryland, the IBM Almaden Research Center, the University of Passau, the Technical University of Munich, and the University of Heidelberg. He is a former associate editor of ACM Transactions on Database Systems and ACM Transactions on Internet Technology. He was a member of the board of trustees of the VLDB Endowment from 2006 until 2011, program committee chair of ACM SIGMOD 2009, and PC co-chair of VLDB 2004. He is an ACM Fellow. He is a co-founder of three start-ups in the areas of Web data management and cloud computing.
I was recently approached by an entrepreneur who had an interesting way to correlate the short-term performance of a stock with news reports about the stock. Needless to say, there are many places from which one can get the news, and the results one gets from this sort of analysis do depend on the input news sources. Surprisingly, within two minutes the conversation had drifted from the characteristics of news sources to the challenges of running SVM on Hadoop. The reason for this is not that Hadoop is the right infrastructure for this problem, but rather that the problem can legitimately be considered a Big Data problem and therefore, in the minds of many, must be addressed by running analytics in the cloud.
I have nothing against cloud services. In fact, I think they are an important part of the computational eco-system, permitting organizations to out-source selected aspects of their computational needs, and to provision peak capacity for load bursts. The map-reduce paradigm is a fantastic abstraction with which to handle tasks that are “embarrassingly parallelizable.” In short, there are many circumstances in which cloud services are called for. However, they are not always the solution, and are rarely the complete solution. For the stock price data analysis problem, based solely on the brief outline I’ve given you, one cannot say whether they are appropriate.
I have nothing against Support Vector Machines, or other machine learning techniques. They can be immensely useful, and I have used them myself in many situations. Scaling up these techniques for large data sets can be an issue, and certainly is a Big Data challenge. But for the problem at hand, I would be much more concerned about how it was modeled than how the model was scaled. What should the features be? Do we worry about duplicates in news appearances? Into how many categories should we classify news mentions? These are by far the more important questions to answer, because how we answer them can change what results we get: scaling better will only change how fast we get them.
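To illustrate where these modeling choices enter, here is a hedged scikit-learn sketch on made-up headlines; the label set, the duplicate wire story, and the feature configuration are exactly the knobs that change the answers, while better scaling would only change how fast they arrive.

```python
# Toy text-classification pipeline: the interesting decisions are the features,
# the treatment of duplicates, and the number of categories, not the scaling.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

headlines = [
    "ACME beats earnings expectations",
    "ACME beats earnings expectations",      # duplicate wire story: keep or drop?
    "ACME faces regulatory probe",
    "Analysts neutral on ACME outlook",
]
labels = ["positive", "positive", "negative", "neutral"]   # 3 classes? 2? 5?

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),  # feature choice
    LinearSVC(),
)
model.fit(headlines, labels)
print(model.predict(["ACME under regulatory investigation"]))
```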
It is hard to avoid mention of Big Data anywhere we turn today. There is broad recognition of the value of data, and products obtained through analyzing it. Industry is abuzz with the promise of big data. Government agencies have recently announced significant programs towards addressing challenges of big data. Yet, many have a very narrow interpretation of what that means, and we lose track of the fact that there are multiple steps to the data analysis pipeline, whether the data are big or small. At each step, there is work to be done, and there are challenges with Big Data.
The first step is data acquisition. Some data sources, such as sensor networks, can produce staggering amounts of raw data. Much of this data is of no interest, and it can be filtered and compressed by orders of magnitude. One challenge is to define these filters in such a way that they do not discard useful information. For example, in considering news reports, is it enough to retain only those that mention the name of a company of interest? Do we need the full report, or just a snippet around the mentioned name? The second big challenge is to automatically generate the right metadata to describe what data is recorded and how it is recorded and measured. This metadata is likely to be crucial to downstream analysis. For example, we may need to know the source for each report if we wish to examine duplicates.
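As a concrete illustration of these acquisition-time decisions, here is a minimal sketch; the company name, snippet size, and record layout are invented for the example.

```python
# Acquisition: filter raw reports, keep only a snippet, and attach the metadata
# (source, timestamp) that downstream analysis will need.
from datetime import datetime, timezone
from typing import Optional

COMPANY = "ACME"
SNIPPET_CHARS = 200   # how much context around the mention do we keep?

def acquire(report_text: str, source: str) -> Optional[dict]:
    """Return a filtered record with provenance metadata, or None if discarded."""
    pos = report_text.find(COMPANY)
    if pos == -1:
        return None   # filter: drop reports that never mention the company of interest
    start = max(0, pos - SNIPPET_CHARS // 2)
    return {
        "snippet": report_text[start:start + SNIPPET_CHARS],
        "source": source,   # needed later, e.g., to detect duplicate wire stories
        "acquired_at": datetime.now(timezone.utc).isoformat(),
    }
```

Whether such a filter discards useful information (is a 200-character snippet enough?) is precisely the judgment call described above.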
Frequently, the information collected will not be in a format ready for analysis. The second step is an information extraction process that pulls out the required information from the underlying sources and expresses it in a structured form suitable for analysis. A news report will get reduced to a concrete structure, such as a set of tuples, or even a single class label, to facilitate analysis. Furthermore, we are used to thinking of Big Data as always telling us the truth, but this is actually far from reality. We have to deal with erroneous data: some news reports are inaccurate.
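Continuing the hypothetical news example from the acquisition sketch above, the extraction step might reduce each retained record to a single tuple; the rule-based labeling below is deliberately naive and merely stands in for whatever NLP or learned extractor is actually used.

```python
# Extraction: turn a free-text record into a structured tuple ready for analysis.
from typing import NamedTuple

class NewsTuple(NamedTuple):
    company: str
    label: str    # e.g., the single class label the downstream analysis needs
    source: str

NEGATIVE_CUES = ("probe", "lawsuit", "recall", "downgrade")

def extract(record: dict) -> NewsTuple:
    text = record["snippet"].lower()
    label = "negative" if any(cue in text for cue in NEGATIVE_CUES) else "non-negative"
    return NewsTuple(company="ACME", label=label, source=record["source"])
# Erroneous or inaccurate reports pass through this step unchanged, which is why
# the pipeline also has to cope with untrustworthy data.
```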
Data analysis is considerably more challenging than simply locating, identifying, understanding, and citing data. For effective large-scale analysis all of this has to happen in a completely automated manner. This requires differences in data structure and semantics to be expressed in forms that are computer understandable, and then “robotically” resolvable. Even for simpler analyses that depend on only one data set, there remains an important question of suitable database design. Usually, there will be many alternative ways in which to store the same information. Certain designs will have advantages over others for certain purposes, and possibly drawbacks for other purposes.
Mining requires integrated, cleaned, trustworthy, and efficiently accessible data, declarative query and mining interfaces, scalable mining algorithms, and big-data computing environments. A problem with current Big Data analysis is the lack of coordination between database systems, which host the data and provide SQL querying, with analytics packages that perform various forms of non-SQL processing, such as data mining and statistical analyses. Today’s analysts are impeded by a tedious process of exporting data from the database, performing a non-SQL process and bringing the data back.
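The round trip described above looks roughly like the following sketch, with SQLite and pandas standing in for the real database and analytics package; the table and column names are illustrative only.

```python
import sqlite3
import pandas as pd

# Toy database standing in for the SQL store that hosts the extracted tuples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news_tuples (company TEXT, label TEXT, source TEXT)")
conn.executemany("INSERT INTO news_tuples VALUES (?, ?, ?)",
                 [("ACME", "negative", "wire"), ("ACME", "non-negative", "blog"),
                  ("ACME", "negative", "wire")])

# 1. Export: pull the structured tuples out of the database.
df = pd.read_sql_query(
    "SELECT company, label, source FROM news_tuples WHERE company = 'ACME'", conn)

# 2. Analyze: non-SQL processing in an external package (a toy aggregate here;
#    in practice a mining or statistics library).
summary = df.groupby("source")["label"].value_counts().rename("n").reset_index()

# 3. Import: write the results back so SQL consumers can use them.
summary.to_sql("label_counts_by_source", conn, if_exists="replace", index=False)
conn.close()
# Every hop copies the data out of and back into the database, which is exactly the
# lack of coordination between database systems and analytics packages noted above.
```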
Having the ability to analyze Big Data is of limited value if users cannot understand the analysis. Ultimately, a decision-maker, provided with the result of an analysis, has to interpret these results. Usually, this involves examining all the assumptions made and retracing the analysis. Furthermore, as we saw above, there are many possible sources of error: computer systems can have bugs, models almost always have assumptions, and results can be based on erroneous data. For all of these reasons, users will try to understand, and verify, the results produced by the computer. The computer system must make it easy for them to do so by providing supplementary information that explains how each result was derived, and based upon precisely what inputs.
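One lightweight way to provide such supplementary information is to attach provenance to every derived result; the record structure below is my own sketch, not an established standard.

```python
# Each result carries enough provenance for a decision-maker to retrace it.
from dataclasses import dataclass, field

@dataclass
class Result:
    value: float
    derivation: str                                 # which computation produced the value
    inputs: list = field(default_factory=list)      # exactly which inputs it was based on
    assumptions: list = field(default_factory=list) # modeling assumptions made along the way

r = Result(
    value=0.67,
    derivation="share of negative ACME mentions in label_counts_by_source",
    inputs=["wire:2012-08-01", "blog:2012-08-01"],
    assumptions=["duplicates collapsed", "rule-based sentiment labels"],
)
print(r)
```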
In short, there is a multi-step pipeline required to extract value from data. Heterogeneity, incompleteness, scale, timeliness, privacy and process complexity give rise to challenges at all phases of the pipeline. Furthermore, this pipeline isn’t a simple linear flow – rather there are frequent loops back as downstream steps suggest changes to upstream steps. There is more than enough here that we in the database research community can work on.
To highlight this fact, several of us got together electronically last winter and wrote a white paper, available at http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf. Please read it and tell us what you think. The database community came very late to much of the web. We should make sure not to miss the boat on Big Data.
My post is loosely based on an extract from this white paper, which was created through a distributed conversation among many prominent researchers listed below.
Divyakant Agrawal, UC Santa Barbara
Blogger’s Profile:
H. V. Jagadish is the Bernard A. Galler Collegiate Professor of Electrical Engineering and Computer Science and Director of the Software Systems Research Laboratory at the University of Michigan, Ann Arbor. He is well known for his broad-ranging research on information management, and particularly its use in biology, medicine, telecommunications, finance, engineering, and the web. He is an ACM Fellow and founding Editor-in-Chief of PVLDB. He serves on the board of the Computing Research Association.