“Both theoretical and empirical research may be unnecessarily complicated by failure to recognize the effects of heterogeneity” – Vaupel & Yashin
Big Data is daily topic of conversation among data analysts, with much said and written about its promises and pitfalls. The issue of heterogeneity, however, has received scant attention. This is unfortunate, since failing to take heterogeneity into account can easily derail the discoveries one makes using these data.
This issue, which some may recognize as an example of ecological fallacy, first came to my attention via a paper elegantly titled “Heterogeneity’s ruses: some Surprising Effects of Selection on Population Dynamics” (Vaupel and Yashin, 1985). Authors discuss a variety of examples where the aggregated behavior of a heterogeneous population, composed of two homogeneous but differently behaving subpopulations, will differ from the behavior of any single individual. Consider the following example. It has been observed that the recidivism rate of convicts released from prison declines with time. A natural conclusion one may reach from this observation is that former convicts are less likely to commit crime as they age. However, this is false. In reality, there may be two groups of individuals “reformed” and “incorrigible” with constant – but different – recidivism rates. With time, there will be more “reformed” individuals left in the population, as the “incorrigibles” are sent back to prison, resulting in decreasing recidivism rate for the population as a whole. This simple example shows that “the patterns observed [at population level] may be surprisingly different from the underlying patterns on the individual level. Researchers interested in uncovering these individual patterns, perhaps to help develop or test theories or to make predictions, might benefit from an “understanding of heterogeneity’s ruses.” (Vaupel & Yashin)
My colleagues and I have been tricked by heterogeneity time and again. As one example, our study of information spread on the follower graphs of Twitter and Digg revealed that it was surprisingly different from the simple epidemics that are often used to model information spread. In a simple epidemic, described, for example, as an independent cascade model, the probability of infection increases monotonically with the number of exposures to infected friends. This probability is measured by the exposure response function. The figure below shows the exposure response function we measured on Twitter: the probability for becoming infected (i.e., retweet) information (a URL) on Twitter as a function of how many friends had previously tweeted this information. In contrast to epidemics, it appears as though repeated exposure to information suppresses infection probability. We measured an even more pronounced suppression of infection on Digg [Ver Steeg et al, 2011], and a similar exposure response was observed for hashtag adoption following friends’ use of them [Romero et al, 2011].
It is easy to draw wrong conclusions from this finding. In “What stops social epidemics?” [Ver Steeg et al, 2011], we reported that information spread on Digg is quickly extinguished, and attributed this to the exposure response function. We speculated that initial exposures “inoculate” users to information, so that they will not become infected (i.e., propagate it) despite multiple exposures. Now we know this explanation was completely wrong.
The exposure response function, while aggregated over all users, does not describe the behavior of any individual Digg or Twitter user – even the hypothetical “typical” user. In fact, there is no “typical” Twitter (or Digg) user. Twitter users are extremely heterogeneous. Separating them into more homogeneous sub-populations reveals a more regular pattern. Figure 2 shows the exposure response function for different populations of Twitter users, separated according to the number of friends they follow (large fluctuations are the result of small sample size). Why number of friends? This is explained in more detail in our papers [Hodas & Lerman 2012, 2013], but in short, we found it useful to separate users according to their cognitive load, i.e., the volume of information they receive, which is (on average) proportional to the number of friends they follow [Hodas et al, 2013]. Now, the probability that a user within each population will become infected increases monotonically with the number of infected, very similar to the predictions of the independent cascade model.
Figure 2 has a different, more significant interpretation, with consequences for information diffusion. It suggests that highly connected users, i.e., those who follow many others, are less susceptible to becoming infected. Their decreased susceptibility in fact explains Figure 1: as one moves to the right of the exposure response curve, only the better connected, and less sensitive, users contribute to that portion of the response. However, despite their reduced susceptibility, highly connected users respond positively to repeated exposures, like all other users. You do not inhibit response by repeatedly exposing people to information. Instead, the reason that these users are less susceptible hinges on the human brain’s limited bandwidth. There are only so many tweets any one can read, the more tweets you receive (on average proportional to the number of friends you follow), the less likely you are to see – and retweet – any specific tweet. If it was not for recognizing heterogeneity, we would not have found this far more interesting explanation.
| Blogger’s Profile:
Kristina Lerman is a Project Leader at the University of Southern California Information Sciences Institute and holds a joint appointment as a Research Associate Professor in the USC Computer Science Department. After a brief stint as a theoretical roboticist, she found her calling in blending together methods from physics, computer science and social science to address problems in social computing and social media analysis. She writes many papers that are greatly enjoyed by all of their twenty readers.
I was recently approached by an entrepreneur who had an interesting way to correlate short term performance of a stock with news reports about the stock. Needless to say, there are many places from which one can get the news, and what results one gets from this sort of analysis does depend on the input news sources. Surprisingly, within two minutes the conversation had drifted from characteristics of news sources to the challenges of running SVM on Hadoop. The reason for this is not that Hadoop is the right infrastructure for this problem. But rather that the problem can legitimately be considered a Big Data problem. In consequence, in the minds of many, it must be addressed by running analytics in the cloud.
I have nothing against cloud services. In fact, I think they are an important part of the computational eco-system, permitting organizations to out-source selected aspects of their computational needs, and to provision peak capacity for load bursts. The map-reduce paradigm is a fantastic abstraction with which to handle tasks that are “embarrassingly parallelizable.” In short, there are many circumstances in which cloud services are called for. However, they are not always the solution, and are rarely the complete solution. For the stock price data analysis problem, based solely on the brief outline I’ve given you, one cannot say whether they are appropriate.
I have nothing against Support Vector Machines, or other machine learning techniques. They can be immensely useful, and I have used them myself in many situations. Scaling up these techniques for large data sets can be an issue, and certainly is a Big Data challenge. But for the problem at hand, I would be much more concerned about how it was modeled than how the model was scaled. What should the features be? Do we worry about duplicates in news appearances? Into how many categories should we classify news mentions? These are by far the more important questions to answer, because how we answer them can change what results we get: scaling better will only change how fast we get them.
It is hard to avoid mention of Big Data anywhere we turn today. There is broad recognition of the value of data, and products obtained through analyzing it. Industry is abuzz with the promise of big data. Government agencies have recently announced significant programs towards addressing challenges of big data. Yet, many have a very narrow interpretation of what that means, and we lose track of the fact that there are multiple steps to the data analysis pipeline, whether the data are big or small. At each step, there is work to be done, and there are challenges with Big Data.
The first step is data acquisition. Some data sources, such as sensor networks, can produce staggering amounts of raw data. Much of this data is of no interest, and it can be filtered and compressed by orders of magnitude. One challenge is to define these filters in such a way that they do not discard useful information. For example, in considering news reports, is it enough to retain only those that mention the name of a company of interest? Do we need the full report, or just a snippet around the mentioned name? The second big challenge is to automatically generate the right metadata to describe what data is recorded and how it is recorded and measured. This metadata is likely to be crucial to downstream analysis. For example, we may need to know the source for each report if we wish to examine duplicates.
Frequently, the information collected will not be in a format ready for analysis. The second step is an information extraction process that pulls out the required information from the underlying sources and expresses it in a structured form suitable for analysis. A news report will get reduced to a concrete structure, such as a set of tuples, or even a single class label, to facilitate analysis. Furthermore, we are used to thinking of Big Data as always telling us the truth, but this is actually far from reality. We have to deal with erroneous data: some news reports are inaccurate.
Data analysis is considerably more challenging than simply locating, identifying, understanding, and citing data. For effective large-scale analysis all of this has to happen in a completely automated manner. This requires differences in data structure and semantics to be expressed in forms that are computer understandable, and then “robotically” resolvable. Even for simpler analyses that depend on only one data set, there remains an important question of suitable database design. Usually, there will be many alternative ways in which to store the same information. Certain designs will have advantages over others for certain purposes, and possibly drawbacks for other purposes.
Mining requires integrated, cleaned, trustworthy, and efficiently accessible data, declarative query and mining interfaces, scalable mining algorithms, and big-data computing environments. A problem with current Big Data analysis is the lack of coordination between database systems, which host the data and provide SQL querying, with analytics packages that perform various forms of non-SQL processing, such as data mining and statistical analyses. Today’s analysts are impeded by a tedious process of exporting data from the database, performing a non-SQL process and bringing the data back.
Having the ability to analyze Big Data is of limited value if users cannot understand the analysis. Ultimately, a decision-maker, provided with the result of analysis, has to interpret these results. Usually, this involves examining all the assumptions made and retracing the analysis. Furthermore, as we saw above, there are many possible sources of error: computer systems can have bugs, models almost always have assumptions, and results can be based on erroneous data. For all of these reasons, users will try to understand, and verify, the results produced by the computer. The computer system must make it easy for her to do so by providing supplementary information that explains how each result was derived, and based upon precisely what inputs.
In short, there is a multi-step pipeline required to extract value from data. Heterogeneity, incompleteness, scale, timeliness, privacy and process complexity give rise to challenges at all phases of the pipeline. Furthermore, this pipeline isn’t a simple linear flow – rather there are frequent loops back as downstream steps suggest changes to upstream steps. There is more than enough here that we in the database research community can work on.
To highlight this fact, several of us got together electronically last winter, and wrote a white paper, available at http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf . Please read it, and say what you think. The database community came very late to much of the web. We should make sure not to miss the boat on Big Data.
My post is loosely based on an extract from this white paper, which was created through a distributed conversation among many prominent researchers listed below.
Divyakant Agrawal, UC Santa Barbara
| Blogger’s Profile:
H. V. Jagadish is Bernard A Galler Collegiate Professor of Electrical Engineering and Computer Science and Director of the Software Systems Research Laboratory at the University of Michigan , Ann Arbor. He is well-known for his broad-ranging research on information management, and particularly its use in biology, medicine, telecommunications, finance, engineering, and the web. He is an ACM Fellow and founding Editor in Chief of PVLDB. He serves on the board of the Computing Research Association.
tl;dr: MADlib is an open-source library of scalable in-database algorithms for machine learning, statistics and other analytic tasks. MADlib is supported with people-power from Greenplum; researchers at Berkeley, Florida and Wisconsin are also contributing. The project recently released a MADlib TR, and is now welcoming additional community contributions.
Warehousing → Science
Back in 2008, I had the good fortune to fall in with a group of data professionals documenting new usage patterns in scalable analytics. It was an interesting team: a computational advertising analyst at a large social networking firm, a seasoned DBMS consultant formerly employed at a major Internet retailer, a pair of DBMS engine developers and an academic.
The usage patterns we were seeing represented a shift from accountancy to analytics—from the cautious record-keeping of “Data Warehousing” to the open-ended, predictive task of “Data Science”. This shift was turning many Data Warehousing tenets on their heads. Rather than “architecting” an integrated permanent record that repelled data until it was well-conditioned, the groups we observed were interested in fostering a data-centric computational “watering hole”, where analysts could bring any kind of relevant data into a shared infrastructure, and experiment with ad-hoc integration and rich algorithmic analysis at very large scales.
In response to the dry TLAs of Data Warehousing, we dubbed this usage model MAD, to reflect
We wrote the MAD Skills paper in VLDB 2009 to capture these practices in broad terms. The paper describes the usage patterns mentioned above in more detail. It also includes a fairly technical section with a number of non-trivial analytics techniques adapted from the field, implemented via simple SQL excerpts.
MADlib (MAD Skills, the SQL)
When we released the MAD Skills paper, many people were interested not only in its design aspects, but also in the promise of sophisticated statistical methods in SQL. This interest came from multiple directions: DBMS customers were requesting it of consultants and vendors, and academics were increasingly publishing papers on in-database analytics. What was missing was a software framework to harness the energy of the community, and connect the various interested constituencies.
To this end, a group formed to build MADlib, a free, open-source library of SQL-based algorithms for machine learning, statistics, and related analytic tasks. The methods in MADlib are designed both for in- and out-of-core execution, and for the shared-nothing, “scale-out” parallelism offered by modern parallel database engines, ensuring that computation is done close to the data. The core functionality is written in declarative SQL statements, which orchestrate data movement to and from disk, and across networked machines. Single-node inner loops take advantage of SQL extensibility to call out to high-performance math libraries (currently, Eigen) in user-defined scalar and aggregate functions. At the highest level, tasks that require iteration and/or structure definition are coded in Python driver routines, which are used only to kick off the data-rich computations that happen within the database engine.
The primary goal of the MADlib open-source project is to accelerate innovation and technology transfer in the Data Science community via a shared library of scalable in-database analytics, much as the CRAN library serves the R community. Unlike CRAN, which is customized to the R analytics tool, we hope that MADlib’s grounding in standard SQL can result in community ports to a variety of parallel database engines.
Open-Source Algorithms in Parallel DBMSs?
The state of scalable analytics today depends very much on who you talk to.
The motivation for considering parallel databases comes from both the database market and technology issues. There is a large and growing installed base of massively parallel commercial DBMSs in industry, fueled in part by a recent wave of startup acquisitions. Meanwhile, it is no surprise to database researchers that a massively parallel DBMS is a powerful platform for dataflow programming of sophisticated analytic algorithms. Research on sophisticated in-database analytics has been growing in recent years, in part as an offshoot of work on Probabilistic Databases. Education is hopefully shifting as well. For example, in my own CS186 database course this spring, the students not only wrote traditional SQL queries, they also had to implement a non-trivial social network analysis algorithm in SQL (betweenness centrality).
The open-source nature of MADlib represents a serious commitment by the entire team, and differs from the proprietary approaches traditionally associated with DBMS vendors. The decision to go open-source was motivated by a number of goals, including:
MADlib is still young, at Version 0.3. The initial versions focused on establishing infrastructure and a baseline of textbook and some advanced methods; this initial suite actually covers a fair bit of ground (Table 1). Most methods were chosen because they were frequently requested from customers we met through contacts at Greenplum. More recently, we made a point of validating MADlib as a research vehicle, by fostering a small number of university groups who were working in the area to experiment with the platform and get their code disseminated. Profs. Chris Ré at Wisconsin and Daisy Wang at Florida have written up their work in a MADLib tech report that expands upon this post.
MADlib is currently ported to PostgreSQL (single-node, open-source) and Greenplum (shared-nothing parallel, commercial). Greenplum inherits the PostgreSQL extensibility interfaces almost completely, so these two ports were easy to pursue simultaneously in the early days of the project. Another attraction of Greenplum is that it offers a free download of a massively parallel DBMS for researchers, so there is no limitation on scaling experiments. (This is surprisingly unusual: most DBMS vendors still only advertise free trial downloads of “crippleware” that artificially limits database size or the number of nodes. I would imagine that market forces will change this story relatively soon.)
MADlib is hosted publicly at github, and readers are encouraged to browse the code and documentation via the MADlib website. The initial MADlib codebase reflects contributions from both industry (a team at Greenplum) and academia (Berkeley, Wisconsin, Florida). Project oversight and Quality Assurance efforts have been contributed by Greenplum. Our MADlib TR expands on the architecture and status, and also includes extensive discussion of related work.
At this time, MADlib is ready to consider contributions from additional parties, including both new methods and ports to new platforms. Like any serious open-source project, contributions will have to be managed carefully to maintain code quality. I hope that more researchers will find it worthwhile to contribute serious code to the MADlib effort. It’s a bit more work than getting an algorithm ready to run experiments in a paper, but it’s really satisfying to develop and refine production-quality open-source code, and get it delivered to end-users. If you are doing research on scalable analytic methods, consider going the extra mile and contributing your code to the MADlib effort.
For more information on MADlib, please see the website at http://madlib.net.
Thanks to Chris Ré, Florian Schoppmann and Daisy Wang for their help writing up the recent MADlib TR that this post excerpts, and to Azza Abouzied, Peter Bailis, and Neil Conway for feedback on this version.
Joseph M. Hellerstein is a Chancellor’s Professor of Computer Science at the University of California, Berkeley, whose work focuses on data-centric systems and the way they drive computing. He is an ACM Fellow, an Alfred P. Sloan Research Fellow and the recipient of two ACM-SIGMOD “Test of Time” awards for his research. In 2010, Fortune Magazine included him in their list of 50 smartest people in technology , and MIT’s Technology Review magazine included his Bloom language for cloud computing on their TR10 list of the 10 technologies “most likely to change our world”. A past research lab director for Intel, Hellerstein maintains an active role in the high tech industry, currently serving on the technical advisory boards of a number of computing and Internet companies including EMC, SurveyMonkey, Platfora and Captricity.