Entrepreneurship in Data Management Research

Chen Li

As data becomes increasingly important in our society, many successful companies are doing data-related business. The field is growing so fast that many new startups are launched with the goal of becoming the “next Google.” This trend also provides many entrepreneurship opportunities for our community working on data management research. This blog describes my experience of building a startup (called SRCH2, http://www.srch2.com/) that commercializes university research. It also shares my own perspective on entrepreneurship in data management research.

This blog is based on the talk that I gave at the DBRank workshop at VLDB 2012 and the talk slides are available on my homepage.

SRCH2: Commercializing Data Management Research

One of the research topics I work on at UC Irvine is related to powerful search. It started when I talked to people at the UCI Medical School and asked the question: “What are your data management problems?” One of the challenges they were facing was record linkage, i.e., identifying that two records from different data sources represent the same real-world entity. An important problem in this context is approximate string search, that is, supporting queries with fuzzy matching predicates, such as finding records with keywords similar to the name of the former California “Terminator” governor. While looking into the details, I realized that the problem was not solved on large data sets, so I started leading a research team to work on it. After several years, we developed several novel techniques and released an open-source C++ package called Flamingo (http://flamingo.ics.uci.edu/), which received a lot of attention from academia and industry. I also took a leave from UCI to work as a visiting scientist at Google, and this experience was very beneficial. It not only showed me how large companies manage data management projects and solve challenging problems, but also taught me how to manage a research team in a university setting.
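To make approximate string search concrete, here is a minimal Python sketch. It assumes edit distance (Levenshtein distance) as the similarity measure and simply scans every keyword; the function names and the sample keywords are hypothetical. It is not how Flamingo works internally, since a linear scan like this is exactly what does not scale to large data sets.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_search(query: str, keywords: list[str], max_distance: int) -> list[str]:
    """Return the keywords within max_distance edits of the query (case-insensitive)."""
    q = query.lower()
    return [k for k in keywords if edit_distance(q, k.lower()) <= max_distance]


# Hypothetical keyword list; the query is a misspelling of the governor's name.
matches = fuzzy_search("Schwarzeneger", ["Schwarzenegger", "Schwarzkopf", "Davis"], 2)
```

With a threshold of 2, the misspelled query still matches the intended keyword, while the other two keywords are too far away to qualify.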

In 2008, when pushing our research to the UCI community, we identified one “killer app” domain: people search. We developed a system prototype called PSearch (http://psearch.ics.uci.edu/) that supports instant and error-tolerant search. The system gradually became popular on the campus and many people began using it on a daily basis. Many of them told me their personal stories in which they were able to find people quickly, despite their vague recall of names. Meanwhile, collaborating with colleagues at Tsinghua University, we were able to scale the techniques to larger data sets and developed another system called iPubmed (http://ipubmed.ics.uci.edu), which enabled the same features on 21 million MEDLINE publications. We also developed techniques in other domains, such as geo search.

As our systems became more and more popular, I very often got requests from users asking: “Can I run your engine on my own data sets?” As a former PhD student from the Stanford Database Group, the home of many successful companies such as Junglee, Google, and Aster Data, I always had the dream of doing my own startup. Then the answer became very natural: “Why don’t we commercialize the results?” So I incorporated a company in 2008, which was initially called “Bimaple” and was recently renamed SRCH2 to better describe its search-related business. SRCH2 has developed a search engine (built from the ground up in C++) targeting enterprises that want to enable a Google-like search interface for their customers. It offers a solution similar to Lucene and Sphinx Search, but with more powerful features such as instant search, error correction, geo support, real-time updates, and customizable ranking. Its first products have been developed and it has paying customers.

(Good) Lessons Learned

In the four years of running the company so far, I have learned many things that were beyond my imagination. Here are some of the (good) lessons learned so far.

  • First Jump: from Technology to Products.
    To turn novel techniques into a successful business, we need to jump over two challenging gaps. The first one is between the techniques and products. In the research phase (especially at universities) we tend to focus on new ideas and prototyping as a proof of concept. However, the fact that “it works!” doesn’t mean “it’s a product.” A significant amount of effort needs to be put into product development to make sure the product is reliable, easy to use, and able to meet customers’ needs. Talking to customers is very eye-opening, and very often they mention new features that are indeed very challenging from a research perspective, such as concurrency control and real-time updates, to name a few.
  • Second Jump: from Products to Business.
    Good products don’t necessarily mean a successful business. A lot of effort is needed in non-technical areas, such as marketing, sales, fundraising, accounting, and legal paperwork. With a technical background, many researchers (including myself) are not “built” to be good at everything, so we need to find good partners to develop the company in these directions. Finding the right partners to work with is therefore extremely important, and I am very happy that SRCH2 has a strong team of members with complementary backgrounds.
  • Gaining Hands-On Experiences.
    As a researcher, I have always tried to be hands-on in projects, and it’s one of the reasons for the success of the Flamingo project. A startup needs this skill even more, and product development needs good software engineering. This experience and training benefit my research as well, since I can give students a lot of low-level suggestions on their research.
  • Better Balancing Skills.
    It’s challenging to balance between a faculty job and a startup, since both are very demanding, not to mention we have a family :-). This situation requires stronger skills to manage time, work efficiently, and communicate well with people.

In summary, my entrepreneurship experiences have been challenging but enjoyable and educational. I hope more of you take the adventure and commercialize your research. It can help you “think different.”

Blogger’s Profile:
Chen Li is a professor in the Department of Computer Science at the University of California, Irvine. He received his Ph.D. degree in Computer Science from Stanford University in 2001, and his M.S. and B.S. in Computer Science from Tsinghua University, China, in 1996 and 1994, respectively. He received a National Science Foundation CAREER Award in 2003 and many other NSF grants and industry gifts. He was once a part-time Visiting Research Scientist at Google. His research interests are in the fields of data management and information search, including text search and data-intensive computing. He was a recipient of the SIGMOD 2012 Test-of-Time award. He is the founder of SRCH2, a company providing powerful search solutions for enterprises and developers.

A Few More Words on Big Data

Big Data is the buzzword in the database community these days. Two of the first three blog entries of the SIGMOD blog are on Big Data. There was a plenary research session with invited talks at the 2012 SIGMOD Conference and there will be a panel at the 2012 VLDB Conference. Probably, everything has already been said that can be said. So, let me just add my own personal data point to the sea of existing opinions and leave it to the reader whether I am adding to the “signal” or adding to the “noise”. This blog entry is based on the talk that I gave at SIGMOD 2012 and the slides of that talk can be found at http://www.systems.ethz.ch/Talks .

Up front, I would like to make clear that I am a believer. Stepping back, I am asking myself why I work on Big Data technologies. I came up with two potential reasons:

  1. because we want to make the world a better place and
  2. because we can.

In the following, I would like to explain my personal view on these two reasons.

Making the World a Better Place

The real question to ask is whether bigger = smarter. The simple answer is “yes”. The success of services like Google and Bing is evidence for the “bigger = smarter” principle. The more data you have and can process, the higher the statistical relevance of your analysis and the better answers you get. Furthermore, Big Data allows you to make statements about corner cases and the famous “long tail”. Putting it differently, “experience” is more valuable than “thinking”.

The more complicated answer to the question whether bigger is smarter is “I do not know”. My concern is that the bigger Big Data gets, the more difficult we make it for humans to get involved. Who wants to argue with Google or Bing? At the end, all we can do is trust the machine learning. However, Big Data analytics needs as much debugging as any other software we produce and how can we help people to debug a data-driven experiment with 5 PB of data? Putting it differently, what do you make out of an experiment that validates your hypothesis with 5 PB of data but does not validate your hypothesis with, say, 1 KB of data using the same piece of code? Should we just trust the “bigger = smarter” principle and use the results of the 5 PB experiment to claim victory?

The more fundamental problem is that Big Data technologies tempt us into doing experiments for which we have no ground truth. Often, the absence of a ground truth is the reason for using Big Data: If we knew the answer already, we would not need Big Data. Despite all the mathematical and statistical tools that are available today, however, debugging a program without knowing what the program should be doing is difficult. To give an example: Let us assume that a Big Data study revealed that the leftmost lane is the fastest lane in a traffic jam. What does this result mean? Does it mean that we should all be driving in the left lane? Does it mean that people in the left lane are more aggressive? Or does it mean that people in the left lane just believe that they are faster? This example combines all the problems of discovering facts without a ground truth: By asking the question, you are biasing the result. And by getting a result, you might be biasing the future result, too. (And, of course, if you had done the same study only looking at data from Great Britain, you might have come to the opposite conclusion that the rightmost lane is the fastest.)

Google Translate is a counterexample and clearly a Big Data success story: Here, we do know the ground truth and Google developers are able to debug and improve Google Translate based on that ground truth – at least as long as we trust our own language skills more than we trust Google. (When it comes to spelling, I actually already trust Google and Bing more than I trust myself. :-( )

Maybe all I am trying to say is that we need to be more careful about what we promise and must not forget to keep the human in the loop. I trust statisticians that “bigger is smarter”, but I also believe that humans are even smarter and the combination is what is needed, thereby letting each party do what it is best at.

Because We Can

Unfortunately, we cannot make humans become smarter (and we should not even try), but we can try to make Big Data bigger. Even though I argued in the previous section that it is not always clear that bigger Big Data makes the world a better or smarter place, we as a data management community should be constantly pushing to make Big Data bigger. That is, we should build data management tools that scale, perform well, and are cost effective and get continuously better in all regards. Honestly, I do not know how that will make the world a better place, but I am optimistic that it will: History teaches that good things will happen if you do good work. Also, we should not be shy to make big promises such as processing 100 PB of heterogeneous data in real-time – if that is what our customers want and are willing to pay for. We should also continue to encourage people to collect all the data and then later think about what to do with it. If there are risks in doing all that (e.g., privacy risks), we need to look at those, too, and find ways to reduce those risks and still become better at our core business of becoming bigger, faster, and cheaper. We might not be able to keep all these promises, but making these promises will keep us busy and at least we understand why we failed, rather than mumbling about traffic jams or other phenomena outside of our area of expertise.

There are two things that we need to change, however. First, we need to build systems that are explicit about the utility / cost tradeoff of Big Data. Mariposa pioneered this idea in the Nineties; in Mariposa, utility was defined as response time (the faster the higher the utility), but now things get more complicated: With Big Data, utility may include data quality, data diversity, and other statistical metrics of the data. We need tools and abstractions that allow users to explicitly specify and control these metrics.

Second, we need to package our tools in the right way so that users can use them. There is a reason why Hadoop is so successful even though it has so many performance problems. In my opinion, one of the reasons is that it is not a database system. Yet, it can be a database system if combined with other tools of the Hadoop eco-system. For instance, it can be a transactional database system if combined with HDFS, Zookeeper, and HBase. However, it can also become a logging system to help customer support if combined with HDFS and SOLR. And, of course, it can easily become a data warehouse system and a great tool for scientists together with Mahout. Quoting Mike Stonebraker again, one size does not fit all. The lesson to learn from this observation, however, is not to build a different, dedicated system for each use case with a significant market. The more important lesson to learn is to repackage our technology and define the right, general-purpose building blocks that, if put together, can solve a large variety of different use cases. As a community, I think that we are paying too little attention to defining the right interfaces and abstractions for our technology.

Blogger’s Profile:
Donald Kossmann is a professor in the Systems Group of the Department of Computer Science at ETH Zurich (Switzerland). He received his MS in 1991 from the University of Karlsruhe and completed his PhD in 1995 at the Technical University of Aachen. After that, he held positions at the University of Maryland, the IBM Almaden Research Center, the University of Passau, the Technical University of Munich, and the University of Heidelberg. He is a former associate editor of ACM Transactions on Databases and ACM Transactions on Internet Technology. He was a member of the board of trustees of the VLDB endowment from 2006 until 2011, and he was the program committee chair of the ACM SIGMOD Conf., 2009 and PC co-chair of VLDB 2004. He is an ACM Fellow. He has been a co-founder of three start-ups in the areas of Web data management and cloud computing.

Computer science publication culture: where to go from here?

Computer science publication culture and practices have become an active discussion topic. Moshe Vardi has written a number of editorials in Communications of the ACM on the topic that can be found here and here, and these have generated considerable discussion. The conversation on this issue has been expanding and Jagadish has collected the writings on Scholarly Publications for CRA, which is a valuable resource.

The database community has pioneered discussions on publication issues. We have had panels at conferences, discussions during business meetings, informal conversations during conferences, discussions within SIGMOD Executive and the VLDB Endowment Board – we have been at this since about 2000. I wrote about one aspect of this back in 2002 in my SIGMOD Chair’s message.

The initial conversation in the database community was due to the significant increase in the number of submitted papers to our conferences that we were experiencing year-after-year. The increasing number of submissions had started to severely stress our ability to meaningfully manage the conference reviewing process. It became quite clear, quite quickly, to a number of us that the overriding problem was our over-reliance on conferences that were not designed to fulfill the role that we were pushing them to play: being the final archival publication venues. I argued this point in my 2002 SIGMOD Chair’s message that I mentioned above. I ended that message by stating that we “have been very successful over the years in convincing tenure and promotion committees and university bodies about the value of the conferences (rightfully so), we now have to convince ourselves that journals are equally valuable and important venues to publish fuller research results.” The same topic was the focus of my presentation on the panel on “Paper and Proposal Reviews: Is the Process Flawed?” that Hank Korth organized at the 2008 CRA Snowbird Conference (the report of the panel appeared in SIGMOD Record and can be accessed here).

This discussion needs to start with our objectives. In an ideal world, what we want are:

  • Fast decisions on our papers so we know the result quickly;
  • Fast dissemination of the accepted results;
  • Meaningful and full reviews of our submissions; and
  • Fuller description of our research.

The conventional wisdom is that conferences are superior on the first two points and the third point is something we can tinker with (and we have been tinkering with for quite a while with mixed results) while the fourth objective is addressed by a combination of increasing conference paper page limits, decreasing font sizes so we can pack more material per page, and the practice of submitting fuller versions of conference papers to journals. Data suggest that the first issue does not hold – our top journals now have first round review times that are competitive with “traditional” conferences (e.g., SIGMOD and ICDE). The second issue can be addressed by adopting a publication business model that relies primarily on on-line dissemination with print copies released once per volume – this way you don’t wait for print processing, nor do you have to worry about page budgets and the like. Note that I am not talking about “online-first” models, but actually publishing the final version of the paper online as soon as the final version can be produced after acceptance. Journals perform much better on the last two points.

In my view, in the long run, we will follow other science and engineering disciplines and start treating journals as the main outlet for disseminating our research results. However, the road from here to there is not straightforward and there are a number of alternatives that we can follow. Accepting the fact that we, as a community, are not yet willing to give up on the conference model of publication, what are some of the measures we can take? Here are some suggestions:

  1. We should move away from a single submission deadline. With a single deadline per year, we tend to submit papers that may not yet be ready since the overhead of waiting for an entire year is far too high. This has many drawbacks as one can imagine. Multiple and frequent submission deadlines encourage authors to continue working and submit at the next deadline. The frequency of submissions should be at least bi-monthly. Less frequent submissions are problematic in that the temptation to submit even if a paper is not yet ready will be too high.
  2. We should move to a journal-style reviewing process. This means in-depth reviews with multiple rounds where the authors can engage in a discussion with the reviewers and editors. This is perhaps the most important advantage of journals as it encourages reviewers to invest the time to do it right rather than the uneven conference reviews that we frequently complain about. With multiple submission deadlines, the review load at each cycle should be more manageable and should allow for more proper reviews.
  3. Publication of the accepted papers can happen in a number of ways. Although I have a preference for publishing as soon as the papers are accepted, this is not the most important issue – if necessary for other reasons, the papers can be accepted throughout the year, but they can be published all at once at the conference, as proceedings are done today.

These are things that we currently do – Proceedings of VLDB (PVLDB) incorporates these suggestions. It represents the current thinking of the VLDB Endowment Board after many years of discussions. Although I had some reservations at the beginning, I have become convinced that it is better than our traditional conferences. However, I am suggesting going further:

  1. We should not have conference PCs that change every year. We should have PCs that serve longer and provide some continuity – just like journal editorial boards do. In this context, we need to change our culture. My experience with serving on PVLDB as a reviewer this year was quite enlightening. It appears to me that people are still in the PC mode of thinking and not in journal review board mode of thinking – many still focus too much on acceptance rates and making binary decisions. Those of us who serve in these roles need to change our thought process – our job is not to reject papers, but to ensure that the good research results get to see the light of day.
  2. We should reorganize our PCs. We should have a small group of senior PC members and a larger group of reviewers. Senior PC members should be the senior members of the community. Together with multi-year service that I advocate above, this provides a means of training the junior members of the community to achieve a broader view of the field, how to evaluate novelty and originality, and how to write meaningful reviews. The senior PC members should really work as associate editors of journals and truly oversee the process, not just manage it.  Malcolm Gladwell, in his book, Outliers, talks about the “10,000 hour rule” indicating that success in any endeavor requires 10,000 hours of practice. I am not suggesting that junior members should serve 10,000 hours as reviewers, but doing this right so that it helps the science and the community takes practice and we should build it into our modus operandi. PVLDB is doing something this year along these lines – they have associate editors and reviewers, and associate editors assign papers to reviewers and then make decisions (like in a journal). I view this as a very positive step towards true journal-style reviewing.
  3. We should rethink the structure of our conferences. I have always been somewhat surprised that we allocate 20-25 minutes to presenting a paper whose text is already available but only 5 minutes for questions and discussions, all the while claiming that conferences are valuable for one-on-one discussion of research  (in addition to networking). The organization seems to me to be a one-way monologue rather than a discussion. It may help to reduce the presentation time and allocate more time for discussion of each paper (or group of papers). We should consider more extensive use of poster sessions where real discussions usually take place. To enforce the view that journal publications are no less valuable than conference publications, and to encourage direct submissions to journals, we should allocate some sessions at our conference to selected papers that have appeared in TODS in the previous year. This gives an opportunity for those papers to be discussed at the conference.

As I said earlier, my personal belief is that we will eventually shift our focus to journal publications. What I outlined above is a set of policies we can adopt to move in that direction. For an open membership organization such as SIGMOD, making major changes such as these requires full engagement of the membership. I hope we start discussing.

Blogger’s Profile:
M. Tamer Özsu is Professor of Computer Science at the David R. Cheriton School of Computer Science of the University of Waterloo. He was the Director of the Cheriton School of Computer Science from January 2007 to June 2010. His research is in data management focusing on large-scale data distribution and management of non-traditional data. His publications include the book Principles of Distributed Database Systems (with Patrick Valduriez), which is now in its third edition. He has also edited, with Ling Liu, the Encyclopedia of Database Systems. He serves as the Series Editor of Synthesis Lectures on Data Management (Morgan & Claypool) and on the editorial boards of three journals, and two book Series. He is a Fellow of the Association for Computing Machinery (ACM), and of the Institute of Electrical and Electronics Engineers (IEEE), and a member of Sigma Xi.

Big Data: It’s Not Just the Analytics

H. V. Jagadish

I was recently approached by an entrepreneur who had an interesting way to correlate short-term performance of a stock with news reports about the stock. Needless to say, there are many places from which one can get the news, and what results one gets from this sort of analysis does depend on the input news sources. Surprisingly, within two minutes the conversation had drifted from characteristics of news sources to the challenges of running SVM on Hadoop. The reason for this is not that Hadoop is the right infrastructure for this problem, but rather that the problem can legitimately be considered a Big Data problem. In consequence, in the minds of many, it must be addressed by running analytics in the cloud.

I have nothing against cloud services. In fact, I think they are an important part of the computational eco-system, permitting organizations to out-source selected aspects of their computational needs, and to provision peak capacity for load bursts. The map-reduce paradigm is a fantastic abstraction with which to handle tasks that are “embarrassingly parallelizable.” In short, there are many circumstances in which cloud services are called for. However, they are not always the solution, and are rarely the complete solution. For the stock price data analysis problem, based solely on the brief outline I’ve given you, one cannot say whether they are appropriate.

I have nothing against Support Vector Machines, or other machine learning techniques. They can be immensely useful, and I have used them myself in many situations. Scaling up these techniques for large data sets can be an issue, and certainly is a Big Data challenge. But for the problem at hand, I would be much more concerned about how it was modeled than how the model was scaled. What should the features be? Do we worry about duplicates in news appearances? Into how many categories should we classify news mentions? These are by far the more important questions to answer, because how we answer them can change what results we get: scaling better will only change how fast we get them.
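As a small illustration of why the modeling questions matter more than scale, the following Python sketch counts news mentions per stock with and without collapsing duplicates. The tickers, sources, and headlines are invented for illustration, and treating identical headlines as duplicates is only one assumed definition of "duplicate".

```python
from collections import Counter

# Hypothetical news mentions: (ticker, source, headline).
mentions = [
    ("ACME", "wire-a", "Acme beats earnings estimates"),
    ("ACME", "site-b", "Acme beats earnings estimates"),   # syndicated copy
    ("ACME", "site-c", "Acme beats earnings estimates"),   # syndicated copy
    ("BOLT", "wire-a", "Bolt announces new factory"),
    ("BOLT", "site-b", "Bolt recalls product line"),
]


def mention_counts(rows, dedupe):
    """Count mentions per ticker, optionally collapsing identical headlines."""
    if dedupe:
        distinct_rows = {(ticker, headline) for ticker, _source, headline in rows}
        return Counter(ticker for ticker, _headline in distinct_rows)
    return Counter(ticker for ticker, _source, _headline in rows)


raw = mention_counts(mentions, dedupe=False)       # ACME: 3, BOLT: 2
distinct = mention_counts(mentions, dedupe=True)   # ACME: 1, BOLT: 2
```

The same five rows make ACME the most-mentioned stock in one model and BOLT in the other. Running either version faster would not change which answer comes out.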

It is hard to avoid mention of Big Data anywhere we turn today. There is broad recognition of the value of data, and products obtained through analyzing it. Industry is abuzz with the promise of big data. Government agencies have recently announced significant programs towards addressing challenges of big data. Yet, many have a very narrow interpretation of what that means, and we lose track of the fact that there are multiple steps to the data analysis pipeline, whether the data are big or small. At each step, there is work to be done, and there are challenges with Big Data.

The first step is data acquisition. Some data sources, such as sensor networks, can produce staggering amounts of raw data. Much of this data is of no interest, and it can be filtered and compressed by orders of magnitude. One challenge is to define these filters in such a way that they do not discard useful information. For example, in considering news reports, is it enough to retain only those that mention the name of a company of interest? Do we need the full report, or just a snippet around the mentioned name? The second big challenge is to automatically generate the right metadata to describe what data is recorded and how it is recorded and measured. This metadata is likely to be crucial to downstream analysis. For example, we may need to know the source for each report if we wish to examine duplicates.
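The acquisition step can be sketched in a few lines of Python. The sketch assumes a filter that keeps only reports mentioning a company by exact name, stores a fixed-width snippet instead of the full report, and records the source and mention offset as metadata. The `Snippet` class, the `acquire` function, the window size, and the sample reports are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    company: str
    text: str     # window of characters around the mention
    source: str   # metadata: where the report came from
    offset: int   # metadata: position of the mention in the full report


def acquire(reports, company, window=40):
    """Keep only reports that mention `company`, as snippets with provenance."""
    kept = []
    for source, body in reports:
        pos = body.find(company)
        if pos == -1:
            continue  # filtered out: no mention of the company
        start = max(0, pos - window)
        end = min(len(body), pos + len(company) + window)
        kept.append(Snippet(company, body[start:end], source, pos))
    return kept


reports = [
    ("wire-a", "Shares of Acme rose after the company raised its outlook."),
    ("site-b", "Markets were flat in quiet trading on Tuesday."),
]
snippets = acquire(reports, "Acme")
```

Each choice here discards information: an exact-name filter misses reports that refer to the company indirectly, and a narrow window may cut off the context needed later. Keeping the source with every snippet is what makes a downstream duplicate check possible at all.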

Frequently, the information collected will not be in a format ready for analysis. The second step is an information extraction process that pulls out the required information from the underlying sources and expresses it in a structured form suitable for analysis. A news report will get reduced to a concrete structure, such as a set of tuples, or even a single class label, to facilitate analysis. Furthermore, we are used to thinking of Big Data as always telling us the truth, but this is actually far from reality. We have to deal with erroneous data: some news reports are inaccurate.

Data analysis is considerably more challenging than simply locating, identifying, understanding, and citing data. For effective large-scale analysis all of this has to happen in a completely automated manner. This requires differences in data structure and semantics to be expressed in forms that are computer understandable, and then “robotically” resolvable. Even for simpler analyses that depend on only one data set, there remains an important question of suitable database design. Usually, there will be many alternative ways in which to store the same information. Certain designs will have advantages over others for certain purposes, and possibly drawbacks for other purposes.

Mining requires integrated, cleaned, trustworthy, and efficiently accessible data, declarative query and mining interfaces, scalable mining algorithms, and big-data computing environments. A problem with current Big Data analysis is the lack of coordination between database systems, which host the data and provide SQL querying, with analytics packages that perform various forms of non-SQL processing, such as data mining and statistical analyses. Today’s analysts are impeded by a tedious process of exporting data from the database, performing a non-SQL process and bringing the data back.
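The export, process, and re-import loop can be seen even at toy scale. The Python sketch below uses the standard-library `sqlite3` and `statistics` modules. The table names and data are made up, and the median stands in for any analysis step that is done outside the SQL engine.

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE returns (ticker TEXT, daily_return REAL)")
conn.executemany(
    "INSERT INTO returns VALUES (?, ?)",
    [("ACME", 0.01), ("ACME", 0.03), ("ACME", -0.02), ("BOLT", 0.00), ("BOLT", 0.02)],
)

# Step 1: export the rows out of the database.
rows = conn.execute("SELECT ticker, daily_return FROM returns").fetchall()

# Step 2: run the non-SQL analysis in application code.
by_ticker = {}
for ticker, value in rows:
    by_ticker.setdefault(ticker, []).append(value)
medians = {ticker: statistics.median(values) for ticker, values in by_ticker.items()}

# Step 3: bring the results back into the database.
conn.execute("CREATE TABLE median_return (ticker TEXT PRIMARY KEY, median REAL)")
conn.executemany("INSERT INTO median_return VALUES (?, ?)", medians.items())
result = dict(conn.execute("SELECT ticker, median FROM median_return"))
```

Every row crosses the database boundary twice, and the analysis logic lives outside the query processor where it cannot be optimized together with the query. On five rows this is harmless; at Big Data scale it is the bottleneck described above.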

Having the ability to analyze Big Data is of limited value if users cannot understand the analysis. Ultimately, a decision-maker, provided with the result of analysis, has to interpret these results. Usually, this involves examining all the assumptions made and retracing the analysis. Furthermore, as we saw above, there are many possible sources of error: computer systems can have bugs, models almost always have assumptions, and results can be based on erroneous data. For all of these reasons, users will try to understand, and verify, the results produced by the computer. The computer system must make it easy for them to do so by providing supplementary information that explains how each result was derived, and based upon precisely what inputs.

In short, there is a multi-step pipeline required to extract value from data. Heterogeneity, incompleteness, scale, timeliness, privacy and process complexity give rise to challenges at all phases of the pipeline. Furthermore, this pipeline isn’t a simple linear flow – rather there are frequent loops back as downstream steps suggest changes to upstream steps. There is more than enough here that we in the database research community can work on.

To highlight this fact, several of us got together electronically last winter, and wrote a white paper, available at http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf . Please read it, and say what you think. The database community came very late to much of the web. We should make sure not to miss the boat on Big Data.

My post is loosely based on an extract from this white paper, which was created through a distributed conversation among many prominent researchers listed below.

Divyakant Agrawal, UC Santa Barbara
Philip Bernstein, Microsoft
Elisa Bertino, Purdue Univ.
Susan Davidson, Univ. of Pennsylvania
Umeshwar Dayal, HP
Michael Franklin, UC Berkeley
Johannes Gehrke, Cornell Univ.
Laura Haas, IBM
Alon Halevy, Google
Jiawei Han, UIUC
H. V. Jagadish, Univ. of Michigan (Coordinator)
Alexandros Labrinidis, Univ. of Pittsburgh
Sam Madden, MIT
Yannis Papakonstantinou, UC San Diego
Jignesh M. Patel, Univ. of Wisconsin
Raghu Ramakrishnan, Yahoo!
Kenneth Ross, Columbia Univ.
Cyrus Shahabi, Univ. of Southern California
Dan Suciu, Univ. of Washington
Shiv Vaithyanathan, IBM
Jennifer Widom, Stanford Univ.

Blogger’s Profile:
H. V. Jagadish is Bernard A. Galler Collegiate Professor of Electrical Engineering and Computer Science and Director of the Software Systems Research Laboratory at the University of Michigan, Ann Arbor. He is well-known for his broad-ranging research on information management, and particularly its use in biology, medicine, telecommunications, finance, engineering, and the web. He is an ACM Fellow and founding Editor in Chief of PVLDB. He serves on the board of the Computing Research Association.

MADlib: An Open-Source Library for Scalable Analytics


tl;dr: MADlib is an open-source library of scalable in-database algorithms for machine learning, statistics and other analytic tasks. MADlib is supported with people-power from Greenplum; researchers at Berkeley, Florida and Wisconsin are also contributing. The project recently released a MADlib TR, and is now welcoming additional community contributions.

Warehousing → Science

Back in 2008, I had the good fortune to fall in with a group of data professionals documenting new usage patterns in scalable analytics. It was an interesting team: a computational advertising analyst at a large social networking firm, a seasoned DBMS consultant formerly employed at a major Internet retailer, a pair of DBMS engine developers and an academic.

The usage patterns we were seeing represented a shift from accountancy to analytics—from the cautious record-keeping of “Data Warehousing” to the open-ended, predictive task of “Data Science”. This shift was turning many Data Warehousing tenets on their heads. Rather than “architecting” an integrated permanent record that repelled data until it was well-conditioned, the groups we observed were interested in fostering a data-centric computational “watering hole”, where analysts could bring any kind of relevant data into a shared infrastructure, and experiment with ad-hoc integration and rich algorithmic analysis at very large scales.

In response to the dry TLAs of Data Warehousing, we dubbed this usage model MAD, to reflect

  • the Magnetic aspect of a promiscuously shared infrastructure
  • the Agile design patterns used for lightweight modeling, loading and iteration on data, and
  • the Deep statistical models and algorithms being used.

We wrote the MAD Skills paper in VLDB 2009 to capture these practices in broad terms. The paper describes the usage patterns mentioned above in more detail. It also includes a fairly technical section with a number of non-trivial analytics techniques adapted from the field, implemented via simple SQL excerpts.

MADlib (MAD Skills, the SQL)

When we released the MAD Skills paper, many people were interested not only in its design aspects, but also in the promise of sophisticated statistical methods in SQL. This interest came from multiple directions: DBMS customers were requesting it of consultants and vendors, and academics were increasingly publishing papers on in-database analytics. What was missing was a software framework to harness the energy of the community, and connect the various interested constituencies.

To this end, a group formed to build MADlib, a free, open-source library of SQL-based algorithms for machine learning, statistics, and related analytic tasks. The methods in MADlib are designed both for in- and out-of-core execution, and for the shared-nothing, “scale-out” parallelism offered by modern parallel database engines, ensuring that computation is done close to the data. The core functionality is written in declarative SQL statements, which orchestrate data movement to and from disk, and across networked machines. Single-node inner loops take advantage of SQL extensibility to call out to high-performance math libraries (currently, Eigen) in user-defined scalar and aggregate functions. At the highest level, tasks that require iteration and/or structure definition are coded in Python driver routines, which are used only to kick off the data-rich computations that happen within the database engine.
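As a rough illustration of the layering described above (a user-defined aggregate doing the per-row arithmetic inside the engine, with a thin Python driver that only issues SQL), here is a minimal sketch. It is not MADlib code and does not use the MADlib API. It assumes Python's built-in sqlite3 module as a stand-in for PostgreSQL or Greenplum, and the names LinRegr, linregr, points, and fit_per_group are invented for this example.

```python
import sqlite3


class LinRegr:
    """User-defined aggregate: one-pass least-squares fit of y on x."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def step(self, x, y):
        # Called by the engine once per row; only running sums are kept.
        if x is None or y is None:
            return
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def finalize(self):
        # Closed-form slope and intercept from the accumulated sums.
        denom = self.n * self.sxx - self.sx * self.sx
        if self.n < 2 or denom == 0:
            return None
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        # SQLite aggregates return a scalar, so pack both numbers in a string.
        return f"{slope:.6f},{intercept:.6f}"


def fit_per_group(conn):
    """Driver routine: registers the aggregate and kicks off one SQL query."""
    conn.create_aggregate("linregr", 2, LinRegr)
    rows = conn.execute(
        "SELECT grp, linregr(x, y) FROM points GROUP BY grp ORDER BY grp"
    ).fetchall()
    return {
        grp: tuple(float(v) for v in coef.split(","))
        for grp, coef in rows
        if coef is not None
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (grp TEXT, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?, ?)",
    [("a", 0, 1), ("a", 1, 3), ("a", 2, 5), ("b", 0, 0), ("b", 2, -1)],
)
models = fit_per_group(conn)
print(models["a"])  # (2.0, 1.0)
```

The data never leaves the database: the driver sees only one small result row per group. In a parallel engine the same shape needs an additional merge step to combine partial sums across nodes, which this single-node sketch omits.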

The primary goal of the MADlib open-source project is to accelerate innovation and technology transfer in the Data Science community via a shared library of scalable in-database analytics, much as the CRAN library serves the R community. Unlike CRAN, which is customized to the R analytics tool, we hope that MADlib’s grounding in standard SQL can result in community ports to a variety of parallel database engines.

Open-Source Algorithms in Parallel DBMSs?

The state of scalable analytics today depends very much on who you talk to. When I talk about MADlib with academics and employees at Internet companies, they often ask why anyone would write an analytics library in SQL rather than Hadoop MapReduce. By contrast, when I talk with colleagues in enterprise software, they typically appreciate the use of SQL and mature DBMS infrastructure, but often ask why any vendor would support an open source effort like MADlib. There have been a few people, notably some collaborators at Greenplum, who share my view that the combination of SQL-compliance and open source is a natural and important catalyst for the Data Science community.

The motivation for considering parallel databases comes from both the database market and technology issues. There is a large and growing installed base of massively parallel commercial DBMSs in industry, fueled in part by a recent wave of startup acquisitions. Meanwhile, it is no surprise to database researchers that a massively parallel DBMS is a powerful platform for dataflow programming of sophisticated analytic algorithms. Research on sophisticated in-database analytics has been growing in recent years, in part as an offshoot of work on Probabilistic Databases. Education is hopefully shifting as well. For example, in my own CS186 database course this spring, the students not only wrote traditional SQL queries, they also had to implement a non-trivial social network analysis algorithm in SQL (betweenness centrality).

The open-source nature of MADlib represents a serious commitment by the entire team, and differs from the proprietary approaches traditionally associated with DBMS vendors. The decision to go open-source was motivated by a number of goals, including:

  • The benefits of customization: Statistical methods are rarely used as turnkey solutions. It’s typical for data scientists to want to modify and adapt canonical models and methods to their own purposes. Open source has major advantages in that context, and enables useful modifications to be shared back to the benefit of the entire community.
  • Closing the research-to-adoption loop: Very few traditional database customers have the capacity for significant in-house research into computing or data science. On the other hand, it is hard for academics doing computing research to understand and influence the way that analytic processes are done in the field. An open-source project like MADlib has the potential to connect these constituencies in a concrete way, to the benefit of all concerned.
  • Leveling the playing field, encouraging innovation: Many DBMS vendors offer various proprietary data mining toolkits consisting of textbook algorithms. It is hard to assess their relative merits. Meanwhile, Internet companies have been busily building machine learning code at scale for Hadoop and related platforms, but their code is not well-packaged for reuse (a fact recently confirmed for me by leaders at two major Internet companies.) The goal of MADlib is to fill this gap in the database context: offset the FUD of proprietary toolkits, bring a baseline level of algorithmic sophistication to users of database analytics, and help foster a connected community for innovation and technology transfer.

MADlib Status

MADlib is still young, at Version 0.3. The initial versions focused on establishing infrastructure and a baseline of textbook and some advanced methods; this initial suite actually covers a fair bit of ground (Table 1). Most methods were chosen because they were frequently requested by customers we met through contacts at Greenplum. More recently, we made a point of validating MADlib as a research vehicle, by fostering a small number of university groups who were working in the area to experiment with the platform and get their code disseminated. Profs. Chris Ré at Wisconsin and Daisy Wang at Florida have written up their work in a MADlib tech report that expands upon this post.

MADlib is currently ported to PostgreSQL (single-node, open-source) and Greenplum (shared-nothing parallel, commercial). Greenplum inherits the PostgreSQL extensibility interfaces almost completely, so these two ports were easy to pursue simultaneously in the early days of the project. Another attraction of Greenplum is that it offers a free download of a massively parallel DBMS for researchers, so there is no limitation on scaling experiments. (This is surprisingly unusual: most DBMS vendors still only advertise free trial downloads of “crippleware” that artificially limits database size or the number of nodes. I would imagine that market forces will change this story relatively soon.)

MADlib is hosted publicly at github, and readers are encouraged to browse the code and documentation via the MADlib website. The initial MADlib codebase reflects contributions from both industry (a team at Greenplum) and academia (Berkeley, Wisconsin, Florida). Project oversight and Quality Assurance efforts have been contributed by Greenplum. Our MADlib TR expands on the architecture and status, and also includes extensive discussion of related work.

Pitch in!

At this time, MADlib is ready to consider contributions from additional parties, including both new methods and ports to new platforms. As with any serious open-source project, contributions will have to be managed carefully to maintain code quality. I hope that more researchers will find it worthwhile to contribute serious code to the MADlib effort. It’s a bit more work than getting an algorithm ready to run experiments in a paper, but it’s really satisfying to develop and refine production-quality open-source code, and get it delivered to end-users. If you are doing research on scalable analytic methods, consider going the extra mile and contributing your code to the MADlib effort.

For more information on MADlib, please see the website at http://madlib.net.

Thanks to Chris Ré, Florian Schoppmann and Daisy Wang for their help writing up the recent MADlib TR that this post excerpts, and to Azza Abouzied, Peter Bailis, and Neil Conway for feedback on this version.

Blogger’s Profile:
Joseph M. Hellerstein is a Chancellor’s Professor of Computer Science at the University of California, Berkeley, whose work focuses on data-centric systems and the way they drive computing. He is an ACM Fellow, an Alfred P. Sloan Research Fellow and the recipient of two ACM-SIGMOD “Test of Time” awards for his research. In 2010, Fortune Magazine included him in their list of 50 smartest people in technology, and MIT’s Technology Review magazine included his Bloom language for cloud computing on their TR10 list of the 10 technologies “most likely to change our world”. A past research lab director for Intel, Hellerstein maintains an active role in the high tech industry, currently serving on the technical advisory boards of a number of computing and Internet companies including EMC, SurveyMonkey, Platfora and Captricity.

From 100 Students to 100,000

Typically I teach around 100 students per year in my introductory database course. This past fall my enrollment was a whopping 60,000. Admittedly, only 25,000 of them chose to submit assignments, and a mere 6500 achieved a strong final score. But even with 6500 students, I more than quadrupled the total number of students I’ve taught in my entire 18-year academic career.

The story begins a couple of years earlier, when Stanford computer science faculty started thinking about shaking up the way we teach. We were tired of delivering the same lectures year after year, often to a half-empty classroom because our classes were being videotaped. (The primary purpose of the videotaping is for Stanford’s Center for Professional Development, but the biggest effect is that many Stanford students skip lectures and watch them later online.) Why not “purpose-build” better videos: shorter, topic-specific segments, punctuated with in-video quizzes to let watchers check their understanding? Then class time could be made more enticing for students and instructor alike, with interactive activities, advanced or exotic topics, and guest speakers. This “flipped classroom” idea was evangelized in the Stanford C.S. department by Daphne Koller; I was one of the early adopters, creating my videos during the first few months of 2011. Recording was a low-tech affair, involving a computer, Cintiq tablet, cheap webcam and microphone, Camtasia software, and a teaching assistant to help with editing.

I put my videos online for the public, and soon realized that with a little extra work, I could make available what amounted to an entire course. With further help from the teaching assistant, I added slides (annotated as lecture notes, and unannotated for teaching use by others), demo scripts, pointers to textbook readings and other course materials, a comprehensive suite of written and programming exercises, and quick-guides for relevant software. The site got a reasonable amount of traffic, but the turning point came when my colleague Sebastian Thrun decided to open up his fall 2011 introductory artificial intelligence course to the world. After one email announcement promising a free online version of the Stanford AI course, including automatically-graded weekly assignments and a “statement of accomplishment” upon completion, Sebastian’s public course garnered tens of thousands of sign-ups within a week.

Having already prepared lots of materials, I jumped on the free-to-the-world bandwagon, as did my colleague Andrew Ng with his machine learning course. What transpired over the next ten weeks was one of the most rewarding things I’ve done in my life. The sign-ups poured in, and soon the “Q&A Forum” was buzzing with activity. The fact that I had a lot of materials ready before the course started turned out to be a bit deceptive—for ten weeks I worked nearly full-time on the course (never mind my other job as department chair, much less my research program), in part because there was a lot to do, but mostly because there was a lot I could do to make it even better, and I was having a grand time.

In addition to the video lectures, in-video quizzes, course materials, and self-guided exercises, I added two very popular components: quizzes that generate different combinations of correct and incorrect answers each time they’re launched (using technology pioneered a decade ago by my colleague Jeff Ullman in his Gradiance system), and interactive workbenches for topics ranging from XML DTD validation to view-update triggers. I offered midterm and final exams—multiple-choice, and crafted carefully so the problems weren’t solvable by running queries or checking Wikipedia. (Creating these exams, at just the right level, turned out to be one of the most challenging tasks of the entire endeavor.) To add a personal touch, and to amplify the strong sense of community that quickly welled up through the Q&A Forum, each week I posted a “screenside chat” video—modeled after Franklin D. Roosevelt’s fireside chats—covering topics ranging from logistical issues, to technical clarifications, to full-on cheerleading for those who were struggling.
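For readers curious how a quiz can present different combinations of correct and incorrect answers on each launch, here is a minimal sketch of the general idea. It is not the Gradiance implementation or the course's actual code: the function make_variant, the answer pools, and the sample question are all invented for illustration.

```python
import random


def make_variant(question, correct_pool, incorrect_pool, n_choices=4, rng=None):
    """Build one multiple-choice variant: one correct answer plus distractors."""
    rng = rng or random.Random()
    correct = rng.choice(correct_pool)
    distractors = rng.sample(incorrect_pool, n_choices - 1)
    choices = [correct] + distractors
    rng.shuffle(choices)
    return {
        "question": question,
        "choices": choices,
        "answer": choices.index(correct),  # position of the correct choice
    }


# Hypothetical question with several right and several wrong answers.
QUESTION = "Which of the following is a SQL aggregate function?"
CORRECT = ["COUNT", "SUM", "AVG", "MIN"]
INCORRECT = ["JOIN", "WHERE", "DISTINCT", "HAVING", "UNION"]

variant = make_variant(QUESTION, CORRECT, INCORRECT, rng=random.Random(2011))
print(variant["question"])
for i, choice in enumerate(variant["choices"]):
    print(f"  {i}. {choice}")
```

Because each launch draws one item from the correct pool and the rest from the incorrect pool, a student who retakes the quiz has to understand the concept rather than memorize a letter.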

Meanwhile back on the campus front, the Stanford students worked through exactly the same materials as the public students (except for the multiple-choice exams), but they did get something more for their money: hand-graded written problems with more depth than the automated exercises, a significant programming project, traditional written exams, and classroom activities ranging from interactive problem-solving to presentations by data architects at Facebook and Twitter. There’s no question that the Stanford students were satisfied: I’ve taught the course enough times to know that the uptick in my teaching ratings was statistically significant.

One interesting and surprisingly large effect of having 60,000 students is the need for absolute perfection: not one tiny flaw or ambiguity goes unnoticed. And when there’s a downright mistake, especially in, say, an exam question … well, I shudder to remember. The task of correcting small (and larger) errors and ambiguities in videos, quizzes, exercises, and other materials was a continuing chore, but certainly instructive.

What kept me most engaged throughout the course was the attitude of the public students, conveyed primarily through emails and posts on the Q&A Forum. They were unabashedly, genuinely, deeply appreciative. Many said the course was a gift they could scarcely believe had come their way. As the course came to a close, several students admitted to shedding tears. One posted a heartfelt poem. A particularly noteworthy student named Amy became an absolute folk hero: Over the duration of the course Amy answered almost 900 posted questions. Regardless of whether the questions were silly or naive, complex or deep, her answers were patient, correct, of just the right length, included examples as appropriate, and were crafted in perfect English. Amy never revealed anything about herself (although she agreed to visit me after the course was over), despite hundreds of adoring public thank-you’s from her classmates, and one marriage proposal!

So who were these thousands and thousands of students? I ran a survey that revealed some interesting statistics. For example, although ages and occupations spanned the gamut, the largest contingent of students were software professionals wanting to sharpen their job skills. Many students commented that they’d been programming with databases for years without really knowing what they were doing. Males outnumbered females four to one, which is actually a little better than the ratio among U.S. college computer science majors. Students hailed from 130 countries; the U.S. had the highest number by a wide margin, followed by India and Russia. (China unfortunately blocked some of the content, although a few enterprising students helped each other out with workarounds.) On db-class.org you can find the full survey results via the FAQ page, as well as some participation and performance statistics.

Were there any negatives to the experience? Naturally there were a few complainers. For example, in my screenside chats I often referred to the “eager beavers” who were working well ahead of the schedule, and the “procrastinators” who were barely meeting deadlines. Most students enjoyed self-identifying into the categories (some eager-beavers even planned to make T-shirts), but a few procrastinators objected to the term, pointing out that they were squeezing the course between a full-time job or two and significant family obligations. A number of students were disappointed by the low-tech, non-Stanford-endorsed “statement of accomplishment” they received at the end; despite ample warnings from the start, apparently some students were still expecting official certification. I can’t help but wonder if some of those students were the same ones who cheated; I did appear to have quite a number of secondary accounts created expressly for achieving a perfect score. I made it clear from the start that I was assuming students were in it to learn, and cheating was not something I planned to prevent or even think about. Of course in the long run of online education, the interrelated topics of certification and cheating will need to be addressed.

So what happens next? Stanford is launching quite a few more courses in the same style, and I’ll offer mine again next fall. MIT has jumped on the bandwagon; other universities can’t be far behind. Independent enterprises such as the pioneering Khan Academy, and the recently-announced Udacity, are sure to play into the scene. There’s no doubt we’re at a major inflection point in higher education, both on campus and through internet distribution to the world. I’m thrilled to have been an early part of it.

Meanwhile here are a few more numbers: A few months after the initial launch we now have over 100,000 accounts, and we’ve accumulated millions of video views. Even with the course in a self-serve dormant state, each day there are a couple of thousand video views and around 100 assignments submitted for automated grading. All to learn about databases! Wow. Check it out at db-class.org.

Blogger’s Profile: Jennifer Widom is the Fletcher Jones Professor and (currently) Chair of the Computer Science Department at Stanford University. She was a Research Staff Member at the IBM Almaden Research Center before joining the Stanford faculty in 1993. Her research interests span many aspects of nontraditional data management. She is an ACM Fellow and a member of the National Academy of Engineering and the American Academy of Arts & Sciences; she received the ACM SIGMOD Edgar F. Codd Innovations Award in 2007.


In the era of blogs and social networks, ACM SIGMOD gets social!

The ACM SIGMOD Blog is the official blog site for ACM SIGMOD. This blog aims at catching the heartbeat of our community on exciting and controversial topics that are of interest to the community, and at facilitating intelligent discussions among researchers on such topics. Its purpose is to be both interesting and fun.

The Blog will periodically host one featured blogger to share his/her view on a matter of interest. People can participate by leaving comments and opinions. In this way, the ensuing discussion can take a form that can hopefully lead to an interesting conclusion.

Who can be a featured blogger:

Anyone in our community! If you are passionate about a topic and would like to write a few paragraphs staking out your position, please contact Georgia Koutrika (by sending an email to sigmodblog [at] acm.org). We also plan to invite people to blog on selected topics.

Participating in a discussion:

You are welcome to participate in the discussion following the current blog post. When you do so, we encourage you to identify yourself by providing your name, and optionally your institution. In this way, everyone can get acknowledged, and discussions can be open, interesting and constructive.

Living in the era of social networking sites, we also offer the option for you to leave comments by signing up using your Facebook account. In this way, people will be able to see your photo and your name next to your comments! Comments will be moderated to ensure a healthy and fun discussion environment.