{"id":344,"date":"2012-04-06T18:24:34","date_gmt":"2012-04-06T18:24:34","guid":{"rendered":"http:\/\/wp.sigmod.org\/?p=344"},"modified":"2020-03-24T17:58:27","modified_gmt":"2020-03-24T17:58:27","slug":"madlib-an-open-source-library-for-scalable-analytics","status":"publish","type":"post","link":"https:\/\/wp.sigmod.org\/?p=344","title":{"rendered":"MADlib: An Open-Source Library for Scalable Analytics"},"content":{"rendered":"\n<p>\n\t<em>tl;dr: <a href=\"http:\/\/madlib.net\" target=\"_blank\" rel=\"noopener noreferrer\"><strong>MADlib<\/strong><\/a> is an open-source library of scalable in-database algorithms for machine learning, statistics and other analytic tasks. MADlib is supported with people-power from Greenplum; researchers at Berkeley, Florida and Wisconsin are also contributing. The project recently released a <a href=\"http:\/\/www.eecs.berkeley.edu\/Pubs\/TechRpts\/2012\/EECS-2012-38.html\" target=\"_blank\" rel=\"noopener noreferrer\">MADlib TR<\/a>, and is now welcoming additional community contributions.<\/em>\n<\/p>\n<h3>Warehousing &rarr; Science<\/h3>\n<p>\n\tBack in 2008, I had the good fortune to fall in with a group of data professionals documenting new usage patterns in scalable analytics. It was an interesting team: a computational advertising analyst at a large social networking firm, a seasoned DBMS consultant formerly employed at a major Internet retailer, a pair of DBMS engine developers and an academic.\n<\/p>\n<p>\n\tThe usage patterns we were seeing represented a shift from accountancy to analytics\u2014from the cautious record-keeping of &#8220;Data Warehousing&#8221; to the open-ended, predictive task of &#8220;Data Science&#8221;. This shift was turning many Data Warehousing tenets on their heads. 
Rather than &#8220;<a href=\"http:\/\/www.dwinfocenter.org\/architect.html\" target=\"_blank\" rel=\"noopener noreferrer\">architecting<\/a>&#8221; an integrated permanent record that repelled data until it was well-conditioned, the groups we observed were interested in fostering a data-centric computational &#8220;watering hole&#8221;, where analysts could bring any kind of relevant data into a shared infrastructure, and experiment with ad-hoc integration and rich algorithmic analysis at very large scales.\n<\/p>\n<p>\n\tIn response to the dry <a href=\"http:\/\/en.wikipedia.org\/wiki\/Three-letter_acronym\" target=\"_blank\" rel=\"noopener noreferrer\">TLA<\/a>s of Data Warehousing, we dubbed this usage model <strong><i>MAD<\/i><\/strong>, to reflect\n<\/p>\n<p>the <em><strong>M<\/strong>agnetic<\/em> aspect of a promiscuously shared infrastructure<\/p>\n<p>the <em><strong>A<\/strong>gile<\/em> design patterns used for lightweight modeling, loading and iteration on data, and<\/p>\n<p>the <em><strong>D<\/strong>eep<\/em> statistical models and algorithms being used.<\/p>\n<p>\n\tWe wrote the <a href=\"http:\/\/www.vldb.org\/pvldb\/2\/vldb09-219.pdf\" target=\"_blank\" rel=\"noopener noreferrer\"><em>MAD Skills<\/em><\/a>\tpaper in VLDB 2009 to capture these practices in broad terms.  The paper describes the usage patterns mentioned above in more detail.  It also includes a fairly technical section with a number of non-trivial analytics techniques adapted from the field, implemented via simple SQL excerpts.\n<\/p>\n<h3>MADlib (MAD Skills, the SQL)<\/h3>\n<p>\n\tWhen we released the MAD Skills paper, many people were interested not only in its design aspects, but also in the promise of sophisticated statistical methods in SQL. This interest came from multiple directions: DBMS customers were requesting it of consultants and vendors, and academics were increasingly publishing papers on in-database analytics. 
What was missing was a software framework to harness the energy of the community, and connect the various interested constituencies.\n<\/p>\n<p>\n\tTo this end, a group formed to build <a href=\"http:\/\/madlib.net\" target=\"_blank\" rel=\"noopener noreferrer\"><strong>MADlib<\/strong><\/a>, a free, open-source library of SQL-based algorithms for machine learning, statistics, and related analytic tasks. The methods in MADlib are designed both for in- and out-of-core execution, and for the shared-nothing, &#8220;scale-out&#8221; parallelism offered by modern parallel database engines, ensuring that computation is done close to the data. The core functionality is written in declarative SQL statements, which orchestrate data movement to and from disk, and across networked machines. Single-node inner loops take advantage of SQL extensibility to call out to high-performance math libraries (currently, <a href=\"http:\/\/eigen.tuxfamily.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Eigen<\/a>) in user-defined scalar and aggregate functions. At the highest level, tasks that require iteration and\/or structure definition are coded in Python driver routines, which are used only to kick off the data-rich computations that happen within the database engine.\n<\/p>\n<p>\n\tThe primary goal of the MADlib open-source project is to accelerate innovation and technology transfer in the Data Science community via a shared library of scalable in-database analytics, much as the <a href=\"http:\/\/cran.r-project.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">CRAN library<\/a> serves the R community. 
Unlike CRAN, which is customized to the R analytics tool, we hope that MADlib&#8217;s grounding in standard SQL can result in community ports to a variety of parallel database engines.\n<\/p>\n<h3>Open-Source Algorithms in Parallel DBMSs?<\/h3>\n<p>\nThe state of scalable analytics today depends very much on who you talk to.<br \/>\nWhen I talk about MADlib with academics and employees at Internet companies, they often ask why anyone would write an analytics library in SQL rather than <a href=\"http:\/\/hadoop.apache.org\" target=\"_blank\" rel=\"noopener noreferrer\">Hadoop<\/a> MapReduce.  By contrast, when I talk with colleagues in enterprise software, they typically appreciate the use of SQL and mature DBMS infrastructure, but often ask why any vendor would support an open source effort like MADlib.  There have been a few people\u2014notably some collaborators at Greenplum\u2014who share my view that the combination of SQL-compliance and open source is a natural and important catalyst for the Data Science community.\n<\/p>\n<p>\nThe motivation for considering parallel databases comes from both the database market and technology issues.\tThere is a large and growing installed base of massively parallel commercial DBMSs in industry, fueled in part by a recent wave of startup acquisitions. Meanwhile, it is no surprise to database researchers that a massively parallel DBMS is a powerful platform for dataflow programming of sophisticated analytic algorithms. Research on sophisticated in-database analytics has been growing in recent years, in part as an offshoot of work on Probabilistic Databases. Education is hopefully shifting as well. 
For example, in my own <a href=\"https:\/\/sites.google.com\/a\/cs.berkeley.edu\/cs186-s12\/\" target=\"_blank\" rel=\"noopener noreferrer\">CS186<\/a> database course this spring, the students not only wrote traditional SQL queries, they also had to implement a non-trivial social network analysis algorithm in SQL (<a href=\"http:\/\/github.com\/cs186\/sp12\/blob\/master\/hw2\/README.md\" target=\"_blank\" rel=\"noopener noreferrer\">betweenness centrality<\/a>).\n<\/p>\n<p>\n\tThe open-source nature of MADlib represents a serious commitment by the entire team, and differs from the proprietary approaches traditionally associated with DBMS vendors. The decision to go open-source was motivated by a number of goals, including:\n<\/p>\n<p><strong>The benefits of customization<\/strong>: Statistical methods are rarely used as turnkey solutions. It&#8217;s typical for data scientists to want to modify and adapt canonical models and methods to their own purposes. Open source has major advantages in that context, and enables useful modifications to be shared back to the benefit of the entire community.<\/p>\n<p><strong>Closing the research-to-adoption loop<\/strong>: Very few traditional database customers have the capacity for significant in-house research into computing or data science. On the other hand, it is hard for academics doing computing research to understand and influence the way that analytic processes are done in the field. An open-source project like MADlib has the potential to connect these constituencies in a concrete way, to the benefit of all concerned.<\/p>\n<p><strong>Leveling the playing field, encouraging innovation<\/strong>: Many DBMS vendors offer various proprietary data mining toolkits consisting of textbook algorithms. It is hard to assess their relative merits. 
Meanwhile, Internet companies have been busily building machine learning code at scale for Hadoop and related platforms, but their code is not well-packaged for reuse (a fact recently confirmed for me by leaders at two major Internet companies). The goal of MADlib is to fill this gap in the database context: offset the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Fear,_uncertainty_and_doubt\" target=\"_blank\" rel=\"noopener noreferrer\">FUD<\/a> of proprietary toolkits, bring a baseline level of algorithmic sophistication to users of database analytics, and help foster a connected community for innovation and technology transfer.<\/p>\n<h3>MADlib Status<\/h3>\n<p>\n\tMADlib is still young, at Version 0.3. The initial versions focused on establishing infrastructure and a baseline of textbook and some advanced methods; this initial suite actually covers a fair bit of ground (<a href=\"#table1\">Table 1<\/a>). Most methods were chosen because they were frequently requested by customers we met through contacts at Greenplum. More recently, we made a point of validating MADlib as a research vehicle, by encouraging a small number of university groups working in the area to experiment with the platform and get their code disseminated. Profs. Chris R\u00e9 at Wisconsin and Daisy Wang at Florida have written up their work in a <a href=\"http:\/\/www.eecs.berkeley.edu\/Pubs\/TechRpts\/2012\/EECS-2012-38.html\">MADlib tech report<\/a> that expands upon this post.\n<\/p>\n<p>\n\t<a name=\"table1\"><img decoding=\"async\" src=\"http:\/\/wp.sigmod.org\/wp-content\/uploads\/2012\/04\/methods.png\" width='350'><\/a>\n<\/p>\n<p>\n\tMADlib is currently ported to PostgreSQL (single-node, open-source) and Greenplum (shared-nothing parallel, commercial). Greenplum inherits the PostgreSQL extensibility interfaces almost completely, so these two ports were easy to pursue simultaneously in the early days of the project. 
Another attraction of Greenplum is that it offers a free download of a massively parallel DBMS for researchers, so there is no limitation on scaling experiments. (This is surprisingly unusual: most DBMS vendors still only advertise free trial downloads of &#8220;crippleware&#8221; that artificially limits database size or the number of nodes. I would imagine that market forces will change this story relatively soon.)\n<\/p>\n<p>\n\tMADlib is hosted publicly at <a href=\"http:\/\/github.com\/madlib\/madlib\" target=\"_blank\" rel=\"noopener noreferrer\">github<\/a>, and readers are encouraged to browse the code and documentation via the <a href=\"http:\/\/madlib.net\" target=\"_blank\" rel=\"noopener noreferrer\"> MADlib website<\/a>. The initial MADlib codebase reflects contributions from both industry (a team at Greenplum) and academia (Berkeley, Wisconsin, Florida). Project oversight and Quality Assurance efforts have been contributed by Greenplum. Our <a href=\"http:\/\/www.eecs.berkeley.edu\/Pubs\/TechRpts\/2012\/EECS-2012-38.html\" target=\"_blank\" rel=\"noopener noreferrer\">MADlib TR<\/a> expands on the architecture and status, and also includes extensive discussion of related work.\n<\/p>\n<h3>Pitch in!<\/h3>\n<p>\n\tAt this time, MADlib is ready to consider contributions from additional parties, including both new methods and ports to new platforms. Like any serious open-source project, contributions will have to be managed carefully to maintain code quality. I hope that more researchers will find it worthwhile to contribute serious code to the MADlib effort. It&#8217;s a bit more work than getting an algorithm ready to run experiments in a paper, but it&#8217;s really satisfying to develop and refine production-quality open-source code, and get it delivered to end-users. 
If you are doing research on scalable analytic methods, consider going the extra mile and contributing your code to the MADlib effort.\n<\/p>\n<p>\n\tFor more information on MADlib, please see the website at <a href=\"http:\/\/madlib.net\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/madlib.net<\/a>.\n<\/p>\n<p>\n\t<em>Thanks to Chris R\u00e9, Florian Schoppmann and Daisy Wang for their help writing up the recent <a href=\"http:\/\/www.eecs.berkeley.edu\/Pubs\/TechRpts\/2012\/EECS-2012-38.html\" target=\"_blank\" rel=\"noopener noreferrer\">MADlib TR<\/a> that this post excerpts, and to Azza Abouzied, Peter Bailis, and Neil Conway for feedback on this version.<\/em>\n<\/p>\n<p><\/span><\/p>\n<h4> Blogger&#8217;s Profile: <\/h4>\n<p><a href=\"http:\/\/db.cs.berkeley.edu\/jmh\" target=\"_blank\" rel=\"noopener noreferrer\"> Joseph M. Hellerstein <\/a> is a Chancellor&#8217;s Professor of Computer Science at the University of California, Berkeley, whose work focuses on data-centric systems and the way they drive computing. He is an <a href=\"http:\/\/fellows.acm.org\/fellow_citation.cfm?id=4354833&#038;srt=year&#038;year=2009\" target=\"_blank\" rel=\"noopener noreferrer\"> ACM Fellow, <\/a> an <a href=\"http:\/\/www.sloan.org\/fellowships\" target=\"_blank\" rel=\"noopener noreferrer\"> Alfred P. Sloan Research Fellow <\/a> and the recipient of two <a href=\"http:\/\/www.sigmod.org\/sigmod-awards\/sigmod-awards#time\" target=\"_blank\" rel=\"noopener noreferrer\">  ACM-SIGMOD &#8220;Test of Time&#8221; <\/a> awards for his research. 
In 2010, Fortune Magazine included him in their list of 50 <a href=\"http:\/\/money.cnn.com\/galleries\/2010\/technology\/1007\/gallery.smartest_people_tech.fortune\/27.html\" target=\"_blank\" rel=\"noopener noreferrer\"> smartest people in technology <\/a>, and MIT&#8217;s Technology Review magazine included his <a href=\"http:\/\/bloom-lang.org\" target=\"_blank\" rel=\"noopener noreferrer\">  Bloom language <\/a> for cloud computing on their <a href=\"http:\/\/www.technologyreview.com\/computing\/25089\/\" target=\"_blank\" rel=\"noopener noreferrer\"> TR10 <\/a> list of the 10 technologies &#8220;most likely to change our world&#8221;. A past research lab director for Intel, Hellerstein maintains an active role in the high tech industry, currently serving on the technical advisory boards of a number of computing and Internet companies including <a href=\"http:\/\/www.emc.com\" target=\"_blank\" rel=\"noopener noreferrer\"> EMC<\/a>, <a href=\"http:\/\/www.surveymonkey.com\" target=\"_blank\" rel=\"noopener noreferrer\">  SurveyMonkey<\/a>, <a href=\"http:\/\/www.platfora.com\" target=\"_blank\" rel=\"noopener noreferrer\">  Platfora<\/a> and <a href=\"http:\/\/www.captricity.com\" target=\"_blank\" rel=\"noopener noreferrer\">  Captricity<\/a>.<\/p>\n<div>\n<p> Copyright @ 2012,  Joseph M. Hellerstein, All rights reserved.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>tl;dr: MADlib is an open-source library of scalable in-database algorithms for machine learning, statistics and other analytic tasks. MADlib is supported with people-power from Greenplum; researchers at Berkeley, Florida and Wisconsin are also contributing. The project recently released a MADlib TR, and is now welcoming additional community contributions. 
Warehousing &rarr; Science Back in 2008, I [&hellip;]<\/p>\n","protected":false},"author":7,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[10],"tags":[],"coauthors":[94],"class_list":["post-344","post","type-post","status-publish","format-standard","hentry","category-analytics"],"views":2251,"_links":{"self":[{"href":"https:\/\/wp.sigmod.org\/index.php?rest_route=\/wp\/v2\/posts\/344","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wp.sigmod.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wp.sigmod.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wp.sigmod.org\/index.php?rest_route=\/wp\/v2\/users\/7"}],"replies":[{"embeddable":true,"href":"https:\/\/wp.sigmod.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=344"}],"version-history":[{"count":91,"href":"https:\/\/wp.sigmod.org\/index.php?rest_route=\/wp\/v2\/posts\/344\/revisions"}],"predecessor-version":[{"id":3092,"href":"https:\/\/wp.sigmod.org\/index.php?rest_route=\/wp\/v2\/posts\/344\/revisions\/3092"}],"wp:attachment":[{"href":"https:\/\/wp.sigmod.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=344"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wp.sigmod.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=344"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wp.sigmod.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=344"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/wp.sigmod.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcoauthors&post=344"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}