We keep designing new data systems over and over again. We ask two questions in this post: 1) Is this sustainable in the long term? 2) Can we make this process faster?
The need for new data system designs
“Big data” may be mostly a marketing term rather than a research one, but it is starting to have a tremendous impact on what we do research on. What the big data era has brought, practically, is that more and more individuals and organizations realize the value of gathering, storing and analyzing data. The net effect is that the number and kinds of applications we need to support grow continuously. At the same time, hardware keeps evolving at a strong pace, with new technologies arriving every few years. New data types, new models, a varying balance of incoming data vs. incoming queries, as well as new hardware properties all require a different system design to achieve great performance.
We are quickly heading towards a future where numerous system designs will be necessary (sometimes even for a single application) and the need to spin off new designs will keep increasing with time as more domains embrace the value of doing business, science and everyday life management in a data-driven way. The question is whether we can keep creating and maintaining new data system designs at a pace that allows us to sustain this growth.
Problem 1: Data system design is a lengthy and complex process
Many would say that designing a data system is something of an art. It is, in practice, about balancing numerous tradeoffs, and the best design can easily swing if even one of the parameters changes.
Some of the fundamental design points include decisions on how to physically store and access data, and designing the data structures, algorithms and data flow models that make data ingestion and query processing efficient. Each decision has side effects that are not always easy to foresee, given the complexity of multiple interconnected system components. For example, the way we physically store data defines the optimal way to access or update it. At the same time, every design choice needs to be tailored to the underlying hardware properties.
Today design still happens manually and requires the collective expertise, talent and intuition of the whole db community to move forward. As a result, it is an expensive and lengthy process; a complete new design takes multiple years to get right.
As a recent example, consider column-stores. The very first papers on the alternative organization of data as columns appeared in the 80’s. The first ideas on how to efficiently store and access columns appeared in the 90’s, while the first fairly complete designs appeared another 10-15 years later. Even today there are numerous open topics, and the community keeps pushing, with exciting results year after year. This is perhaps an oversimplified summary of column-store evolution, but it hints that new designs do not “just happen”, even when the basic ideas have strong potential.
Problem 2: How much of it is really new?
For a data system designer there is nothing more fun than playing around with ideas, designs and implementations of data storage and access models. Trying to eliminate the last few cache misses, trying to push CPU utilization to the maximum, trying to get a speed-up linear in the number of cores will always be thrilling. But how much of that really consists of new concepts and drastically new solutions compared to past research and system designs?
For example, how many fundamentally new concepts can we find if we carefully examine all the data structures and algorithms optimized for different levels of the memory hierarchy as hardware evolves over the years? When the balance between CPU and I/O shifts, we favor solutions that optimize for whichever is currently the main bottleneck, e.g., these days by performing more computation in order to minimize I/O. Similarly, when the balance between read and write performance shifts due to new hardware properties, we adjust our algorithms around the new characteristics, e.g., by batching data accordingly for writes or by favoring sequential access even if that means reading more data in some cases. When designing a new data structure, we always balance among the fundamental read, write and space amplification tradeoffs depending on the exact application requirements and hardware environment.
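To make the amplification tradeoff concrete, here is a back-of-the-envelope sketch. The cost model and all the numbers are illustrative assumptions, not measurements of any particular system: it contrasts rough read and write amplification estimates for a B-tree and a leveled LSM-tree.

```python
import math

def btree_amplification(num_entries, fanout, page_entries):
    """A point lookup reads one page per level; an update rewrites a page."""
    height = math.ceil(math.log(num_entries / page_entries, fanout))
    read_amp = height + 1        # internal levels plus the leaf page
    write_amp = page_entries     # a one-entry update rewrites its whole page
    return read_amp, write_amp

def lsm_amplification(num_levels, size_ratio):
    """Leveled LSM-tree: each entry is rewritten ~size_ratio times per level."""
    read_amp = num_levels        # worst case: probe one run per level
    write_amp = num_levels * size_ratio
    return read_amp, write_amp

r_bt, w_bt = btree_amplification(num_entries=10**8, fanout=500, page_entries=100)
r_lsm, w_lsm = lsm_amplification(num_levels=5, size_ratio=10)
# The LSM design trades extra read amplification for much lower write amplification.
```

Under these toy parameters the LSM design pays one extra page read per lookup in exchange for halving write amplification; shifting the hardware balance between reads and writes shifts which design wins.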
New solutions certainly pop up all the time, but at the same time much of what we do seems to be variations of past concepts, adapted to a new environment by manually trying to find the right balance.
Making it easy to design new data systems
Navigating the data system design space is an inherently hard process that requires a lot of time and expertise. Both are scarce resources that are unlikely to scale at the same pace as the demands of a data-driven world. The question we ask here is whether we can automate part of the design process by leveraging the fact that a significant part of it relies on repeating concepts and tradeoffs.
If we could navigate the design space faster by automating part of the process, then a) designing new systems would be faster and b) part of the process may require significantly less expertise. Both of these benefits have a number of positive side effects, mainly towards making efficient data systems accessible and affordable much more broadly than today.
The goal is to automate as many parts of the process of data system design as possible. When we design manually, the process consists of iterations of two steps: 1) hypothesizing that a solution X is an interesting one, and 2) testing solution X. Step 1 requires creativity, intuition and experience, while Step 2 additionally requires modeling and engineering skills. Together with a talented PhD student at Harvard DASlab, Lukas Maas, we are pursuing a number of ideas on how to automate these two steps. One approach is to create a pool of existing design options per system component and automatically test multiple combinations of component designs until a “good” overall design is found. Iterating over multiple designs in an *evolutionary* way allows us to test numerous design combinations, even ones we might otherwise have missed because it is not easy to see that they are good candidates. For example, regarding physical data layouts, the system may iterate over known data organizations, such as combinations of column-groups, array- and tree-based storage options, as well as versions of those tuned in multiple different ways. It can then pick the one that performs best for the given data, queries and hardware. Several challenges arise, such as describing components and their variations in an abstract way, efficiently testing numerous solutions concurrently, allowing for the injection of new design options, finding good starting points and heuristics to speed up the discovery of a good design, avoiding local minima, and many more.
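As a rough illustration of such an evolutionary search, the following self-contained sketch (the component names, design options and cost model are all hypothetical) iterates over combinations of per-component design options, scores each candidate with a mock cost model, and keeps the best survivors. In a real system the cost function would run or simulate the workload against the candidate design.

```python
import random

# Pool of design options per system component (illustrative only).
DESIGN_SPACE = {
    "layout":    ["row", "column", "column-groups"],
    "index":     ["none", "btree", "hash"],
    "buffering": ["small", "large"],
}

def cost(design, workload):
    """Mock cost model: lower is better."""
    c = 0.0
    c += {"row": 1.0, "column": 0.4, "column-groups": 0.5}[design["layout"]] * workload["scan"]
    c += {"none": 1.0, "btree": 0.2, "hash": 0.1}[design["index"]] * workload["lookup"]
    c += {"small": 0.3, "large": 0.1}[design["buffering"]] * workload["insert"]
    return c

def evolve(workload, generations=20, pop_size=6, seed=0):
    rng = random.Random(seed)
    keys = list(DESIGN_SPACE)
    # Start from a random population of complete designs.
    pop = [{k: rng.choice(DESIGN_SPACE[k]) for k in keys} for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda d: cost(d, workload))
        survivors = pop[: pop_size // 2]        # selection: keep the cheapest half
        children = []
        for parent in survivors:                # mutation: tweak one component
            child = dict(parent)
            k = rng.choice(keys)
            child[k] = rng.choice(DESIGN_SPACE[k])
            children.append(child)
        pop = survivors + children
    return min(pop, key=lambda d: cost(d, workload))

# A scan-heavy workload should steer the search toward a columnar layout.
best = evolve({"scan": 10, "lookup": 1, "insert": 1})
```

The same loop can accommodate injection of new design options by simply extending `DESIGN_SPACE`, which is one reason an abstract description of components matters.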
The ideal result of such a research path would be a black box that we feed with data, queries and hardware specifications and that spits out a system design and implementation. This may or may not be achievable in its entirety. However, even a more realistic intermediate state, where only part of the process is automated and a group of expert system designers stays in the loop, would have significant impact in shortening the process of system design to match new applications and hardware.
What can we learn from past research?
A lot. Making a complex process automatic and abstract seems to be a natural step in the evolution of computer science problems. The database community has a long history of such breakthroughs, starting of course with the relational model, embracing declarative approaches and generating numerous adaptive techniques. Making it easy to design data systems requires utilizing everything we know about data systems, and there is already research that could potentially be used as is.
For example, there is work on creating extensible database systems that consist of independent components that can be synthesized; having some form of composability and modularity is required in order to automatically synthesize alternative design solutions. In addition, there is work on declaratively defining storage layouts or even complete systems; such functionality allows us to easily program and extend these systems. Moreover, there is work on adaptive data systems that can automatically adjust base data, indexing structures, or even autonomously decide which data to load based on incoming data and queries; work on adaptive systems is important because it reduces the number of design choices needed. And of course there is work on auto-tuning advisors that help database administrators with the analysis of workloads, proposing good configurations of indexes and views; work along these lines is important as we can envision similar “design advisors” for the whole data system design. The above examples are just a glimpse of what is relevant from past work in the db community, and there is more from other areas as well, such as software engineering and machine learning. To these we should add that discussions of similar concepts can be found in early texts by some of the founders of computer science, such as Turing's on systems that can autonomously learn and Von Neumann's on systems that can self-replicate.
A challenge for the whole db community
Achieving easy to design data systems will have a significant impact on how quickly we can build efficient systems. With a vast array of challenges to resolve, easy to design systems have something for everyone, requiring contributions from numerous areas of data management research and even beyond; there is room for innovation all the way from languages and user interfaces to physical layouts and optimization, as well as balancing potential tradeoffs between ease of design, time to deployment and efficiency.
*The discussion in this post is based on research funded by an NSF Career Award on Self-designing Data Systems, Grant No. IIS-1452595.
 Philippe Cudré-Mauroux, Eugene Wu, and Samuel Madden. The case for RodentStore: An adaptive, declarative storage system. In Proceedings of the biennial Conference on Innovative Data Systems Research (CIDR), 2009.
 Yannis Klonatos, Christoph Koch, Tiark Rompf, and Hassan Chafi. Building efficient query engines in a high-level language. Proceedings of the Very Large Data Bases Endowment (PVLDB), 7(10), 2014.
 Ioannis Alagiannis, Stratos Idreos, and Anastasia Ailamaki. H2O: A Hands-free Adaptive Store. In Proceedings of the ACM SIGMOD Conference on Management of Data, 2014.
 Stratos Idreos, Martin L. Kersten, and Stefan Manegold. Database cracking. In Proceedings of the biennial Conference on Innovative Data Systems Research (CIDR), 2007.
 Stratos Idreos, Ioannis Alagiannis, Ryan Johnson, and Anastasia Ailamaki. Here are my data files. Here are my queries. Where are my results? In Proceedings of the biennial Conference on Innovative Data Systems Research (CIDR), 2011.
 Surajit Chaudhuri and Vivek R. Narasayya. An efficient cost-driven index selection tool for Microsoft SQL Server. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 1997.
 A. M. Turing. Computing Machinery and Intelligence. Mind, 59:433–460, 1950.
 John Von Neumann. Theory of Self-Reproducing Automata. University of Illinois Press, 1966.
Blogger Profile: Stratos Idreos is an assistant professor of Computer Science at Harvard University where he leads DASlab, the Data Systems Laboratory@Harvard SEAS. Stratos works on data systems architectures with emphasis on designing systems that are easy to use, easy to design and can stand the test of time. For his doctoral work on Database Cracking, Stratos won the 2011 ACM SIGMOD Jim Gray Doctoral Dissertation award and the 2011 ERCIM Cor Baayen award from the European Research Consortium for Informatics and Mathematics. In 2010 he was awarded the IBM zEnterprise System Recognition Award by IBM Research, and in 2011 he won the VLDB Challenges and Visions best paper award. In 2015 he received an NSF CAREER award for research on self-designing data systems and was awarded the 2015 IEEE TCDE Early Career Award from the IEEE Technical Committee on Data Engineering.
With the growing complexity of the Web, users often find themselves overwhelmed by the mass of choices available. For example, shopping for DVDs or clothes online becomes more and more difficult, as the variety of offers increases rapidly and gets unmanageable. To facilitate users in their selection process, recommender systems provide suggestions of potential interest on items. In particular, recommender systems aim at giving recommendations to users by estimating their item preferences and recommending those items featuring the maximal predicted preference. The prerequisite for determining such recommendations is information on the users’ interests, e.g., the users’ purchase history.
Recently, big data has posed several new challenges for computing recommendations, such as the need to quickly understand and consume data, as well as the need to integrate diverse, structured and unstructured data. Motivated by both current and traditional challenges in hunting for interesting information, we investigate the different aspects involved in the process of identifying valuable data items to suggest. We organize these aspects as follows:
a) Data enrichment and integration: In general, different types of information can be semantically enhanced to be used for computing recommendations.
b) Input information for recommendations: Enriched data for users, items and preferences can be used for different purposes towards producing recommendations. A key question is which set of users, or peers, should be used for estimating the preferences of a given user. Let’s assume three kinds of peers: close friends, domain experts and similar users.
Regarding items, semantically enhanced descriptions are used in either textual or tabular form. Regarding preferences, in addition to rating values, we have many other sources of information to consider during the recommendation process, like text reviews, user profile information and item dependencies. Users' fine-grained like and dislike preferences are captured explicitly through multi-criteria ratings or implicitly through sentiment analysis, allowing for precise delineation of user profiles, while preferences are augmented with context and temporal information, allowing users to have different choices under different circumstances.
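As a small illustration of the "similar users" kind of peers, the following sketch (the users, items and ratings are made up) selects peers by cosine similarity over rating vectors and predicts a preference as a similarity-weighted average of the peers' ratings:

```python
import math

ratings = {  # user -> {item: rating on a 1-5 scale}; toy data
    "alice": {"m1": 5, "m2": 4, "m3": 1},
    "bob":   {"m1": 4, "m2": 5, "m3": 2, "m4": 4},
    "carol": {"m1": 1, "m2": 2, "m3": 5, "m4": 5},
}

def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(u[i] ** 2 for i in common))
    nv = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

def peers(user, k=1):
    """The k most similar other users."""
    others = [(cosine(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    return [o for _, o in sorted(others, reverse=True)[:k]]

def predict(user, item):
    """Similarity-weighted average of the peers' ratings for the item."""
    ps = [(cosine(ratings[user], ratings[p]), p) for p in peers(user, k=2)]
    num = sum(s * ratings[p][item] for s, p in ps if item in ratings[p])
    den = sum(s for s, p in ps if item in ratings[p])
    return num / den if den else None
```

Swapping in close friends or domain experts as peers only changes how the `peers` function selects users, not how predictions are aggregated.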
c) Recommendations output: Traditionally, recommender systems offer suggestions within a single domain, i.e., when asking for movies or job vacancies, the suggestions consist only of movies or jobs. But why should one be limited to movies, when similar books exist as well? Such an example describes a cross-domain recommender system that can be realized thanks to data enrichment, which offers knowledge about related items. In this line, packet recommendations produce composite items consisting of a central item, possibly in the main domain of interest for a user, and a set of satellite items from different domains compatible with the central item. Compatibility can be either soft (e.g., other books that are often purchased together with the movie being browsed) or hard (e.g., battery packs that must be compatible with a laptop, or a travel destination that must be within a certain distance from the main destination). Composite items can be further constrained by specific criteria, such as a price budget on purchases or a time budget on travel itineraries.
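A minimal sketch of a budget-constrained packet recommendation might look as follows; the items, prices and compatibility scores are invented, and the greedy value-per-dollar heuristic merely stands in for the more principled composite-item algorithms in the literature:

```python
central = ("laptop", 900)          # (central item, price)
satellites = [                     # (item, price, compatibility/utility score)
    ("battery", 120, 0.9),
    ("sleeve", 40, 0.5),
    ("dock", 200, 0.7),
    ("mouse", 30, 0.6),
]

def build_packet(central, satellites, budget):
    """Greedily add satellite items by score-per-dollar within the budget."""
    name, price = central
    packet, remaining = [name], budget - price
    for item, cost, score in sorted(satellites,
                                    key=lambda s: s[2] / s[1], reverse=True):
        if cost <= remaining:      # hard constraint: stay within the budget
            packet.append(item)
            remaining -= cost
    return packet

bundle = build_packet(central, satellites, budget=1100)
```

An exact formulation would treat this as a knapsack-style optimization, but the sketch shows how a hard budget constraint shapes the composite item.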
Apart from recommendations for single users, there are cases, such as visiting a restaurant or selecting a holiday destination, where a group of people participates in an activity. For such cases, group recommendations try to satisfy the preferences of all the group members. A different aspect of group recommendations appears when specific constraints apply to the members of the group, that is, constraints on the preferences that the members of the group express for the other participants. For example, a vacation package may seem more attractive to a user if the other members of the group are of a similar age, whereas a course may be recommended to a group of students with similar or diverse backgrounds, depending on the scope of the course. Constraints may describe limitations from the user/customer or the system/company perspective. In the latter case, constraints refer to a set of properties that the group under construction must satisfy, expressing the requirements of the company concerning the group that an item is targeting.
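Two classic ways to aggregate individual predictions into a group recommendation can be sketched in a few lines; the predicted scores below are hypothetical:

```python
def average_score(member_scores):
    """The group score is the mean of the members' predicted ratings."""
    return sum(member_scores) / len(member_scores)

def least_misery_score(member_scores):
    """The group score is the unhappiest member's predicted rating."""
    return min(member_scores)

# Predicted ratings of two restaurants for a group of three friends.
predictions = {
    "bistro": [5, 5, 1],   # two love it, one hates it
    "diner":  [4, 3, 3],   # everyone is moderately happy
}

def recommend(predictions, aggregate):
    return max(predictions, key=lambda item: aggregate(predictions[item]))
```

Averaging favors the polarizing bistro, while least misery favors the diner; which aggregation is appropriate depends on the group and the activity.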
Since users usually have different preferences under different circumstances, for both single and group recommendations, context can be employed in conjunction with recommender systems. Furthermore, given that the granularity of a user’s taste captured by his profile is, in general, too coarse, recommender systems help users express their needs by allowing them to provide examples based on which the system’s suggestions are identified (e.g., Pandora).
d) Recommendations explanation and visualization: The success of recommendations, i.e., their adoption by end users, relies on explaining the cause behind them. To this end, besides the suggested items themselves, several approaches provide the user with an explanation for each suggested item, i.e., the reason that the specific item appears in the list of recommendations. In this direction, other approaches focus on the effective presentation of the recommended items to the end user, aiming at minimizing the browsing effort of the user and helping him get a broader view of them. For example, one approach exploits preferences defined by users on items, extracts a ranking of preferences that is used for ordering the suggested items, and summarizes these items in a compact, yet intuitive and representative way.
e) Exploring your past: As data and knowledge bases get larger and accessible to a more diverse and less technically oriented audience, new forms of data seeking become increasingly attractive. A specific form of exploration is re-finding. In our context, re-finding aims to locate suggestions, possibly via browsing, that have been produced and seen by a user in the past. Unlike the typical task of constructing recommendations, here we face a task of recovery. Following the notion of personal dataspaces, it is challenging to integrate all data pertaining to a user from different sources, and allow him to visualize, search and explore his recommendations through a specific piece of information. Explicit (given by a user) or implicit (extracted, for instance, from his online traces, e.g., via Foursquare) feedback on suggestions, content and other users can significantly increase the quality of recommendations and the searching features of a system. Recently, complementary types of feedback have been introduced that can be obtained through the evolution of the interests of the users in the social neighborhood of the respective user, or through his reactions, either attractions or aversions, towards past suggestions.
f) Lifelong recommendations and learning: The majority of recommendation approaches require the whole set of data (users, items and preferences) as input (the static case), which is obsolete nowadays due to the huge amount of generated data and the lifelong tracking of users' presence online. Keeping track of user history does not only result in more data for recommendations (useful for tackling the sparsity problem), but also allows for studying possible changes in user tastes and identifying periodicity in user habits. A stream-mining inspired approach is that of data ageing, which downgrades historical ratings as obsolete and pays more attention to recent ones that best reflect the current user profile. However, results on the effect of time on the quality of recommendations are sometimes contradictory, since approaches that discard past instances may lose too much signal. For such cases, more elaborate methods that separate transient factors from lasting ones appear to be beneficial. In the same scenario, the general notion of context, such as location and companion, can be employed as well. Online reviews can be used to implicitly extract this sort of information. However, long-term user monitoring implies extensive knowledge about user tastes and preferences, which might pose privacy risks for the user.
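The data-ageing idea can be sketched with a simple exponential decay of rating weights; the half-life parameter and the rating history below are illustrative assumptions:

```python
import math

def decayed_mean(history, now, half_life_days=30.0):
    """Weighted mean of (timestamp_in_days, rating) pairs, where a
    rating's weight halves every `half_life_days`."""
    lam = math.log(2) / half_life_days
    wsum = vsum = 0.0
    for t, r in history:
        w = math.exp(-lam * (now - t))  # older ratings get smaller weights
        wsum += w
        vsum += w * r
    return vsum / wsum

# Old high ratings followed by recent low ones: the user's taste changed.
history = [(0, 5), (10, 5), (85, 1), (90, 1)]
plain_mean = sum(r for _, r in history) / len(history)   # time-blind: 3.0
aged_mean = decayed_mean(history, now=90)
# The aged profile reflects the recent low ratings much more strongly.
```

Tuning the half-life controls how aggressively old signal is discarded, which is exactly where the contradictory results mentioned above come from.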
Recommendations for Data Exploration: Recommendations have always been an important area for both research and industry. Lately, they are being reshaped due to the huge amount of mostly heterogeneous data that is continuously collected from the Web. Data integration methods can handle the huge volumes of data, the different types of heterogeneity and their evolution. Interestingly, the resulting enriched data can be fed into the recommendation process, and such a wealth of data can facilitate the user experience with a recommender system.
Clearly, recommendations are considered one of the main aspects of exploratory search, since they tend to anticipate user needs by automatically suggesting the information that is most appropriate to the users and their current context. Their sophisticated capabilities are valuable for data discovery in numerous applications in various domains, such as social media, healthcare, telecommunications, e-commerce and Web analytics, business intelligence, and cyber-security. Moving forward, there is still a need to develop novel paradigms for recommendation-like interactions between users and data that emphasize user context and interactivity, with the goal of facilitating exploration, interpretation, retrieval, and assimilation of information. This year’s ExploreDB workshop, a premier workshop on Exploratory Search in Databases and the Web, co-located with ACM SIGMOD/PODS 2015, will cover the fascinating topic of recommendations, as well as encompass a wide range of research directions, highlighting data discovery and exploration.
 H. Kondylakis and D. Plexousakis. Exelixis: evolving ontology-based data integration system. In SIGMOD, 2011.
 H. Kondylakis and D. Plexousakis. Ontology evolution without tears. J. Web Sem., 19:42–58, 2013.
 W. Lu, S. Ioannidis, S. Bhagat, and L. V. S. Lakshmanan. Optimal recommendations under attraction, aversion, and social influence. In KDD, 2014.
 E. Ntoutsi, K. Stefanidis, K. Norvag, and H. Kriegel. Fast group recommendations by applying user clustering. In ER, 2012.
 E. Ntoutsi, K. Stefanidis, K. Rausch, and H. Kriegel. Strength lies in differences: Diversifying friends for recommendations through subspace clustering. In CIKM, 2014.
 S. B. Roy, S. Amer-Yahia, A. Chawla, G. Das, and C. Yu. Constructing and exploring composite items. In SIGMOD, 2010.
 K. Stefanidis, E. Ntoutsi, M. Petropoulos, K. Norvag, and H. Kriegel. A framework for modeling, computing and presenting time-aware recommendations. T. Large-Scale Data- and Knowledge-Centered Systems, 10:146–172, 2013.
 K. Stefanidis and E. Pitoura. Finding the right set of users: Generalized constraints for group recommendations. In PersDB, 2012.
 M. Zimmermann, E. Ntoutsi, and M. Spiliopoulou. Discovering and monitoring product features and the opinions on them with OPINSTREAM. Neurocomputing, 150:318–330, 2015.
Blogger Profile: Kostas Stefanidis is a research scientist at ICS-FORTH, Greece. Previously, he worked as a post-doctoral researcher at NTNU, Norway, and at CUHK, Hong Kong. He got his PhD in personalized data management from the Univ. of Ioannina, Greece. His research interests include recommender systems, personalized and context-aware data management, social networks, and information extraction, resolution and integration. He has co-authored more than 30 papers in peer-reviewed conferences and journals, including ACM SIGMOD, IEEE ICDE and ACM TODS. He is the General co-Chair of the Workshop on Exploratory Search in Databases and the Web (ExploreDB), and he will be the Web & Information Chair of SIGMOD/PODS 2016.
Blogger Profile: Eirini Ntoutsi is a researcher at LMU, Germany. She received her PhD in data mining from the Univ. of Piraeus, Greece. Previously, she worked as a data mining expert at OTE SA, the largest telecommunications operator in Greece. Her research interests lie in the areas of data mining, machine learning and databases, with a current focus on recommendations, opinionated streams and pattern stability analysis, stream mining and high dimensional data. She has co-authored more than 40 publications in international venues, including KDD, DKE and CIKM, serves as a reviewer in several conferences and journals, and co-organizes the BASNA workshop on Business Applications of Social Network Analysis.
Blogger Profile: Haridimos Kondylakis is a research scientist at ICS-FORTH, Greece. He received his PhD in Computer Science from the Univ. of Crete, Greece. His research interests span the following areas: Semantic Integration & Enrichment; Knowledge Evolution; Applications of Semantic Technologies to Information Systems. He has more than 40 publications in international conferences, books and journals including ACM SIGMOD, JWS and KER. He has also served as a reviewer in several journals and conferences, such as JWS, JODS, CIKM, EDBT, ISWC and as a PC member in premier conferences and workshops.
Big Data, and its 4 Vs – volume, velocity, variety, and veracity – have been at the forefront of societal, scientific and engineering discourse. Arguably the most important 5th V, value, is not talked about as much. How can we make sure that our data is not just big, but also valuable? WebDB 2015, the premier workshop on Web and Databases, focuses on this important topic this year. To set the stage, we have interviewed several prominent members of the data management community, soliciting their opinions on how we can ensure that data is not just available in quantity, but also in quality.
We interviewed Serge Abiteboul (INRIA Saclay & ENS Cachan), Oren Etzioni (Allen Institute for Artificial Intelligence), Divesh Srivastava (AT&T Labs-Research) with Luna Dong (Google Inc.), and Gerhard Weikum (Max Planck Institute for Informatics). We asked them about their motivation for doing research in the area of data quality, their current work, and their view on the future of the field.
Serge Abiteboul is a Senior Researcher at INRIA Saclay, and an affiliated professor at École Normale Supérieure de Cachan. He obtained his Ph.D. from the University of Southern California, and a State Doctoral Thesis from the University of Paris-Sud. He was a Lecturer at the École Polytechnique and a Visiting Professor at Stanford and Oxford University. He has been Chair Professor at Collège de France in 2011-12 and Francqui Chair Professor at Namur University in 2012-2013. He co-founded the company Xyleme in 2000. Serge Abiteboul received the ACM SIGMOD Innovation Award in 1998, the EADS Award from the French Academy of Sciences in 2007, the Milner Award from the Royal Society in 2013, and a European Research Council Fellowship (2008-2013). He became a member of the French Academy of Sciences in 2008, and a member of the Academy of Europe in 2011. He is a member of the Conseil national du numérique. His research work focuses mainly on data, information and knowledge management, particularly on the Web.
What is your motivation for doing research on the value of Big Data?
My experience is that it is getting easier and easier to get data but if you are not careful all you get is garbage. So quality is extremely important, never over-valued and certainly relevant. For instance, with some students we crawled the French Web. If you crawl naively, it turns out that very rapidly all the URLs you try to load are wrong, meaning they do not correspond to real pages, or they return pages without real content. You need to use something such as PageRank to focus your resources on relevant pages.
So then what is your current work for finding the equivalent of “relevant pages” in Big Data?
I am working on personal information, where very often the difficulty is to get the proper knowledge and, for instance, to correctly align entities from different sources. My long-term goal, pursued for instance with Amélie Marian, is the construction of a Personal Knowledge Base that gathers all the knowledge someone can get about his/her life. For each one of us, such knowledge has enormous potential value, but for the moment it lives in different silos and we cannot get this value.
This line of work is not purely technical, but involves societal issues as well. We are living in a world where companies and governments have loads of data on us and we don’t even know what they have and how they are using it. Personal Information Management is an attempt to rebalance the situation, and make personal data more easily accessible to the individuals. I have a paper on Personal Information Management Systems, talking about just that, to appear in CACM (with Benjamin André and Daniel Kaplan).
And what is your view of the killer app of Big Data?
Relational databases were a big technical success in the 1970s-80s. Recommendation of information was a big one in the 1990s-2000s, from PageRank to social recommendation. After data, after information, the next big technical success is going to be “knowledge”, say in the 2010s-20s :). It is not an easy sell, because knowledge management has often been disappointing – not delivering on its promises. By knowledge management, I mean systems capable of acquiring knowledge at a large scale, reasoning with this knowledge, and exchanging knowledge in a distributed manner. I mean techniques such as those used at Berkeley with Bud or at INRIA with Webdamlog. To build such systems, beyond scale and distribution, we have to solve quality issues: the knowledge is going to be imprecise, possibly missing, with inconsistencies. I see knowledge management as the next killer app!
Oren Etzioni is Chief Executive Officer of the Allen Institute for Artificial Intelligence. He has been a Professor at the University of Washington’s Computer Science department starting in 1991, receiving several awards including GeekWire’s Geek of the Year (2013), the Robert Engelmore Memorial Award (2007), the IJCAI Distinguished Paper Award (2005), AAAI Fellow (2003), and a National Young Investigator Award (1993). He was also the founder or co-founder of several companies including Farecast (sold to Microsoft in 2008) and Decide (sold to eBay in 2013), and the author of over 100 technical papers that have garnered over 22,000 citations. The goal of Oren’s research is to solve fundamental problems in AI, particularly the automatic learning of knowledge from text. Oren received his Ph.D. from Carnegie Mellon University in 1991, and his B.A. from Harvard in 1986.
Oren, how did you get started in your work on Big Data?
I like to say that I’ve been working on Big Data from the early days when it was only “small data”. Our 2003 KDD paper on predictive pricing started with a data set with 12K data points. By the time Farecast was sold to Microsoft, in 2008, we were approaching a trillion labeled data points. Big price data was the essence of Farecast’s predictive model, and had the elegant property that it was “self labeling”. That is, if we can label the airfare on a flight from Seattle to Boston with either a “buy now” or “wait” label—all we have to do is monitor the price movement over time to determine the appropriate label. 20/20 hindsight allows us to produce labels automatically. But for Farecast, and other applications of Big Data, the labeled data points are only part of the story. Background knowledge, reasoning, and more sophisticated semantic models are necessary to take predictive accuracy to the next level.
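The hindsight-based self-labeling Oren describes can be illustrated in a few lines; this is a toy reconstruction of the idea, not Farecast's actual logic:

```python
def self_label(prices):
    """With 20/20 hindsight, label each daily fare observation:
    'wait' if a cheaper price appears later, else 'buy now'.
    The last day gets no label, since it has no future to look at."""
    labels = []
    for i, p in enumerate(prices[:-1]):
        future_min = min(prices[i + 1:])
        labels.append("wait" if future_min < p else "buy now")
    return labels

# Daily fares for one route: price history doubles as its own training labels.
fares = [300, 280, 310, 305, 320]
labels = self_label(fares)
# labels == ["wait", "buy now", "wait", "buy now"]
```

Because the labels fall out of the price stream itself, the training set grows with monitoring alone, which is what made scaling to huge labeled datasets feasible.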
So what is the AI2 working on to bring us to this next level?
On January 1, 2014, we launched the Allen Institute for AI, a research center dedicated to leveraging modern data mining, text mining, and more in order to make progress on fundamental AI questions, and to develop high-impact AI applications.
And thinking ahead, what would be the killer application that you have in mind for Big Data?
Ideas like “background knowledge” and “common-sense reasoning” are investigated in AI, whereas Big Data and data mining have developed into a vibrant community of their own. Over the next 10 years, I see the potential for these communities to re-engage with the goal of producing methods that are still scalable, but require less manual engineering and “human intelligence” to work. The killer application would be a Big Data application that easily adapts to a new domain, and that doesn’t make egregious errors because it has “more intelligence”.
Divesh Srivastava is the head of the Database Research Department at AT&T Labs-Research. He received his Ph.D. from the University of Wisconsin, Madison, and his B.Tech from the Indian Institute of Technology, Bombay. He is an ACM fellow, on the board of trustees of the VLDB Endowment, the managing editor of the Proceedings of the VLDB Endowment (PVLDB), and an associate editor of the ACM Transactions on Database Systems. His research interests and publications span a variety of topics in data management.
Xin Luna Dong is a senior research scientist at Google. She works on enriching and cleaning knowledge for the Google Knowledge Graph. Her research interest includes data integration, data cleaning, and knowledge management. Prior to joining Google, she worked for AT&T Labs – Research and received her Ph.D. in Computer Science and Engineering at the University of Washington. She is the co-chair for WAIM’15 and has served as an area chair for SIGMOD’15, ICDE’13, and CIKM’11. She won the best-demo award in SIGMOD’05.
Divesh and Luna, you have been working on several aspects of Big Data Value. What attracts you to this topic?
Value, the 5th V of big data, is arguably the promise of the big data era. The choices of what data to collect and integrate, what analyses to perform, and what data-driven decisions to make, are driven by their perceived value – to society, to organizations, and to individuals. It is worth noting that while value and quality of big data may be correlated, they are conceptually different. For example, one can have high quality data about the names of all the countries in North America, but this list of names may not have much perceived value. In contrast, even relatively incomplete data about the shopping habits of people can be quite valuable to online advertisers.
It should not be surprising that early efforts to extract value from big data have focused on integrating and extracting knowledge from the low-hanging fruit of “head” data – data about popular entities, in the current world, from large sources. This is true both in industry (often heavily relying on manual curation) and in academia (often as underlying assumptions of the proposed integration techniques). However, focusing exclusively on head data leaves behind a considerable volume of “tail” data, including data about less popular entities, in less popular verticals, about non-current (historical) facts, from smaller sources, in languages other than English, and so on. While each data item in the “long tail” may provide only little value, the total value present in the long tail can be substantial, possibly even exceeding the total value that can be extracted solely from head data. This is akin to shops making a big profit from a large number of specialty items, each sold in a small quantity, in addition to the profit made by selling large quantities of a few popular items.
We issue a call to arms – “leave no valuable data behind” in our quest to extract significant value from big data.
What is your recent work in this quest?
Our work in this area focuses on acquiring, integrating, and extracting knowledge from big data. More recently, we have been considering a variety of ideas, including looking at collaboratively edited databases, news stories, and “local” information, where multiple perspectives and timeliness can be even more important than guaranteeing extremely high accuracy (e.g., the 99% accuracy requirement for Google’s Knowledge Graph).
We started this body of work a few years ago with the Solomon project for data fusion, to make wise decisions about finding the truth when faced with conflicting information from multiple sources. We quickly realized the importance of copy detection between sources of structured data to solve this problem, and developed techniques that iteratively perform copy detection, source trustworthiness evaluation, and truth discovery. The Knowledge Vault (KV) project and the Sonya project naturally extend the Solomon project to address the challenge of web-scale data. They focus on knowledge fusion, finding truthfulness of extracted knowledge from web-scale data (see here), and building probabilistic knowledge bases, in the presence of source errors and extraction errors, with the latter dominating (see here). The Sonya project in addition measures knowledge-based trust, determining the trustworthiness of web sources based on the correctness of the facts they provide.
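The interplay between source trustworthiness and truth discovery can be sketched as a simple fixed-point iteration. The code below is a heavily simplified illustration of that idea, not the Solomon, Knowledge Vault, or Sonya algorithms themselves; in particular, it omits copy detection entirely.

```python
# Much simplified sketch of iterative truth discovery (copy detection and
# extraction errors are omitted): alternate between (1) scoring each
# candidate value by the summed trustworthiness of the sources claiming it,
# and (2) re-estimating each source's trustworthiness as the fraction of
# its claims that currently look true.
# claims: {data_item: {source: claimed_value}}

def truth_discovery(claims, iterations=10):
    sources = {s for votes in claims.values() for s in votes}
    trust = {s: 0.8 for s in sources}  # uniform prior trustworthiness
    truths = {}
    for _ in range(iterations):
        # Step 1: per item, pick the value with the highest weighted vote.
        truths = {}
        for item, votes in claims.items():
            weight = {}
            for s, v in votes.items():
                weight[v] = weight.get(v, 0.0) + trust[s]
            truths[item] = max(weight, key=weight.get)
        # Step 2: trustworthiness = fraction of claims judged true.
        for s in sources:
            claimed = [(item, v) for item, votes in claims.items()
                       for src, v in votes.items() if src == s]
            correct = sum(1 for item, v in claimed if truths[item] == v)
            trust[s] = correct / len(claimed)
    return truths, trust
```

Even this toy version shows the key feedback loop: sources that agree with the emerging consensus gain weight, which in turn sharpens the consensus.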
Big data often has a temporal dimension, reflecting the dynamic nature of the real-world, with evolving entities, relationships and stories. Over the years we have worked on many big data integration problems dealing with evolving data. For example, our work on temporal record linkage addressed the challenging problem of entity resolution over time, which has to deal with evolution of entities wherein their attribute values can change over time, as well as the possibility that different entities are more likely to share similar attribute values over time. We have also looked at quality issues in collaboratively edited databases, with some recent work on automatically identifying fine-grained controversies over time in Wikipedia articles (upcoming paper in ICDE 2015).
More recently, we have been working on the novel topic of data source management, which is of increasing interest because of the proliferation of a large number of data sources in almost every domain of interest. Our initial research on this topic involves assessing the evolving quality of data sources, and enabling the discovery of valuable sources to integrate before actually performing the integration (see here and here).
Finally, we make a shameless plug for our new book “Big Data Integration” that should be published very soon, which we hope will serve as a starting point for interested readers to pursue additional work on this exciting topic.
And where do you think research will head tomorrow?
In keeping with our theme of “no valuable data left behind”, we think that effectively collecting, integrating, and using tail data is a challenging research direction for the big data community. There are many interesting questions that need to be answered. How should one acquire, integrate, and extract knowledge on tail entities, and for tail verticals, when there may not be many data sources providing relevant data? How can one understand the quality and value of tail data sources? How can such sources be used without compromising on value, even if the data are not of extremely high quality? How does one integrate historical data, including entities that evolve over time, and enable the exploration of the history of web data sources? In addition to freshness, what additional metrics are relevant to capturing quality over time? How does one deal with sources that provide data about future events? How can one integrate data across multiple languages and cultures? Answering these challenging questions will keep our community busy for many years to come.
Gerhard Weikum is a scientific director at the Max Planck Institute for Informatics in Saarbruecken, Germany, where he is leading the department on databases and information systems. He co-authored a comprehensive textbook on transactional systems, received the VLDB 10-Year Award for his work on automatic DB tuning, and is one of the creators of the YAGO knowledge base. Gerhard is an ACM Fellow, a member of several scientific academies in Germany and Europe, and a recipient of a Google Focused Research Award, an ACM SIGMOD Contributions Award, and an ERC Synergy Grant.
What is your motivation for doing research in the area of Big Data Value?
Big Data is the New Oil! This often heard metaphor refers to the world’s most precious raw asset — of this century and of the previous century. However, raw oil does not power any engines or contribute to high-tech materials. Oil needs to be cleaned, refined, and put in an application context to gain its true value. The same holds for Big Data. The raw data itself does not hold any value, unless it is processed in analytical tasks from which humans or downstream applications can derive insight. Here is where data quality comes into play, and in a crucial role.
Some applications may do well with huge amounts of inaccurate or partly erroneous data, but truly mission-critical applications would often prefer less data of higher accuracy and correctness. This Veracity dimension of the data is widely underestimated. In many applications, the workflows for Big Data analytics include major efforts on data cleaning, to eliminate or correct spurious data. Often, a substantial amount of manual data curation is unavoidable and incurs a major cost fraction.
OK, I see. Then what is your recent work in the area of oil refinery?
Much of the research in my group at the Max Planck Institute could actually be cast under this alternative – and much cooler – metaphor: Big Text Data is the New Chocolate!
We believe that many applications would enormously gain from tapping unstructured text data, like news, product reviews in social media, discussion forums, customer requests, and more. Chocolate is a lot more sophisticated and tasteful than oil — and so is natural-language text. Text is full of finesse, vagueness and ambiguities, and so could at best be seen as Uncertain Big Data. A major goal of our research is to automatically understand and enrich text data in terms of entities and relationships and this way enable its use in analytic tasks — on par with structured big data.
We have developed versatile and robust methods for discovering mentions of named entities in text documents, like news articles or posts in social media, and disambiguating them onto entities in a knowledge base or entity catalog. The AIDA software is freely available as open source code. These methods allow us to group documents by entities, entity pairs or entity categories, and compute aggregates on these groups. Our STICS demonstrator shows some of the capabilities for semantic search and analytics. We can further combine this with the detection and canonicalization of text phrases that denote relations between entities, and we can capture special kinds of text expressions that bear sentiments (like/dislike/support/oppose/doubt/etc.) or other important information.
Having nailed down the entities, we can obtain additional structured data from entity-indexed data and knowledge bases to further enrich our text documents and document groups. All this enables a wealth of co-occurrence-based analytics for comparisons, trends, outliers, and more. Obviously, for lifting unstructured text data to this value-added level, the Veracity of mapping names and phrases into entities and relations is decisive.
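Once mentions have been resolved to canonical entities, the co-occurrence-based analytics mentioned above reduce to grouping and counting. A toy sketch (assuming entity disambiguation, as done by AIDA, has already happened upstream):

```python
from collections import Counter
from itertools import combinations

# Toy sketch of entity co-occurrence analytics over already-disambiguated
# documents (entity linking itself is assumed done upstream): count how
# often each pair of canonical entities appears in the same document.

def cooccurrence_counts(docs):
    """docs: list of sets of canonical entity identifiers."""
    pairs = Counter()
    for entities in docs:
        # sort so each unordered pair is counted under one canonical key
        for a, b in combinations(sorted(entities), 2):
            pairs[(a, b)] += 1
    return pairs
```

Aggregates like these are only as good as the disambiguation step that feeds them, which is exactly the Veracity point above.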
For example, when performing a political opinion analysis about the Ukrainian politician and former boxer Klitschko, one needs to be careful about not confusing him with his brother who is actively boxing. A news text like “former box champion Klitschko is now the mayor of Kiev” needs to be distinguished from “box champion Klitschko visited his brother in Kiev”. Conversely, a recent text like “the mayor of Kiev met with the German chancellor” should also count towards the politician Vitali Klitschko although it does not explicitly mention his name.
Why New Chocolate?
Well, in the Aztec Empire, cocoa beans were so valuable that they were used as currency! Moreover, cocoa contains chemicals that trigger the production of the neurotransmitter Serotonin in our brains – a happiness substance! Yes, you may have to eat thousands of chocolate bars before you experience any notable kicks, but for the sake of the principle: chocolate is so much richer and creates so much more happiness than oil.
Thanks for this culinary encouragement! What do you think will be the future of the field?
Data quality is more than the quest for Veracity. Even if we could ensure that the database has only fully correct and accurate data points, there are other quality dimensions that often create major problems: incompleteness, bias and staleness are three aspects of paramount importance.
No data or knowledge base can ever be perfectly complete, but how do we know which parts we do not know? For a simple example, consider an entertainment music database, where we have captured a song and ten different cover versions of it. How can we tell that there really are only ten covers of that song? If there are more, how can we rule out that our choice of having these ten in the database is not biased in any way – for example, reflecting only Western culture versions and ignoring Asian covers? Tapping into text sources, in the New Chocolate sense, can help complete the data, but is also prone to “reporting bias”. The possibility that some of the data is stale, to different degrees, makes the situation even more complex.
Finally, add the Variety dimension on top of all this — not a single database but many independent data and text sources with different levels of incompleteness, bias, and staleness. Assessing the overall quality that such heterogeneous and diverse data provides for a given analytic task is a grand challenge. Ideally, we would like to understand how the quality of that data affects the quality of the insight we derive from it. If we consider data cleaning measures, what costs do we need to pay to achieve which improvements in data quality and analytic-output quality? I believe these are pressing research issues; their complexity will keep the field busy for the coming years.
Do you agree with Serge Abiteboul that knowledge management will be the killer app of Big Data? Do you share Gerhard Weikum’s opinion that Big Text Data is the New Chocolate? Do you have ideas on how to achieve the “no valuable data left behind” mantra that Divesh Srivastava and Luna Dong evoke? Does your work marry the domains of AI and Big Data, as Oren Etzioni proposes? We would be delighted to hear your opinion and your latest contribution in this field! This year’s WebDB workshop, which will be co-located with ACM SIGMOD, will provide a premier venue for discussing issues of big data quality. Its theme is “Freshness, Correctness, Quality of Information and Knowledge on the Web”. This theme encompasses a wide range of research directions, from focused crawling and time-aware search and ranking, to information extraction and data integration, to management, alignment, curation, and integration of structured knowledge, and to information corroboration and provenance. However, papers on all aspects of the Web and databases are solicited. We are looking forward to your submissions, and to interesting discussions about whether the future will bring us not just big data, but also good data!
Blogger Profile: Julia Stoyanovich is an Assistant Professor of Computer Science at Drexel University. She was previously a postdoctoral researcher and a CIFellow at the University of Pennsylvania. Julia holds M.S. and Ph.D. degrees in Computer Science from Columbia University, and a B.S. in Computer Science and in Mathematics and Statistics from the University of Massachusetts at Amherst. After receiving her B.S. Julia went on to work for two start-ups and one real company in New York City, where she interacted with, and was puzzled by, a variety of massive datasets. Julia’s research focuses on developing novel information discovery approaches for large datasets in presence of rich semantic and statistical structure. Her work has been supported by the NSF and by Google.
Blogger Profile: Fabian M. Suchanek is an associate professor at the Telecom ParisTech University in Paris. Fabian developed inter alia the YAGO-Ontology, one of the largest public knowledge bases on the Semantic Web, which earned him an honorable mention of the SIGMOD dissertation award. His interests include information extraction, automated reasoning, and knowledge bases. Fabian has published around 40 scientific articles, among others at ISWC, VLDB, SIGMOD, WWW, CIKM, ICDE, and SIGIR, and his work has been cited more than 3500 times.
Graph data management has seen a resurgence in recent years, because of an increasing realization that querying and reasoning about the structure of the interconnections between entities can lead to interesting and deep insights into a variety of phenomena. The application domains where graph or network analytics are regularly applied include social media, finance, communication networks, biological networks, and many others. Despite much work on the topic, graph data management is still a nascent topic with many open questions. At the same time, I feel that the research in the database community is fragmented and somewhat disconnected from application domains, and many important questions are not being investigated in our community. This blog post is an attempt to summarize some of my thoughts on this topic, and what exciting and important research problems I think are still open.
At its simplest, graph data management is about managing, querying, and analyzing a set of entities (nodes) and interconnections (edges) between them, both of which may have attributes associated with them. Although much of the research has focused on homogeneous graphs, most real-world graphs are heterogeneous, and the entities and the edges can usually be grouped into a small number of well-defined classes.
Graph processing tasks can be broadly divided into a few categories. (1) First, we may want to execute standard SQL queries, especially aggregations, by treating the node and edge classes as relations. (2) Second, we may have queries focused on the interconnection structure and its properties; examples include subgraph pattern matching (and variants), keyword proximity search, reachability queries, counting or aggregating over patterns (e.g., triangle/motif counting), grouping nodes based on their interconnection structures, path queries, and others. (3) Third, there is usually a need to execute basic or advanced graph algorithms on the graphs or their subgraphs, e.g., bipartite matching, spanning trees, network flow, shortest paths, traversals, finding cliques or dense subgraphs, graph bisection/partitioning, etc. (4) Fourth, there are “network science” or “graph mining” tasks where the goal is to understand the interconnection network, build predictive models for it, and/or identify interesting events or different types of structures; examples of such tasks include community detection, centrality analysis, influence propagation, ego-centric analysis, modeling evolution over time, link prediction, frequent subgraph mining, and many others [New10]. There is much research still being done on developing new such techniques; however, there is also increasing interest in applying the more mature techniques to very large graphs and doing so in real-time. (5) Finally, many general-purpose machine learning and optimization algorithms (e.g., logistic regression, stochastic gradient descent, ADMM) can be cast as graph processing tasks in appropriately constructed graphs, allowing us to solve problems like topic modeling, recommendations, matrix factorization, etc., on very large inputs [Low12].
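To make category (2) concrete, here is a minimal sketch of triangle counting over an adjacency-set representation of an undirected graph (an illustration, not any particular system's implementation):

```python
# A concrete instance of category (2): counting triangles in an undirected
# graph stored as adjacency sets. Each triangle {u, v, w} is counted exactly
# once by only looking "forward" from each edge (u < v < w under the node
# ordering).

def count_triangles(adj):
    """adj: dict mapping node -> set of neighbors (undirected)."""
    count = 0
    for u in adj:
        for v in adj[u]:
            if v <= u:
                continue
            # common neighbors w > v close the triangle u-v-w exactly once
            count += sum(1 for w in adj[u] & adj[v] if w > v)
    return count
```

Even this small example hints at the access patterns such queries induce (set intersections over neighbor lists), which is exactly what storage representations need to serve well.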
Prior work on graph data management could itself be roughly divided into work on specialized graph databases and on large-scale graph analytics, which have largely evolved separately from each other; the former has considered end-to-end data management issues including storage representations, transactions, and query languages, whereas the latter work has typically focused on processing specific tasks or types of tasks over large volumes of data. I will discuss those separately, focusing on whether we need “new” systems for graph data management and on open problems.
Graph Databases and Querying
The first question I am usually asked when I mention graph databases is whether we really need a separate database system for graphs, or whether relational databases suffice. Personally I believe that graph databases provide a significant value for a large class of applications and will emerge as another vertical.
If the goal is to support some simple graph algorithms or graph queries on data that is stored in an RDBMS, then it is often possible to do those using SQL and user-defined functions and aggregations. However, for more complex queries, a specialized graph database engine is likely to be much more user-friendly and likely to provide significant performance advantages. Many of the queries listed above either cannot be mapped to SQL (e.g., flexible subgraph pattern matching, keyword proximity search) or the equivalent SQL is complex and hard to understand or debug. An abstraction layer that converts queries from a graph query language to SQL could address some of these shortcomings, but that will likely only cover a small fraction of the queries mentioned above. More importantly, graph databases provide efficient programmatic access to the graph, allowing one to write arbitrary algorithms against them if needed. Since there is usually a need to execute some graph algorithms or network science tasks in the application domains discussed above, that feature alone makes a graph database very appealing. Most graph data models also support flexible schemas — although an orthogonal issue, new deployments may choose a graph database for that reason.
Whether a specialized graph database provides significant performance advantages over RDBMSs for the functionality common to both is somewhat less clear. For many graph queries, the equivalent SQL, if one exists, can involve many joins and unions and it is unlikely the RDBMS query optimizer could optimize those queries well (especially given the higher use of self-joins). It may also be difficult to choose among different ways to map a graph query into SQL. Queries that require recursion (e.g., reachability) are difficult to execute in a relational database, but are natural for graph databases. Graph databases can also employ specific optimizations geared towards graph queries and traversals. For example, graph databases typically store all the edges for a node with the node to avoid joins, and such denormalization can significantly help with traversal queries, especially queries that traverse multiple types of edges simultaneously (e.g., for subgraph pattern matching). Replicating node information with neighbors can reduce the number of cache misses and distributed traversals for most graph queries (at the expense of increased storage and update costs). Similarly, min cut-based graph partitioning techniques help in reducing the number of distributed queries or transactions, and similar optimizations can be effective in multi-core environments as well. On the other hand, there is less work on query optimization in graph databases, and for simple queries (especially simple subgraph pattern matching queries), the query optimizer in relational databases may make better decisions than many of today’s graph databases.
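The recursion point is easy to see with reachability. In a relational database this needs a recursive CTE (`WITH RECURSIVE`) over an edge table; with the programmatic access to adjacency lists that graph databases provide, it is a few lines of breadth-first traversal. A minimal in-memory sketch:

```python
from collections import deque

# Reachability as a plain traversal: with direct access to adjacency lists
# (as graph databases provide), BFS suffices, whereas the relational
# equivalent needs a recursive CTE over an edge table. Minimal sketch over
# an in-memory adjacency dict.

def reachable(adj, source):
    """Return the set of nodes reachable from source."""
    seen, queue = {source}, deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen
```

The traversal also illustrates why node-local edge storage helps: each step touches only one node's neighbor list.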
I think exploring such optimizations and understanding the tradeoffs better are rich topics for further research. For example, how the graph is laid out, both in persistent storage and in memory, can have a significant impact on the performance, especially in multi-core environments. We also need to better understand the common access patterns that are induced by different types of queries or tasks, and the impact of different storage representations on the performance of those access patterns. Another key challenge for graph querying is developing a practical query language. There is much theoretical work on this problem [Woo12] and several languages are currently used in practice, including SPARQL, Cypher, Gremlin, and Datalog. Among those, SPARQL and Cypher are based primarily on subgraph pattern matching and can handle a limited set of queries, whereas Gremlin is a somewhat low-level language and may not be easy to optimize. Datalog (used in LogicBlox and Datomic) perhaps strikes the best balance, but is not as user-friendly for graph querying and may need standardization of some of the advanced constructs, especially aggregates.
Unfortunately I see little work on these problems, and on end-to-end graph databases in general, in our community. There are quite a few graph data management systems being actively built in the industry, including Neo4j, Titan, OrientDB, Datomic, DEX, to name a few, where these issues are being explored. Much of the work in our community, on the other hand, is more narrowly focused on developing search algorithms and indexing techniques for specific types of queries; while that work has resulted in many innovative techniques, for wide applicability and impact, it is also important to understand how those fit into a general-purpose graph data management system.
Large-scale Graph Analytics
Unlike the above scenario, the case for new systems for graph analysis tasks, broadly defined to include graph algorithms and network science and graph mining tasks, is more persuasive. Batch analytics systems like relational data warehouses and MapReduce-based systems are not a good fit for graph analytics as is. From the usability perspective, it is not natural to write graph analysis tasks using those programming frameworks. Some graph programming models (e.g., the vertex-centric programming model [Mal10]) can be supported in a relational database through use of UDFs and UDAs [Jin14]; however, it is not clear if the richer programming frameworks (discussed below) can also be efficiently supported. Further, many graph analysis tasks are inherently iterative, with many iterations and very little work per vertex per iteration, and thus the overheads of those systems may start dominating and may be hard to amortize away.
A more critical question, in my opinion, is whether the popular vertex-centric programming model is really a good model for graph analytics. To briefly recap, in this model, users write vertex-level compute programs, that are then executed iteratively by the framework in either a bulk synchronous fashion or asynchronous fashion using message passing or shared memory. This model is well-suited for some graph processing tasks like computing PageRank or connected components, and also for several distributed machine learning and optimization tasks that can be mapped to message passing algorithms in appropriately constructed graphs [Low12]. Originally introduced in this context in Google’s Pregel system [Mal10], several graph analytics systems are built around this model (e.g., Giraph, Hama, GraphLab, PowerGraph, GRACE, GPS, GraphX).
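The bulk synchronous flavor of the model can be sketched in a few lines. The following is a single-machine toy in the spirit of Pregel, not any particular system's API, applied to PageRank: in each superstep, every vertex sums its incoming messages, updates its rank, and sends rank divided by out-degree along its outgoing edges (dangling vertices are ignored for brevity).

```python
# Single-machine sketch of the bulk-synchronous vertex-centric model (in the
# spirit of Pregel, not any system's actual API), applied to PageRank. Each
# superstep: message-passing phase, then a compute phase in which a vertex
# program sees only its own incoming messages. Assumes every edge target is
# a key of out_edges and every vertex has at least one outgoing edge.

def vertex_centric_pagerank(out_edges, supersteps=30, d=0.85):
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    for _ in range(supersteps):
        # message-passing phase: each vertex emits rank / out-degree
        inbox = {v: [] for v in out_edges}
        for v, targets in out_edges.items():
            for t in targets:
                inbox[t].append(rank[v] / len(targets))
        # compute phase: each vertex updates from its own messages only
        rank = {v: (1 - d) / n + d * sum(msgs) for v, msgs in inbox.items()}
    return rank
```

Note how naturally PageRank decomposes into purely vertex-local work; the criticism below is precisely that many other analysis tasks do not.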
However, most graph analysis tasks (e.g., the popular modularity optimization algorithm for community detection, betweenness centralities) or graph algorithms (e.g., matching, partitioning) cannot be written using the vertex-centric programming model while permitting efficient execution. The model limits the compute program’s access to a single vertex’s state and so the overall computation needs to be decomposed into smaller local tasks that can be (largely) independently executed; it is not clear how to do this for most of the computations discussed above, without requiring a large number of iterations. Even local neighborhood-centric analysis tasks (e.g., counting motifs, identifying social circles, computing local clustering coefficients) are inefficient to execute using this model; one could execute such a task by constructing multi-hop neighborhoods in each node’s local state by exchanging neighbor lists, but the memory required to hold that state can quickly make it infeasible [Qua14]. I believe these limitations are the main reason why most of the papers about this model focus on a small set of tasks like PageRank, and also why we don’t see broad adoption of this model for graph analysis tasks, unlike the MapReduce framework, which was very quickly and widely adopted.
Some of the alternative, and more expressive, programming models proposed in recent years include distributed Datalog-based framework used by Socialite [Seo13], data-centric programming models of Ligra [Shu13] and Galois [Ngu13], Green-Marl DSL [Hon12], and NScale framework from our recent work [Qua14]. Unlike the vertex-centric frameworks, however, distributed data-parallel execution is not straightforward for these frameworks, and investigating the trade-offs between expressiveness, ability to parallelize the computations, and ease-of-use remains a crucial challenge. The equivalence between graph analytics and matrix operations [Mat13], and whether that leads to better graph analysis systems, also need to be explored in more depth.
On a related note, the need for distributed execution of graph processing tasks is often taken as a given. However, graphs with tens to hundreds of billions of edges can be loaded onto a single powerful machine today, depending on the amount of information per node that needs to be processed (Ligra reports experiments on a graph with 12.9 billion edges with 256GB memory [Shu13]; in our recent work, we were able to execute a large number of streaming aggregate queries over a graph with 300 million edges on a 64GB machine [Mon14a]). Given the difficulty in distributing many graph analysis/querying tasks, it may be better to investigate approaches that eliminate the need to execute any single query or task in a distributed fashion (e.g., through aggressive compression, careful encoding of adjacency lists, or careful staging to disk or SSD (a la GraphChi)), while parallelizing independent queries/tasks across different machines.
Other Open questions
Despite much work, there are many important and hard problems that remain open in graph data management, in addition to the ones discussed above; more challenges are likely to come up as graph querying and analytics are broadly adopted.
Need for a Benchmark: Given the complex tradeoffs, many of the questions discussed above would be hard to answer without some representative workloads and benchmarks, especially because the performance of a system may be quite sensitive to the skew in the degree distribution and the graph diameter. Some of these issues, e.g., storage representation, have been studied in depth in the context of RDF triple-stores, but the benchmarks established there appear to focus on scale and do not feature sufficient variety in the queries. A benchmark covering a variety of graph analysis tasks would also help significantly towards evaluating and comparing the expressive power and the performance of different frameworks and systems. Benchmarks would also help reconcile some of the recent conflicting empirical comparisons, and would help shed light on specific design decisions that impact performance significantly.
Temporal and real-time analytics: Most real-world graphs are highly dynamic in nature and often generate large volumes of data at a very rapid rate. Temporal analytics or querying over historical traces can lead to deeper insights into various phenomena, especially those related to evolution or change. One of the key challenges here is how to store the historical trace compactly while still enabling efficient execution of point queries and global or neighborhood-centric analysis tasks [Khu13]. Key differences from temporal databases, a topic that has seen much work, appear to be the scale of data, focus on distributed and in-memory environments, and the need to support global analysis tasks (which usually require loading entire historical snapshots into memory). Similarly real-time querying and analytics, especially anomaly detection, present several unique challenges not encountered in relational data stream processing [Mon14b].
Graph extraction: Another interesting, and practically important, question is how to efficiently extract a graph, or a collection of graphs, from non-graph data stores. Most graph analytics systems assume that the graph is provided explicitly. However, in many cases, the graph may have to be constructed by joining and combining information spread across a set of relations or files or key-value stores. A general framework that allows one to specify what graph or graphs need to be constructed for analysis, and how they map to the data stored in the persistent data stores would significantly simplify the end-to-end process of graph analytics. Even if the data is stored in a graph data store, often we only need to load a set of subgraphs of that graph for further analysis [Qua14], and similar framework would be needed to specify what subgraphs are to be extracted.
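A minimal sketch of what such extraction looks like in practice (the relation and column names here are purely hypothetical): build a co-authorship graph by grouping an authorship relation of (paper_id, author_id) pairs by paper and linking co-authors.

```python
from collections import defaultdict
from itertools import combinations

# Minimal sketch of graph extraction from non-graph data (the relation and
# its columns are hypothetical): derive a co-authorship graph from an
# authorship relation of (paper_id, author_id) tuples by grouping rows by
# paper and connecting every pair of co-authors.

def extract_coauthor_graph(authorship):
    by_paper = defaultdict(set)
    for paper, author in authorship:
        by_paper[paper].add(author)
    adj = defaultdict(set)
    for authors in by_paper.values():
        for a, b in combinations(authors, 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj
```

A declarative framework for such extractions would let this group-join-connect pattern be specified once, rather than re-coded per analysis.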
The above list is naturally somewhat skewed towards the problems we are working on in our group at Maryland, and towards developing general-purpose graph data management systems. In addition, there is also much work that needs to be done in the application areas like social media, finance, cybersecurity, etc.; in developing graph analytics techniques that lead to meaningful insights in those domains; in understanding what types of query workloads are typical; and in handling those over large volumes of data.
Blogger Profile: Amol Deshpande is an Associate Professor in the Department of Computer Science at the University of Maryland with a joint appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS). He received his Ph.D. from University of California at Berkeley in 2004. His research interests include uncertain data management, graph analytics, adaptive query processing, data streams, and sensor networks.