On Data Exploration in the era of Big Data

Big Data, data exploration, Interview
We are witnessing data of unprecedented volume, variety and velocity. Such data is collected from almost every aspect of human activity and stored in large repositories in order to be later analyzed and turned into useful insights. The storage model is no longer one in which data is placed in predefined structures with well-known semantics. Instead, the data is stored in raw format, and data engineers are called upon to make sense of it by experimenting with different data lookups or complex queries. This is known as data exploration. Driven by the keynote speakers in the ExploreDB workshop, alongside some major players in the market, we have asked Sihem Amer-Yahia (CNRS), Rick Cole (Tableau Software), Gautam Das (University of Texas at Arlington), Yanlei Diao (Université Paris-Saclay) and Stratos Idreos (Harvard University) for their opinions and recent work on data exploration. Here is what they told us.

Q1: Tell us about your most recent development / research work related to data exploration and explain why you have considered it.

Sihem: I have been looking into the exploration of user data. User data can be acquired from various domains ranging from medical records to the social Web. An example is rated datasets that are characterized by a combination of user demographics such as age and occupation, and user actions such as rating a movie. User data exploration has been formulated as identifying group-level behavior such as “Asian women who publish regularly in databases.” Group-level exploration enables new findings and addresses issues raised by the peculiarities of user data, i.e. noise and sparsity. This kind of data exploration relates to a special field of business analytics, referred to as behavioral analytics whose goal is to unveil insights into the behavior of consumers on eCommerce platforms, IoT and mobile applications. The ability to explore user groups serves analysts in their role as data scientists and domain experts who seek to conduct large-scale population studies, and gain insights on various population segments. It is also appealing to users in their role as information consumers who use the social Web for routine tasks such as finding a book club or choosing a restaurant. Group-based exploration opens new research questions such as multi-objective user group discovery, hypothesis validation on user groups, and interactive user group analysis.

Rick: My work concerns query processing, such as query languages, query optimization, and query evaluation, as well as the overall provenance of query processing. Tableau provides a wealth of opportunity for research in all of these areas given its originating emphasis on visual query language, query optimization for disparate data sources, and high-performance query evaluation in the Hyper data engine. My recent work includes how to leverage fine-grained data lineage for human-in-the-loop error analysis during data preparation, the visual analysis of queries from programmatic generation to data source evaluation, cardinality estimation using data sketches, and solving query optimization problems using machine learning. Some of these projects are directly concerned with the user experience and others indirectly through improvements to analytic performance. I enjoy working on query processing end-to-end within Tableau’s query ecosystem.

Gautam: About two decades ago, relational databases and information retrieval were relatively independent fields, with their own communities of researchers and practitioners. The world of relational databases was black-and-white; data lived in a structured home, accessed by precise query languages such as SQL supporting a Boolean search and retrieval model. Information retrieval was a fuzzier world; data was unstructured, and there were no complex query languages – keyword search and relevance-based ranked retrieval ruled. Recognizing the need for tighter integration between the two fields, several researchers started to investigate common approaches to problems in both areas. My own research has focused on developing IR search paradigms for relational databases, and my early work was on keyword search (the DBXplorer project), automated ranking, and top-k querying in relational databases. More recently, I have been working in three data exploration areas. I have investigated faceted search techniques on structured as well as semi-structured data repositories (such as Wikipedia); such approaches enable the user to explore the data along different, relatively independent dimensions or facets. I have also become very interested in the empty-answers and many-answers problems that naïve users often encounter in data exploration, since they may not have a complete idea of what they are looking for: they may over-specify the items of interest and find no item in the source satisfying all the provided conditions, or they may under-specify the items of interest and find too many items satisfying the given conditions. Our recent efforts have focused on developing iterative query reformulation techniques by which the system guides the user in a systematic way through several small steps, where each step suggests slight query modifications, until the query reaches a form that generates desirable answers.
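A minimal sketch of this iterative-reformulation idea is shown below. The helper names and the uniform relax/tighten strategy are my own illustrative simplifications, not the published technique: the system nudges the user's range predicates, step by step, until the result set is neither empty nor overwhelming.

```python
# Illustrative sketch of iterative query reformulation for the
# empty-answers / many-answers problems. Helper names and the uniform
# relax/tighten strategy are assumptions for illustration only.

def run_query(rows, predicates):
    """Return rows satisfying every (attribute -> (low, high)) range predicate."""
    return [r for r in rows
            if all(lo <= r[attr] <= hi for attr, (lo, hi) in predicates.items())]

def suggest_reformulation(rows, predicates, target_lo=1, target_hi=10, step=0.1):
    """Widen or narrow every range predicate a little per step until the
    result size falls inside [target_lo, target_hi]."""
    preds = {a: list(b) for a, b in predicates.items()}
    for _ in range(100):                          # bound the number of steps
        n = len(run_query(rows, preds))
        if target_lo <= n <= target_hi:
            break
        for attr, (lo, hi) in preds.items():
            width = (hi - lo) or 1.0
            if n < target_lo:                     # too few answers: relax
                preds[attr] = [lo - step * width, hi + step * width]
            else:                                 # too many answers: tighten
                preds[attr] = [lo + step * width, hi - step * width]
    return {a: tuple(b) for a, b in preds.items()}
```

For instance, an over-specified predicate like `{"price": (200, 210)}` over a catalog priced 0–95 is gradually widened until it returns at least one item, while an under-specified one is narrowed until the result is small enough to inspect.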
Finally, I have been investigating regret-ratio and rank-regret problems. Given a dataset with multiple attributes, skylines and convex hulls are subsets that are guaranteed to contain the top choices of any monotonic or linear ranking function. However, a major issue with such subsets is that they can be a significant portion of the dataset, especially when the data has many features or dimensions. One compelling approximation technique is to define the notion of “regret”, where the objective is to find a very small subset of the dataset such that the top items in this subset have a score or rank within a user-defined “regret” of the top item, no matter what ranking function is used. While conventional data summarization is based on preserving the overall data distribution, I find this type of summarization, where the objective is to pick a subset that preserves the distribution of the data “at the extremities”, very compelling and novel.
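The regret notion can be made concrete with a small sketch. This is illustrative only (a sampled greedy heuristic over randomly drawn linear ranking functions, assuming positive attribute values), not the actual algorithms from this line of work:

```python
# Illustrative regret-ratio sketch: how much worse is the best item of a
# small subset than the best item of the whole dataset, across many
# linear ranking functions? (Assumes positive attribute values.)
import random

def best_score(points, w):
    """Top score of a linear ranking function w over a set of points."""
    return max(sum(wi * xi for wi, xi in zip(w, p)) for p in points)

def max_regret_ratio(dataset, subset, weight_samples):
    """Worst-case (sampled) relative loss from ranking only the subset."""
    worst = 0.0
    for w in weight_samples:
        full, sub = best_score(dataset, w), best_score(subset, w)
        worst = max(worst, (full - sub) / full)
    return worst

def greedy_regret_subset(dataset, k, n_samples=200, seed=0):
    """Greedily grow a k-point subset that keeps the sampled regret small."""
    rng = random.Random(seed)
    samples = [[rng.random() for _ in dataset[0]] for _ in range(n_samples)]
    subset = [max(dataset)]          # seed with the lexicographically best point
    while len(subset) < k:
        best_p = min((p for p in dataset if p not in subset),
                     key=lambda p: max_regret_ratio(dataset, subset + [p], samples))
        subset.append(best_p)
    return subset
```

The point of the exercise: a handful of points can keep the worst-case regret low even when the skyline itself is large.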

Yanlei: There is an increasing gap between the fast growth of data and the limited human ability to comprehend it. Consequently, there has been a growing demand for data management tools that can bridge this gap and help the user retrieve high-value content from data more effectively. In our project, we aim to build interactive data exploration as a new database service, using an approach called “explore-by-example”. In this framework, the database content is considered as a set of tuples, and the user is interested in some of them but not all. In the data exploration process, the system allows the user to interactively label tuples as “interesting” or “not interesting”, so that it can construct an increasingly accurate model of the user interest. Eventually, the model is turned into a query that will retrieve all relevant tuples from the database.
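A toy version of this loop might look as follows. This is an illustrative sketch only, assuming a one-dimensional dataset and an interval hypothesis; the real system handles multidimensional data and far richer user-interest models:

```python
# Toy "explore-by-example" loop: the user's hidden interest is a range of
# values; the system asks for labels (uncertainty sampling near the current
# boundary) and finally emits a range query over the whole database.

def explore_by_example(tuples, oracle, rounds=25):
    """tuples: numeric values; oracle(t) -> True if the user finds t interesting."""
    pos, neg, unlabeled = [], [], sorted(tuples)
    for _ in range(rounds):
        if not unlabeled:
            break
        if pos:
            lo, hi = min(pos), max(pos)
            # label the most "uncertain" tuple: nearest the current boundary
            t = min(unlabeled, key=lambda x: min(abs(x - lo), abs(x - hi)))
        else:
            t = unlabeled[len(unlabeled) // 2]   # start from the median
        unlabeled.remove(t)
        (pos if oracle(t) else neg).append(t)
    if not pos:
        return []
    lo, hi = min(pos), max(pos)
    # the learned model becomes a range query over the full database
    return [t for t in tuples if lo <= t <= hi]
```

With enough labeling rounds on this toy data, the learned interval converges to the user's hidden range and the final query retrieves every relevant tuple.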

Stratos: I work on data exploration from the system design point of view: How do we design storage and access methods that assist data scientists through long exploration steps? A key ingredient is making these interactions fast; then, users can quickly cycle through different queries to build insights. Another critical element is designing systems that are aware of the fact that data exploration is a multi-step process with repeated data access and computation on overlapping data ranges. Our most recent work in this space is Data Canopy, which makes repeated calculations of statistical measures faster by avoiding access and computations over the same data. Data Canopy breaks statistics down to their essential ingredients (aggregations), computes those aggregations only once, remembers them, and out of those it synthesizes numerous statistical measures at query time without having to access the data or calculate everything from scratch. This way, individual data scientists performing lengthy exploration steps, zooming in and out of overlapping data sets, trying out alternative statistics (that share ingredients), see a significant speedup, making the whole process more interactive. At the same time, statistics are part of numerous essential pipelines and algorithms related to exploration and data science, all of which can benefit when performing repeated access and computation. For example, linear regression, Bayesian classification, and collaborative filtering see orders-of-magnitude speedups when using Data Canopy. Canopy is part of a longer-term project, which we call Queriosity (a portmanteau of Query and Curiosity), where we investigate opportunities to apply this logic of using fine-grained ingredients and synthesis across various data- and computation-hungry data science tasks.
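The aggregate-reuse idea can be sketched with prefix sums. This is illustrative only (the actual system manages a cache of chunk-level aggregates across many columns): cache sums of x, x², y, y², and x·y once, then synthesize mean, variance, and correlation over any range without touching the base data again.

```python
# Sketch of the Data Canopy idea: precompute basic aggregates (prefix sums)
# once, then answer statistics over any range [i, j) in O(1) without
# rescanning the data.

class StatCache:
    def __init__(self, xs, ys):
        self.sx  = self._prefix(xs)
        self.sy  = self._prefix(ys)
        self.sxx = self._prefix([x * x for x in xs])
        self.syy = self._prefix([y * y for y in ys])
        self.sxy = self._prefix([x * y for x, y in zip(xs, ys)])

    @staticmethod
    def _prefix(vals):
        out = [0.0]
        for v in vals:
            out.append(out[-1] + v)
        return out

    def _rng(self, pre, i, j):           # sum over the range [i, j)
        return pre[j] - pre[i]

    def mean(self, i, j):
        return self._rng(self.sx, i, j) / (j - i)

    def variance(self, i, j):            # population variance of x over [i, j)
        n, m = j - i, self.mean(i, j)
        return self._rng(self.sxx, i, j) / n - m * m

    def correlation(self, i, j):         # Pearson correlation of x, y over [i, j)
        n = j - i
        mx, my = self.mean(i, j), self._rng(self.sy, i, j) / n
        cov = self._rng(self.sxy, i, j) / n - mx * my
        vx = self._rng(self.sxx, i, j) / n - mx * mx
        vy = self._rng(self.syy, i, j) / n - my * my
        return cov / (vx * vy) ** 0.5
```

Because mean, variance, and correlation all share the same cached ingredients, zooming in and out of overlapping ranges never rescans the underlying data.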

Q2: How do you think data exploration has changed in the last 10 years with the appearance of Big Data?

Sihem: Data exploration has shifted from a database functionality to a sophisticated layer in the Big Data stack. While scalability remains a concern, a fundamental change is the integration of different modes of exploration, on both raw data and insights, that serve different user roles. Data exploration can be roughly classified into by-query, by-example, by-facet, and by-analytics. The first three modes are traditional. In by-query, the interaction between the user and the data is done through repeated querying. In by-example, the user wants to explore data items that are similar to some input examples. By-facet is a form of querying where the user explores data one attribute at a time. By-analytics exploration is more recent and relies on finding regions in the data whose distribution conforms to input ones. Additionally, the development of sophisticated data processing layers such as mining and machine learning blurs the boundaries between raw data and insights. Data explorers need tools to simultaneously examine both. Hence, the rise of different exploration modes.

The proliferation of exploration modes is also due to the empowerment of traditional data consumers and the need to provide them with increasingly sophisticated tools to find what they are looking for. Traditionally, Data Scientists have largely relied on by-query exploration to quickly discover subsets of interest in raw data. Lately, they have been using by-analytics to explore insights. Domain Experts, who have a good understanding of their data, are not necessarily query experts. Their method of choice is often by-facet, to see their data from different angles. However, by-example is also catching up with them when they explore insights, as they want a see-more feature. Data Consumers, the least expert of all, tend to use by-example exploration. If they know domain attributes (e.g., in the case of Amazon or eBay), then by-facet exploration is also used. However, the aforementioned categorization between different user roles is fading away; as the boundaries between expert and novice users fall, we expect all users to use a combination of different exploration modes on both raw data and insights.

Rick: Data exploration has changed fundamentally with the realization of big data production systems. These systems contain massive volumes of structured and semi-structured data. They touch aspects of our everyday lives and benefit more types of users: from analysts exploring datasets for insight, to dashboard interactors, to consumers of metrics and alerting. The scope of addressable data, the different types of analyses, and the number of users interacting with that data have all grown enormously. Of course, one cannot discuss this topic without including artificial intelligence. So much differentiated data requires the exploratory, recommendatory, decision-making, and natural language communication tools in the machine learning toolkit. Query processing itself can certainly benefit from machine learning, as there are decision and modeling problems inherent in both query optimization and evaluation that have been solved heuristically or in a one-size-fits-all manner.

Gautam: The data exploration field has changed in several fundamental ways in the last decade. The data itself is very different, and no longer just relational tables or a document corpus. With the ease of data collection via sensors (e.g., ubiquitous smartphones), it is much larger and more varied. We now have to contend with graph data, such as semantic-web and other RDF data graphs, entity graphs, and social network data. Moreover, entity search has become popular – the unit of information to be retrieved is not just a row of a relational table or a document from a corpus, but “entities” that need to be identified and retrieved from the join of multiple data sources. Secondly, data, especially online content, is no longer the purview of large organizations. Creating content on the web, and consuming such content created by others, has become widespread – a significant part of the population now frequently tweets, blogs, interacts with others on the web, and most importantly gives opinions and feedback on existing content created by other users, thanks to online collaborative websites like Amazon, Yelp, Flickr, YouTube, etc., as well as social networking sites such as Twitter, Facebook, etc. This deluge of (almost) fully democratized user-content interaction data has led to the development of machine learning techniques to aid in the data exploration process, resulting in exciting applications such as new-generation recommendation systems. Overall, the Big Data era has witnessed increased awareness of the importance of data exploration, and this has extended beyond academia, with several companies/startups focusing on these problems. Big Data has also forced researchers to re-investigate old problems under new settings, and this has led to the rise of sub-linear/sampling algorithms, as well as a renaissance in the design of new data structures for purely exploratory purposes (e.g., lots of research in speeding up IR queries).

Yanlei: Data exploration has changed in the last 10 years due to two trends: (1) the fast growth of data has created an increasing gap between the amount of data available and the limited human ability to comprehend it; (2) concurrent with this growth, there is a second trend called “democratization of data”, where everyday users can be both contributors and consumers of data. This further means that we need data management tools that make it easy for everyday users to explore large datasets.

Stratos: I think Big Data is quickly bringing exploration to the front line as a fundamental paradigm. There are a couple of turning points. First, people in businesses understood that they could make money out of data; data hides insights that can be critical to making decisions about anything that has to do with a business. Second, it became effortless to create and store massive amounts of data. This is because today data is generated automatically from sensors, storage prices have dropped, and there is no substantial effort or expertise needed to maintain massive clusters of machines that store the data; all one needs is a cloud account. This leads to a situation where we collect data about anything that might make sense, and then we have both the need and the opportunity to derive insights from this data.

Q3: How fundamentally different is data exploration from traditional query answering? Is there space for unification?

Sihem: Query answering has traditionally relied on users with a precise knowledge of their data, i.e., content and schema, a precise knowledge of their need and intent, and on their ability to formulate queries. It has since evolved into applications where schemas and needs are only partially known, and researchers developed approaches such as query relaxation and why-not queries that introduce flexibility into traditional querying. In contrast, data exploration is founded on the assumption that users do not precisely know what they are looking for, let alone the content and schemas of the underlying datasets. It is hence by nature an iterative, inquisitive process in which users probe the database in search of answers. One of the mechanisms for investigating answers could be queries, and by-query exploration is one of many data exploration modes. An additional difference between querying and exploration is that while in querying and mining users have largely been assumed to be alike, data exploration integrates user roles into its design. The development of different exploration modes, e.g., by-analytics or by-query, reflects the distinction between users in technical and domain expertise. In other terms, the design of exploration strategies is intertwined with assumptions about who the target user is.

With that in mind, the unification of traditional query answering and data exploration must rely on a better understanding of users and their usage patterns. In querying, this “understanding” has been interpreted as building user profiles once and for all, and personalizing query results. Unifying querying and exploration can be done in different ways: adapting query-based exploration to the different user roles assumed in exploration, or developing personalized exploration models that build user profiles on the fly based on previous iterations. Both proposals require an understanding of personalized querying and exploration modes.

Rick: Visual human-in-the-loop data exploration is about facilitating the intuitive flow of discovery and questioning. To that end, the user’s interface to the data must be immediately reactive, while every action often results in at least one query if not many queries in coordination. Maintaining interactive performance is paramount, requiring the consideration of the entire line of user thinking/querying so far, anticipating what they might do next, and suggesting what they might have missed. If data exploration is a journey that involves asking and answering questions, then data exploration and query answering are directly connected: there is not one without the other. In this world, the metadata is as important to the exploration process as the data itself. For example, how are these data sources, tables/containers, columns/fields, and data values related to each other? What is their true meaning? Such information is necessary in order to frame and expand the user’s thinking.

Gautam: Traditional query processing assumes a sophisticated user who is aware of the data repository that contains the information she is seeking, in particular its structure and metadata organization, is confident that the information she is seeking is confined within that repository, and is familiar with the (often complex) query language for retrieving the information from the repository. Data exploration is quite different. The user is assumed to be relatively “naïve” – while she may be an expert in the specific domain (e.g., a social scientist trying to search for information on social network data), she may not know how the data repository is organized, nor even whether the information she is seeking is confined to any one particular repository. Moreover, it is unrealistic to expect her to be familiar with complex query languages to articulate precisely what she is looking for. Information retrieval systems for such users need to consider the added challenges of having to interact with the user to understand more thoroughly what she is looking for, and then to search for it in (possibly more than one) data source. Unification is important, as in practice both types of users are going to access the same data sources, and it is not practical to develop two different information retrieval systems, each catering to one type of user. Some of our early work on keyword search in databases addresses this issue, where we tried to leverage existing SQL engines to also offer keyword search functionality over relational database systems.

Yanlei: Given the above two trends, we envision that besides SQL, alternative interfaces between users and databases will play an increasingly important role in system-aided data exploration. Natural language queries are one such interface, and we have seen a lot of good work on this topic in recent years. Explore-by-example is another interface that may prove helpful in a number of application domains. There is certainly space for integration, where many of the techniques developed for query processing and optimization can be leveraged and improved upon to support the new querying and exploration interfaces.

Stratos: In traditional processing, we typically have a workload pattern we are trying to support, and so we can design and tune the system around this workload. In a data exploration scenario, on the other hand, every query depends on the previous one, and so there is no particular pattern; instead, the pattern emerges as the exploration process evolves. Exploration is about the sequence of steps until we find what we are looking for, while in traditional processing we start already knowing what we are looking for and we get the answer in a single step/query. In traditional processing, an excellent solution makes this single step fast. In data exploration, an excellent solution makes the whole sequence of steps fast. Thus, we could say that they are drastically different paradigms, but it is better to think of data exploration as a generalization of traditional processing. A great exploration setting is a system that performs on its own a given set of queries (e.g., queries that are always interesting to resolve), presents the results in an easy-to-consume way, and then enables users to continue exploring with ad-hoc queries. This is, in fact, a unified scenario where we need both paradigms.

Q4: What are the main application domains in which you see data exploration playing a significant role?

Sihem: Application domains of data exploration are virtually endless but the ones that are most promising are those with a diverse set of users and the ability to gather their usage patterns.

Rick: For me, the most exciting application domain to benefit from modern data exploration is that of improving the human condition. Whether it is understanding the individual, society, or the environment, the exploration of data in this context is a powerful and enlightening journey. We now have access to so much more data that enables data exploration and analysis in service of humanity’s fundamental needs. For example, assessing water quality using satellite imagery or predicting illness by scanning blood vessels. Not only do we need experts to analyze and assess correlations, but we need to enable exploration of the analytic results by the public to facilitate discussion and inform policy. So, a key part of expanding the human benefit of data exploration is extending access and capability to more of the world, exploring and presenting data in ways that increase its utility for data consumers, developers, and explorers.

Gautam: As the saying goes, the modern world’s most valuable resource is not oil, but data. We are awash with data, and our data repositories are increasing in size and complexity at an exponential rate. Many exciting applications rely on being able to extract information, or “value” from the data. I will describe two application domains that are especially fascinating to me. For most users, their smartphones and tablets are the main access points to data repositories. Applications that can perform effective data exploration for such users can become significant game changers (Cortana, Siri?). However, the challenges are formidable. What does data exploration mean in the context of smartphones and tablets? How does one overcome the limitations of screen size and processing power? How do we interact via touch interfaces with a data exploration system? We have investigated this problem in the context of building mobile-friendly apps for accessing deep web data sources, using faceted search over a native top-k query interface. Secondly, we are seeing increased use of machine learning in data exploration. But is the reverse also possible, i.e., can data exploration help in machine learning and data science? Data science is rapidly emerging as a human-intensive workflow in which the analyst has to clean, annotate, label the data, and then iterate over engineering different features, and building ML models for different slices of the data. How can data exploration help in ML tasks such as feature engineering? How can it help in identifying interesting data segments in which the ML models predict surprising outcomes?

Yanlei: There are many application domains for data exploration, ranging from scientific applications to web applications backed by large databases. In the scientific domain, for instance, we are currently talking to biomedical researchers who want to explore large patient databases together with mutation databases. For everyday users of web applications, data exploration can enable them to find relevant information, e.g., houses or laptops that they would consider buying, with much reduced effort.

Stratos: I think it is going to be everywhere. Certainly analyzing all kinds of logs in businesses is a major application. Another area where the data exploration paradigm is going to be extremely interesting is applications where decisions lead to expensive actions and where collecting data is expensive and slow. For example, consider oil drilling or mineral exploration; picking the right area to search has to be a very carefully balanced decision, as it costs millions to perform those actions, and even collecting the data so that we can make those decisions is not a straightforward process.

Q5: What are the next big challenges in data exploration according to you?

Sihem: To acquire a better understanding of the next frontier in data exploration, it is necessary to gather exploration logs from users with different roles. Behavioral analytics, and more particularly cohort analysis, can help determine who looks for what and which exploration features are most appealing. That will result in “exploration profiles” that will serve as a basis for new research on the incorporation of human factors into Big Data stacks. That would also help better structure and address the question of validating one mode of exploration or another. Today, we design new exploration primitives, deploy them and then ask the question of how to evaluate them and for which tasks. Ideally, we would like to have a better integration of deployment and evaluation and a feedback loop. A key challenge is to seamlessly integrate exploring raw data and insights and leverage feedback to switch from one exploration mode to another.

Rick: Improving the accessibility of data exploration and the consumability of its results is a big challenge today and in the future. As the scope of data and the sophistication of data analysis increase, such as the application of machine learning to web data, we need improved usability, and at the same time we need to know that the conclusions are in some sense valid and unbiased. Ideally, the answers to my queries need to include a measure of uncertainty and a way to understand the lineage of how this answer (conclusion) was reached. Another big challenge, which is also not new, is that of data semantics and integration. As the world becomes more and more data driven, how is one piece of data related to another in a different dataset, from a different process, in a different context, from another part of the world? In fact, I need help discovering the data in the first place, I need help knowing what questions may be asked, and I need help interpreting the answers. A deep and holistic understanding of the provenance and semantics of data and queries will enable the formulation of better questions, the computation of better answers, and a better understanding of the meaning of those answers. Automated insights into the data, recommendations to join additional data, and machine learning models should be pre-computed and presented to the data explorer in an intuitive manner. Alongside the ability to understand the relationships of datasets, schema, and the data itself, data provenance (including the provenance of questions asked, i.e., the provenance of queries) and data lineage at its finest level will play an important role.

Gautam: As the use of machine learning in data exploration increases, I see several interesting challenges for the future. In the case of data exploration over online content, a complicated issue that needs to be addressed is, should these ML techniques optimize for the user, or for the content provider, or for the content hosting site? Often, all three are different – e.g., buyers and sellers on Craigslist are often in competition with each other. Should this be modeled in game-theoretic terms? Even for the user, the optimization goals can differ. Should we retrieve based on diversity, novelty/surprise, or fairness? Is it possible to design neuroscience-inspired data exploration paradigms? Data privacy is an important issue that needs to be addressed in future data exploration algorithms. On one hand, user-content interaction data can be leveraged by ML models to build better data exploration algorithms. On the other hand, increased data privacy and access control mechanisms will eventually force DE systems to only partially track and log user interactions. Can effective data exploration be performed over such incomplete and potentially biased data? The use of crowdsourcing in data exploration tasks is intriguing. The wisdom of crowds has been used in many citizen science projects, e.g., the DARPA Red Balloon Challenge. Can crowd workers be effectively used in data exploration? Recent work on the use of crowdsourcing in information retrieval is promising. Finally, I have already mentioned smartphone assistants and data science. There will be other applications; the challenges of exploring and leveraging the information stored in large, heterogeneous, and unfathomable datasets are here to stay.

Yanlei: In the context of “explore by example”, I see big challenges coming from the need to explore mixed data types including structured data, text data, and perhaps image data. In addition, how to minimize the user effort in the data exploration process while achieving scalability in data size and dimensionality is another major issue.

Stratos: The next couple of steps should include: 1) developing benchmarks, 2) HCI/data system co-design, and 3) systems that automatically perform exploratory actions with help from machine learning. I am personally excited these days about an orthogonal direction: turning fundamental computer science problems into exploration problems to accelerate developer and researcher productivity and to enable systems with deep adaptivity capabilities. We started experimenting with this concept in a project we call the Data Calculator; it is an engine that allows interactive exploration of the design space of data structures. As in standard data exploration scenarios, the first step is collecting all the data we can. In this case, this means “mapping the design space” of data structure design, i.e., being able to systematically compute all possible data structure designs one could ever generate, and what their performance properties would be for a specific workload and hardware context. Once this is there, the process of designing a new data structure or understanding the impact of new hardware and workloads becomes a data exploration task, i.e., repeatedly querying this massive design space until we find a design that works best (or well enough). As in standard data exploration scenarios, this becomes a sequence of steps, i.e., a series of queries to the system where every query depends on the previous one; the challenge is to speed up the whole sequence so we can reach insights as fast as possible, eventually even automatically.

About the participants:

Sihem Amer-Yahia is a CNRS Research Director in Grenoble where she leads the SLIDE team. Her interests are at the intersection of large-scale data management and data analytics. Before joining CNRS, she held positions at QCRI, Yahoo! Research and AT&T. Sihem served on the SIGMOD Executive Board, the VLDB Endowment, and the EDBT Board. She is Editor-in-Chief of the VLDB Journal and is chairing VLDB 2018, WWW 2018 tutorials and ICDE 2019 tutorials.

Rick Cole is a Senior Research Scientist at Tableau. His research concerns the processing of queries in Tableau’s heterogeneous federated data ecosystem. Recent projects include error analysis during data prep using fine-grained data lineage, data lineage for text processing, visualizing and interacting with queries, estimating join cardinality using data sketches, and applying reinforcement learning to query optimization. Rick earned his PhD from the University of Colorado at Boulder, where he was a member of the Volcano project for research into efficient, extensible tools for query processing. His research explored the optimization of dynamic query evaluation plans for robust query performance. Before joining Tableau, Rick co-founded Bright Vine, a data integration startup for cooperative analytics in diverse big data ecosystems. Previously he was at ParAccel, where he led development of a new query optimization framework and query optimizers, as well as a new extensibility framework for ParAccel’s parallel, columnar data engine. Prior to ParAccel, he was a technical leader at IBM, Informix Software, and Red Brick Systems.

Gautam Das is the Distinguished University Chair Professor in the Computer Science and Engineering Department and Research Head of the Database Exploration Laboratory (DBXLab) of the University of Texas at Arlington (UTA). Prior to UTA, Dr. Das held positions at Microsoft Research, Compaq Corporation and the University of Memphis, as well as visiting positions at IBM Research and the Qatar Computing Research Institute. He graduated with a BTech in computer science from IIT Kanpur, India, and with a PhD in computer science from the University of Wisconsin-Madison. Dr. Das's research interests span data mining, information retrieval, databases, approximate query processing, applied graph and network algorithms, and computational geometry. He is a recipient of the IEEE ICDE "test of time" Influential Paper Award in 2012. Dr. Das is on the Editorial Boards of ACM TODS and IEEE TKDE, has served as the General Chair of the flagship SIGMOD 2018 conference, as well as ICIT 2009, Program Chair of COMAD 2008 and ICDE DBRank 2007, Best Paper Awards Chair of KDD 2006, Best Paper Awards committee member of DASFAA 2008, and Program Chair of ICIT 2004.

Yanlei Diao joined Ecole Polytechnique in France as Professor of Computer Science in 2015. She is also a tenured professor at the University of Massachusetts Amherst, USA. Her research interests lie in database systems and big data analytics, with a focus on big and fast data analytics, data streams and mining, interactive data exploration, genome data analysis, and uncertain data management. She received her PhD in Computer Science from the University of California, Berkeley in 2005. She is Editor-in-Chief of the ACM SIGMOD Record, Associate Editor of ACM TODS, Chair of the ACM SIGMOD Research Highlight Award Committee, member of the SIGMOD and PVLDB Executive Committees, and member of the SIGMOD Software Systems Award Committee. In the past, she has served on the organizing committees of SIGMOD, PVLDB, and CIDR, as well as on the program committees of many international conferences and workshops.

Stratos Idreos is an assistant professor of Computer Science at Harvard University where he leads DASlab, the Data Systems Laboratory at Harvard SEAS. Stratos works on data system architectures with emphasis on how we can make it easy to design efficient data systems as applications and hardware keep evolving. For his doctoral work on Database Cracking, Stratos won the 2011 ACM SIGMOD Jim Gray Doctoral Dissertation award and the 2011 ERCIM Cor Baayen award. He is also a recipient of an IBM zEnterprise System Recognition Award, a VLDB Challenges and Visions best paper award and an NSF CAREER award. In 2015 he was awarded the Rising Star Award from the IEEE Technical Committee on Data Engineering for his work on adaptive data systems.

Blogger Profiles

Melanie Herschel is a Professor for Data Engineering at the University of Stuttgart, Germany. Her research interests are in data exploration and analysis, data provenance and transparency, data wrangling, and optimizations for fast and user-friendly complex data processing. Her research results have been published at renowned conferences of the data management field and she has participated in several nationally funded research projects. In 2017, she received an IBM Faculty Award. Melanie is an associate editor for the VLDB journal, she regularly serves as a reviewer for conferences (including SIGMOD, VLDB, ICDE) and journals (e.g., VLDB Journal, IEEE TKDE, ACM JDIQ), and she participates in the organization of international events (recently, as SIGMOD 2016 publicity chair and EDBT 2019 PC chair, among others).

Yannis Velegrakis is a professor at the University of Trento, where he leads the Data Management Group and coordinates the EIT Digital MSc Program. His areas of expertise include Big Data Understanding, Social Data Analysis, Highly Heterogeneous Information Integration, User-centric Querying Techniques, Graph Management, and Data Quality. He holds a PhD degree from the University of Toronto. Before joining the University of Trento, he was a researcher at the AT&T Research Labs. He has spent time as a visitor at the IBM Almaden Research Center, the University of California, Santa Cruz, and the University of Paris-Saclay. He is an active member of the database community and has been the general chair of VLDB 2013. He has also been a recipient of a Marie Curie fellowship and of a Université Paris-Saclay Jean D'Alembert fellowship.

Copyright © 2018, Melanie Herschel and Yannis Velegrakis. All rights reserved.