March 22, 2018
Gautam: About two decades ago, relational databases and information retrieval were relatively independent fields, with their own communities of researchers and practitioners. The world of relational databases was black-and-white; data lived in a structured home, accessed by precise query languages such as SQL supporting a Boolean search and retrieval model. Information retrieval was a fuzzier world; data was unstructured, and there were no complex query languages – keyword search and relevance-based ranked retrieval ruled. Recognizing the need for tighter integration/collaboration between the two fields, several researchers started to investigate common approaches to the problems in both areas. My own research has focused on developing IR search paradigms for relational databases, and my early work was on keyword search (DBXplorer project), automated ranking, and top-k querying in relational databases. More recently, I have been working in three data exploration areas. I have investigated faceted search techniques on structured as well as semi-structured data repositories (such as Wikipedia); such approaches enable the user to explore the data along different, relatively independent dimensions or facets. I have also become very interested in the empty-answers and many-answers problems that naïve users often encounter in data exploration, since they often do not have a complete idea of what they are looking for: they may over-specify the items of interest, and find no item in the source satisfying all the provided conditions, or they may under-specify the items of interest, and find too many items satisfying the given conditions. Our recent efforts have been focused on developing iterative query reformulation techniques by which the system guides the user in a systematic way through several small steps, where each step suggests slight query modifications, until the query reaches a form that generates desirable answers.
Finally, I have been investigating regret-ratio and rank-regret problems. Given a dataset with multiple attributes, skylines and convex hulls are subsets that are guaranteed to contain the top choice of any monotonic or linear ranking function. However, a major issue with such subsets is that they can be a significant portion of the dataset, especially when the data has many features or dimensions. One compelling approximation technique is to define the notion of “regret”, where the objective is to find a very small subset of the dataset, such that the top item in this subset has a score or rank within a user-defined “regret” of the top item of the full dataset, no matter what ranking function is used. While conventional data summarization is based on preserving the overall data distribution, I find this type of summarization, where the objective is to pick a subset that preserves the distribution of the data “at the extremities”, very compelling and novel.
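To make the notion of regret concrete, here is a minimal Python sketch that estimates the maximum regret ratio of a candidate subset. It is only an illustration under stated assumptions, not the algorithms from the research described above: attribute values are assumed non-negative with larger being better, ranking functions are restricted to linear functions with non-negative weights, and the worst case is approximated by random sampling rather than computed exactly. The function names and the toy data are hypothetical.

```python
import random

def score(item, weights):
    """Linear ranking function: weighted sum of the item's attribute values."""
    return sum(w * x for w, x in zip(weights, item))

def regret_ratio(data, subset, weights):
    """Relative loss in the top score when only `subset` is available."""
    best_all = max(score(p, weights) for p in data)
    best_sub = max(score(p, weights) for p in subset)
    if best_all <= 0:  # degenerate ranking function (assumed non-negative data)
        return 0.0
    return (best_all - best_sub) / best_all

def max_regret_ratio(data, subset, num_functions=1000, seed=0):
    """Estimate the worst-case regret ratio over randomly sampled linear functions."""
    rng = random.Random(seed)
    dims = len(data[0])
    worst = 0.0
    for _ in range(num_functions):
        weights = [rng.random() for _ in range(dims)]
        worst = max(worst, regret_ratio(data, subset, weights))
    return worst

# Toy dataset (hypothetical): two attributes, larger values are better.
data = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7), (0.5, 0.5)]
subset = [(0.7, 0.7)]

extreme = regret_ratio(data, subset, (1.0, 0.0))  # ranking by first attribute only
estimate = max_regret_ratio(data, subset)
```

In this toy dataset the single balanced item loses the most when the ranking function looks at one attribute only, which is the sense in which a small subset can stay close to the top choice of every ranking function without containing the whole skyline.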
Rick: Data exploration has changed fundamentally with the realization of big data production systems. These systems contain massive volumes of structured and semi-structured data. They include aspects of our everyday lives and benefit more types of users, from analysts exploring datasets for insight to dashboard interactors and consumers of metrics and alerting. The scope of addressable data, the different types of analyses, and the number of users interacting with that data have all grown enormously. Of course, one cannot discuss this topic without including artificial intelligence. So much differentiated data requires the exploratory, recommendatory, decision-making, and natural language communication tools in the machine learning toolkit. Query processing itself can certainly benefit from machine learning, as there are decision and modeling problems inherent in both query optimization and evaluation that have been solved heuristically or as one-size-fits-all.
Gautam: The data exploration field has changed in several fundamental ways in the last decade. The data itself is very different, and no longer just relational tables or a document corpus. With ease of data collection via sensors (e.g., ubiquitous smart phones), it is much larger and more varied. We now have to contend with graph data, such as semantic web and other data graphs on RDF, entity graphs, and social network data. Moreover, entity search has become popular – the unit of information to be retrieved is not just a row of a relational table or a document from a corpus, but “entities” that need to be identified and retrieved from the join of multiple data sources. Secondly, data, especially online content, is no longer the purview of large organizations. Creating content on the web, and consuming such content created by others, has become widespread – a significant part of the population now frequently tweets, blogs, interacts with others on the web, and most importantly gives opinions and feedback on existing content created by other users, thanks to online collaborative websites like Amazon, Yelp, Flickr, YouTube, etc., as well as social networking sites such as Twitter, Facebook, etc. This deluge of (almost) fully democratized user-content interaction data has led to the development of machine learning techniques to aid in the data exploration process, resulting in exciting applications such as new-generation recommendation systems. Overall, the Big Data era has witnessed increased awareness of the importance of data exploration, and this has extended beyond academia, with several companies/startups focusing on these problems. Big Data has also forced researchers to re-investigate old problems under new settings, and this has led to the rise of sub-linear/sampling algorithms, as well as a renaissance in the design of new data structures for purely exploratory purposes (e.g., lots of research in speeding up IR queries).
Gautam: Traditional query processing assumes a sophisticated user who is aware of the data repository that contains the information she is seeking, in particular its structure and metadata organization, is confident that the information she is seeking is confined within that repository, and is familiar with the (often complex) query language for retrieving the information from the repository. Data exploration is quite different. The user is assumed to be relatively “naïve” – while she may be an expert in the specific domain (e.g., a social scientist trying to search for information on social network data), she may not know how the data repository is organized, nor even whether the information she is seeking is confined to any one particular repository. Moreover, it is unrealistic to expect her to be familiar with complex query languages to articulate precisely what she is looking for. Information retrieval systems for such users need to consider the added challenges of having to interact with the user to understand more thoroughly what she is looking for, and then to search for it in possibly more than one data source. Unification is important, as in practice both types of users are going to access the same data sources, and it is not practical to develop two different information retrieval systems, each catering to one type of user. Some of our early work on keyword search in databases addresses this issue, where we tried to leverage existing SQL engines to also offer keyword search functionalities over relational database systems.
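As a rough illustration of the idea of offering keyword search on top of an existing SQL engine, the Python sketch below translates a list of keywords into a parameterized SQL query over SQLite. It is a deliberately simplified, single-table toy and not the approach of the DBXplorer project or any other system mentioned here; the function name `keyword_search`, the `papers` table, and its contents are hypothetical, and the table and column names are assumed to come from trusted metadata rather than from the user.

```python
import sqlite3

def keyword_search(conn, table, text_columns, keywords):
    """Return rows of `table` in which every keyword occurs in some text column."""
    if not keywords:
        return []
    # One clause per keyword: the keyword may match any of the text columns.
    per_keyword = "(" + " OR ".join(f"{col} LIKE ?" for col in text_columns) + ")"
    # All keywords must match (conjunctive keyword semantics).
    where = " AND ".join(per_keyword for _ in keywords)
    params = [f"%{kw}%" for kw in keywords for _ in text_columns]
    # Identifiers are interpolated directly, so they must be trusted;
    # the keywords themselves are passed as bound parameters.
    sql = f"SELECT * FROM {table} WHERE {where}"
    return conn.execute(sql, params).fetchall()

# Toy database (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (title TEXT, venue TEXT)")
conn.executemany(
    "INSERT INTO papers VALUES (?, ?)",
    [
        ("Keyword search in databases", "ICDE"),
        ("Top-k query processing", "VLDB"),
        ("Ranking keyword queries", "VLDB"),
    ],
)

rows = keyword_search(conn, "papers", ["title", "venue"], ["keyword", "vldb"])
```

The user never writes SQL or needs to know which column holds which term; the SQL engine still does all the retrieval. A realistic system would additionally have to locate keywords across multiple tables, join the matching rows, and rank the results.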
Rick Cole is a Senior Research Scientist at Tableau. His research concerns the processing of queries in Tableau’s heterogeneous federated data ecosystem. Recent projects include error analysis during data prep using fine-grained data lineage, data lineage for text processing, visualizing and interacting with queries, estimating join cardinality using data sketches, and applying reinforcement learning to query optimization. Rick earned his PhD from the University of Colorado at Boulder, where he was a member of the Volcano project for research into efficient, extensible tools for query processing. His research explored the optimization of dynamic query evaluation plans for robust query performance. Before joining Tableau, Rick co-founded Bright Vine, a data integration startup for cooperative analytics in diverse big data ecosystems. Previously he was at ParAccel, where he led development of a new query optimization framework and query optimizers, as well as a new extensibility framework for ParAccel’s parallel, columnar data engine. Prior to ParAccel, he was a technical leader at IBM, Informix Software, and Red Brick Systems.
Gautam Das is the Distinguished University Chair Professor in the Computer Science and Engineering Department and Research Head of the Database Exploration Laboratory (DBXLab) of the University of Texas at Arlington (UTA). Prior to UTA, Dr. Das held positions at Microsoft Research, Compaq Corporation and the University of Memphis, as well as visiting positions at IBM Research and the Qatar Computing Research Institute. He graduated with a BTech in computer science from IIT Kanpur, India, and with a PhD in computer science from the University of Wisconsin-Madison. Dr. Das’s research interests span data mining, information retrieval, databases, approximate query processing, applied graph and network algorithms, and computational geometry. He is a recipient of the IEEE ICDE “test of time” Influential Paper Award in 2012. Dr. Das is on the Editorial Boards of ACM TODS and IEEE TKDE, and has served as General Chair of the flagship SIGMOD 2018 conference as well as ICIT 2009, Program Chair of COMAD 2008 and ICDE DBRank 2007, Best Paper Awards Chair of KDD 2006, Best Paper Awards Committee member for DASFAA 2008, and Program Chair of ICIT 2004.
Yanlei Diao joined Ecole Polytechnique in France as Professor of Computer Science in 2015. She is also a tenured professor at the University of Massachusetts Amherst, USA. Her research interests lie in database systems and big data analytics, with a focus on big and fast data analytics, data streams and mining, interactive data exploration, genome data analysis, and uncertain data management. She received her PhD in Computer Science from the University of California, Berkeley in 2005. She is Editor-in-Chief of the ACM SIGMOD Record, Associate Editor of ACM TODS, Chair of the ACM SIGMOD Research Highlight Award Committee, member of the SIGMOD and PVLDB Executive Committees, and member of the SIGMOD Software Systems Award Committee. In the past, she has served on the organizing committees of SIGMOD, PVLDB, and CIDR, as well as on the program committees of many international conferences and workshops.
Stratos Idreos is an assistant professor of Computer Science at Harvard University, where he leads DASlab, the Data Systems Laboratory@Harvard SEAS. Stratos works on data system architectures with emphasis on how we can make it easy to design efficient data systems as applications and hardware keep evolving. For his doctoral work on Database Cracking, Stratos won the 2011 ACM SIGMOD Jim Gray Doctoral Dissertation award and the 2011 ERCIM Cor Baayen award. He is also a recipient of an IBM zEnterprise System Recognition Award, a VLDB Challenges and Visions best paper award and an NSF Career award. In 2015 he was awarded the Rising Star Award from the IEEE Technical Committee on Data Engineering for his work on adaptive data systems.
Yannis Velegrakis is a professor at the University of Trento, where he leads the Data Management Group and coordinates the EIT Digital MSc Program. His areas of expertise include Big Data Understanding, Social Data Analysis, Highly Heterogeneous Information Integration, User-centric Querying Techniques, Graph Management, and Data Quality. He holds a PhD degree from the University of Toronto. Before joining the University of Trento, he was a researcher at the AT&T Research Labs. He has spent time as a visitor at the IBM Almaden Research Centre, the University of California, Santa Cruz, and the University of Paris-Saclay. He is an active member of the database community and was the general chair of VLDB 2013. He has also been a recipient of a Marie Curie fellowship and of a Université Paris-Saclay Jean D’Alembert.
Copyright © 2018, Melanie Herschel and Yannis Velegrakis. All rights reserved.