The first step that we took towards addressing the “too many visualizations” problem was to design a tool that can accelerate the search for desired visual patterns.
Our tool, Zenvisage, provides users with a canvas where they can sketch or drag and drop a desired pattern, with the ability to tailor the matching criteria or the space of visualizations. In Figure 4, we show Zenvisage being used to traverse the space of sold price per sqft trends over months for various cities in the United States. The visual “canvas” is shown in the center top panel. You can sketch a pattern in the canvas, like the linearly increasing one shown in Figure 4, or drag and drop any visualization into the canvas. Matches are displayed below the canvas, starting with a city called “The Village”. Representative patterns, such as Greater Carrollwood, and outliers, such as Temperance, are displayed in the right-hand side panels to contextualize the search. The system also provides a number of knobs to fine-tune the search, including changing the similarity criteria, smoothing, or filtering. In our recent VLDB 2018 paper, we developed lightweight approximation schemes to make this search efficient for arbitrary collections of visualizations.
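To make the idea of matching a sketched pattern concrete, here is a minimal Python sketch of shape-based ranking. It is not Zenvisage's implementation and does not use the approximation schemes from the paper: it assumes every trend has the same number of points, uses z-normalization plus Euclidean distance as one possible similarity criterion, and the city names and prices are made-up toy data.

```python
import math


def z_normalize(ys):
    """Rescale a series to zero mean and unit variance so that shapes are comparable."""
    mean = sum(ys) / len(ys)
    std = math.sqrt(sum((y - mean) ** 2 for y in ys) / len(ys))
    if std == 0:
        return [0.0] * len(ys)  # a flat line has no shape to compare
    return [(y - mean) / std for y in ys]


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_matches(sketch, trends, k=3):
    """Return the names of the k trends whose shape is closest to the sketch."""
    target = z_normalize(sketch)
    scored = [(euclidean(target, z_normalize(ys)), name) for name, ys in trends.items()]
    return [name for _, name in sorted(scored)[:k]]


# Hypothetical toy data: sold price per sqft by month for three made-up cities.
trends = {
    "city_a": [100, 110, 120, 130, 140],  # steadily rising
    "city_b": [140, 130, 120, 110, 100],  # steadily falling
    "city_c": [100, 105, 125, 128, 141],  # roughly rising
}
sketch = [0, 1, 2, 3, 4]  # the user draws a straight upward line

print(rank_matches(sketch, trends, k=2))
```

Swapping `euclidean` for another distance function, or smoothing each series before normalizing, corresponds loosely to the knobs for changing the similarity criteria and smoothing mentioned above.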
Figure 4: The basic Zenvisage Interface. Attribute selection on the left, representative patterns and outliers on the right, with a canvas in the center top, where a user has drawn an increasing trend, with results displayed below.
For more advanced needs, Zenvisage provides a query language called ZQL (Zenvisage Query Language), described in our VLDB 2017 paper, that allows users to execute multi-step exploration workflows. The top and bottom of Figure 5 display the US states where the sold price by year trend is most similar to and most different from the sold price per square foot by year trend, respectively: NV, AZ, and RI are the states where the two visualizations are most similar, and LA, NE, and KY are the states where they are most different. ZQL can also be used to summarize, correlate, drill down, pivot, and filter collections of visualizations. The design of ZQL draws inspiration from Query by Example and from visualization specification languages like VizQL.
Figure 5: The advanced Zenvisage Interface, with ZQL: the states where the soldprice by year and soldprice per square foot by year are most similar (top), and most different (bottom).
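The comparison in Figure 5 can be approximated outside ZQL with a few lines of Python. This is only an illustrative sketch, not ZQL syntax or the Zenvisage engine: it assumes two dictionaries mapping each state to an equal-length yearly series, compares shapes with z-normalized Euclidean distance, and uses placeholder state names and invented numbers.

```python
import math


def z_normalize(ys):
    """Rescale a series to zero mean and unit variance."""
    mean = sum(ys) / len(ys)
    std = math.sqrt(sum((y - mean) ** 2 for y in ys) / len(ys))
    if std == 0:
        return [0.0] * len(ys)
    return [(y - mean) / std for y in ys]


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_states(price_by_year, price_per_sqft_by_year):
    """Order states from most similar to most different pair of trends."""
    scores = {}
    for state in price_by_year.keys() & price_per_sqft_by_year.keys():
        a = z_normalize(price_by_year[state])
        b = z_normalize(price_per_sqft_by_year[state])
        scores[state] = euclidean(a, b)
    # Sort by distance, breaking ties by state name for a stable order.
    return sorted(scores, key=lambda s: (scores[s], s))


# Hypothetical toy data for three placeholder states.
sold_price = {
    "S1": [200, 210, 220, 230],
    "S2": [200, 220, 210, 230],
    "S3": [200, 210, 220, 230],
}
sold_price_per_sqft = {
    "S1": [100, 105, 110, 115],  # same shape as the S1 sold price trend
    "S2": [100, 104, 109, 115],  # loosely similar shape
    "S3": [115, 110, 105, 100],  # opposite shape
}

ranking = rank_states(sold_price, sold_price_per_sqft)
print("most similar:", ranking[0], "| most different:", ranking[-1])
```

A multi-step workflow of this kind (build two collections of visualizations, compare them pairwise, then keep the extremes) is what ZQL lets users express declaratively rather than by hand.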
A challenge with ZQL is how to execute queries efficiently, given the number of visualizations that need to be examined. We’ve developed an optimizer, called SmartFuse, that combines parallelism, batching, and approximation to return results efficiently.
Zenvisage has been used by domain experts in astrophysics, genomics, and battery science. A number of the features that we have added were inspired by these target applications over the course of a year-long participatory design process (see Figure 6, which displays our timeline and the features developed). One lesson that we took from this experience is that visualization needs are extremely domain-specific, requiring customization and fine-tuning for any new application. This experience was crucial in identifying requirements for Zenvisage and ZQL.
Figure 6: Features developed for our domain experts
Another way to tame the “too many visualizations” problem is to have the system provide automatic recommendations of visualizations. These could be visualizations that provide additional context, highlight interesting or unusual trends that are related, or cover attributes or data subsets that are underexplored.
In our basic Zenvisage interface, we provide simple recommendations of typical trends and outliers for the collection of visualizations currently being explored (as described in our CIDR 2017 paper); these are the right-hand side panels of Figures 4 and 5.
Another approach is for users to provide cues indicating which data subsets they would like to see visualization recommendations for. In our SeeDB system (described in papers in VLDB 2014 and VLDB 2016), data scientists indicate a subset of data they are interested in, and the system provides visualization recommendations that highlight differences between that subset and the rest of the data. We found that SeeDB encouraged users to explore twice as many visualizations, and bookmark three times as many visualizations, compared to a tool that doesn’t make recommendations based on data differences.
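One simple way to score such differences is to compare, for each candidate group-by view, the normalized aggregate over the chosen subset against the same aggregate over the rest of the data, and recommend the views that deviate most. The sketch below illustrates that idea only; it is not SeeDB's implementation or its optimizations, and the column names, the SUM aggregate, the Euclidean deviation measure, and the toy rows are all assumptions made for the example.

```python
import math
from collections import defaultdict


def normalized_aggregate(rows, dimension, measure):
    """SUM(measure) grouped by dimension, scaled so that the values sum to 1."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row[measure]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return dict(totals)
    return {group: value / grand_total for group, value in totals.items()}


def deviation(p, q):
    """Euclidean distance between two distributions over possibly different groups."""
    groups = set(p) | set(q)
    return math.sqrt(sum((p.get(g, 0.0) - q.get(g, 0.0)) ** 2 for g in groups))


def recommend(rows, in_subset, dimensions, measures, k=1):
    """Return the k (dimension, measure) views where the subset deviates most from the rest."""
    target = [r for r in rows if in_subset(r)]
    reference = [r for r in rows if not in_subset(r)]
    scored = []
    for d in dimensions:
        for m in measures:
            score = deviation(
                normalized_aggregate(target, d, m),
                normalized_aggregate(reference, d, m),
            )
            scored.append((score, d, m))
    scored.sort(reverse=True)  # largest deviation first
    return [(d, m) for _, d, m in scored[:k]]


# Hypothetical toy data: segment "A" is the subset the analyst cares about.
rows = [
    {"segment": "A", "region": "east", "channel": "web", "sales": 45},
    {"segment": "A", "region": "east", "channel": "store", "sales": 45},
    {"segment": "A", "region": "west", "channel": "web", "sales": 5},
    {"segment": "A", "region": "west", "channel": "store", "sales": 5},
    {"segment": "B", "region": "east", "channel": "web", "sales": 10},
    {"segment": "B", "region": "east", "channel": "store", "sales": 10},
    {"segment": "B", "region": "west", "channel": "web", "sales": 40},
    {"segment": "B", "region": "west", "channel": "store", "sales": 40},
]

top = recommend(rows, lambda r: r["segment"] == "A", ["region", "channel"], ["sales"], k=1)
print(top)
```

In this toy data, segment A sells mostly in the east while the rest of the data sells mostly in the west, whereas the split by channel is identical in both, so the view by region is the one worth showing.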
Other Work and Future Directions
There is a variety of other work on allowing users to search for visual patterns or trends. Work from the data mining community has targeted the efficient search of a fixed set of time series patterns (as opposed to visualization collections constructed on the fly); work on TimeSearcher and Query by Sketch developed interfaces for visual search of time-series patterns. There has been other recent work on visualization recommendation, including Voyager, Datatone, Eviza, Vizdom, Profiler, and Data Polygamy. One issue with visualization recommendation in general is that it can encourage confirmation bias and lead to false discoveries; recent work has outlined several methods inspired by statistics to address this issue.
Despite all this work, the “too many visualizations” problem is still wide open. In a sense, the “too many visualizations” problem is the holy grail of data exploration: how do we find the hidden insights (needles) in a large dataset (haystack)?
This manifests itself in many smaller questions: How do we support visualization recommendations when users do not know what they are looking for? How do we provide them a high-level summary of the dataset—of the attributes and trends? How can we provide recommendations that go beyond surface-level data difference and take into account semantics or user goals? At the same time, how can we build in the safeguards so that we do not fall into the statistical pitfalls that underlie p-hacking? More specifically, for systems like Zenvisage, can we come up with intuitive ways to express complex ZQL queries? How do we support data exploration tasks that go beyond pattern search and retrieval? How do we support immediate feedback and refinement? Moreover, is ZQL really the right layer of abstraction for novice data scientists, or should we be translating down from a different abstraction? What are the right metrics, and how do we evaluate the performance of visualization search and recommendation systems?
Visual data exploration is an important part of information visualization, and a rich source of important and challenging data management problems, which reveal themselves as soon as we view a visualization as a humble SQL query.
While I focused on our work on accelerating comparisons and termination for the “too many tuples” problem, and accelerating search and discovery for the “too many visualizations” problem, there are still a number of interesting open data management questions for both of these problem domains.
At the same time, we must also acknowledge that these questions do go beyond data management: we need to understand how data scientists perceive visualizations and interact with them. Moreover, we also need to understand the ways visualizations are used, the typical exploration patterns, and the end-goals. Not only will this inform our research questions and metrics, it will also help us ensure that what we’re doing meets real-world needs and use-cases.
It is therefore also paramount that we work together with visualization researchers to formalize these research questions, and to conduct user studies that ensure our techniques are useful and usable in practice.
Overall, the data management community has the potential to take on a leading role in building scalable, usable, and powerful tools for data exploration—of which visualization is an integral part. I hope you’ll join us in this exciting journey!
Acknowledgments: A huge shout out to my students and collaborators in making these ideas come to life. In particular, the PhD students involved in the work mentioned in this blog post include Silu Huang, Albert Kim, Doris Lee, Stephen Macke, Sajjadur Rahman, Tarique Siddiqui, and Manasi Vartak. Our work has been done in collaboration with Profs. Karrie Karahalios, Sam Madden, and Ronitt Rubinfeld. Thank you to Eugene Wu for his constructive and detailed feedback on this blog post.
Aditya Parameswaran is an Assistant Professor in Computer Science at the University of Illinois (UIUC), where he has been since August 2014. He completed a yearlong postdoc at MIT CSAIL, following a PhD at Stanford. He develops systems and algorithms for “human-in-the-loop” data analytics. His website is at http://data-people.cs.illinois.edu, and his Twitter handle is adityagp.
Copyright © 2018, Aditya Parameswaran. All rights reserved.