The International Workshop on Big Data Visual Exploration and Analytics (BigVis) is an annual event that brings together scholars from the communities of Data Management & Mining, Information Visualization, Machine Learning, and Human-Computer Interaction. The 7th BigVis event (BigVis 2024) was organized in conjunction with the 50th International Conference on Very Large Data Bases (VLDB 2024) in Guangzhou, China. The organizing committee invited five research experts to share their insights and future research challenges related to Data Exploration and Visual Analytics.
The invited experts are Sihem Amer-Yahia (CNRS, Université Grenoble Alpes), Leilani Battle (University of Washington), Yifan Hu (Northeastern University), Dominik Moritz (Carnegie Mellon University & Apple), and Aditya Parameswaran (University of California, Berkeley). This post presents their responses:
Sihem Amer-Yahia, CNRS, Université Grenoble Alpes, France
Future Challenges
A visualization is a pair of the form (data, visual element). For instance, the data could be a distribution of gender values and the visual element could be a pie chart. A distribution can be mapped to various visual elements using algebras such as Vega-Lite [1]. Different users appreciate different such mappings. It is hence natural to ask: What is the meaning of personalized data visualization when both the data and the mapping of data to visual elements can be personalized? Should they both be personalized? Usually, only the data is. A follow-up question is: Can we learn those visualizations by observing people interacting with data and visual elements? How expressive should this interaction be? How granular should it be? Should we decouple data recommendations from visual recommendations and allow people to provide feedback on them separately and together? Can we do that in a privacy-preserving manner? Finally, since we are talking about users interacting with data and visualizations, how do we ensure sub-second interaction times in personalized visual exploration?
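To make the (data, visual element) framing concrete, here is a minimal sketch in Altair, a Python API that compiles to Vega-Lite [1]. The gender distribution is toy data; the two charts show the same data under two different mappings, either of which could be the personalized choice.

```python
import altair as alt
import pandas as pd

# Toy gender distribution (illustrative data only).
data = pd.DataFrame({
    "gender": ["female", "male", "non-binary"],
    "count": [48, 45, 7],
})

# Mapping 1: the distribution as a pie chart.
pie = alt.Chart(data).mark_arc().encode(
    theta=alt.Theta("count:Q"),
    color=alt.Color("gender:N"),
)

# Mapping 2: the very same data as a bar chart.
bar = alt.Chart(data).mark_bar().encode(
    x=alt.X("gender:N"),
    y=alt.Y("count:Q"),
)

# Personalization could act on the data (which distribution to show)
# and/or on the mapping (which mark and encodings to use).
(pie | bar).save("gender.html")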
These challenges lay the ground for new research that is truly at the crossroads of databases, privacy-preserving machine learning, and declarative visualization. A promising avenue to address all three challenges is to combine the relational algebra with Vega-Lite and extend it to handle in-database Multi-Armed Bandits for online learning [2], and to use in-database federated learning to ensure privacy [3]. The augmented algebra will provide the ability to blend data and visualization recommendations in the same framework. It will open new opportunities to co-optimize the training and inference of ML models while taking user feedback into consideration. This is possible by revisiting the reward according to the data and visual elements most preferred by the user as they explore (data, visual element) pairs in sequence. Speeding up ML inference can be done by materializing views representing the most common exploration pathways and personalizing them on the fly. Finally, federated learning can leverage cryptographic schemes such as homomorphic encryption to compute rewards without revealing data in the clear. That is particularly important when multiple users explore data.
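As a rough illustration of the bandit side of this proposal (setting aside the in-database and federated aspects), the following sketch treats (data, visual element) pairs as arms of an epsilon-greedy Multi-Armed Bandit whose reward is the user's feedback. The arm names are hypothetical.

```python
import random
from collections import defaultdict

# Arms are (data slice, visual element) pairs; rewards come from user feedback.
ARMS = [("gender_distribution", "pie"), ("gender_distribution", "bar"),
        ("age_distribution", "histogram"), ("age_distribution", "boxplot")]

class EpsilonGreedyRecommender:
    """Epsilon-greedy bandit over (data, visual element) pairs (illustrative only)."""

    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = defaultdict(int)    # times each arm was shown
        self.values = defaultdict(float)  # running mean reward per arm

    def recommend(self):
        if random.random() < self.epsilon:              # explore a random pair
            return random.choice(ARMS)
        return max(ARMS, key=lambda a: self.values[a])  # exploit the best so far

    def feedback(self, arm, reward):
        # Incremental mean update from the user's reaction (e.g., 1 = kept, 0 = dismissed).
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

recommender = EpsilonGreedyRecommender()
arm = recommender.recommend()
recommender.feedback(arm, reward=1.0)  # user liked this visualization
```

Decoupling data recommendations from visual recommendations could then be modeled by factoring this single bandit into two, one per dimension of the pair, with separate feedback channels.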
Emerging Applications
The framework outlined above is applicable in a key emerging application: enabling visual exploratory data analysis for users with different levels of data science and domain expertise. We developed Dora The Explorer, an RL-powered data exploration system for astronomers [4], and DashBot [5], an adaptive MAB-powered system for medical professionals. The astronomers we worked with were well-versed in data science and preferred partially guided approaches where they could intervene and change the automatic decisions made by the machine-learned models. The doctors we worked with demanded to validate the patient analytics that were recommended to them; their validations were fed back into the system to learn which analytics to provide next. Enabling this conversation between automated data exploration and people with different expertise levels is the next frontier in personalized visual data exploration.
Leilani Battle, University of Washington, USA
How do we provide the “right” guidance?
With the rise of AI assistants in data science [9], we are seeing more and more auto-generated recommendations for what analysts should explore next in their data [13]. However, how do we know that these assistants are providing the right guidance [10]? For example, current techniques often produce generic recommendations with limited benefits over standard baselines [13][6], which may be due in part to the underlying algorithms and models lacking domain context when generating recommendations [6]. Further, certain analysts may be too trusting of auto-generated recommendations regardless of rigor/quality [12]. Given their potential influence over human analysts, how can we ensure that our recommendations avoid introducing bias or causing harm? We need new methods for incorporating domain context into the underlying algorithms and AI models as well as benchmarks for validating their outputs. Furthermore, we need a deeper understanding of how auto-generated recommendations influence the reasoning of human analysts [9]. As part of this effort, we encourage the community to expand and strengthen collaborations across data science (e.g., among HCI, visualization, and data management researchers) as well as with other fields such as psychology and cognitive science.
How do we optimize interoperable components within larger data science ecosystems?
Computational notebooks are integral to collaborative data science work given their flexibility and ease of sharing with others [11][14]. However, notebook environments like JupyterLab can be challenging to design for [8]. Further, the data management and data visualization communities are used to building systems they control [7]; notebooks contradict that assumption. Beyond the known challenges of working within an existing development environment, we observe few projects that profile or optimize the performance of data management/visualization tools that run alongside notebooks. For example, the user drives the workload that big data researchers intend to optimize, but the user's interactions with a DBMS may be influenced by other tools and packages they have also imported into the underlying notebook. Thus, we must investigate how this blending of tools may alter the typical data processing workloads we see in other big data contexts. Otherwise, our optimizations may miss the full context of how a tool is being used and thus fail to address the user's performance concerns.
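As a small illustration of the kind of in-notebook profiling called for here, the sketch below wraps DuckDB queries in a latency log, the sort of instrumentation a visualization tool embedded in a notebook could use to observe the workload the user actually generates. The table, file, and column names are hypothetical.

```python
import time
import duckdb

con = duckdb.connect()
# Hypothetical dataset loaded into the notebook session.
con.execute("CREATE TABLE trips AS SELECT * FROM 'trips.parquet'")

latencies = []

def timed_query(sql):
    """Run a query and log its latency, as an embedded visualization tool might."""
    start = time.perf_counter()
    result = con.execute(sql).fetchall()
    latencies.append((sql, time.perf_counter() - start))
    return result

# A user-driven workload, possibly interleaved with other imported tools.
timed_query("SELECT pickup_hour, count(*) FROM trips GROUP BY pickup_hour")
timed_query("SELECT avg(fare) FROM trips WHERE pickup_hour = 17")

for sql, secs in latencies:
    print(f"{secs * 1000:8.1f} ms  {sql}")
```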
Yifan Hu, Northeastern University, USA
Automatically learning to generate aesthetically pleasing network visualizations from human examples
Information visualization enables the graphical representation of data and information, helping users find patterns, trends, and relationships within the data, and communicate complex data sets clearly and effectively. In the field of network visualization, the layout of undirected graphs has traditionally relied on heuristics, especially ones that model graphs as physical systems, in the belief that minimizing the energy of such systems leads to aesthetically pleasing layouts that help illustrate the underlying unstructured data. There are two problems with this approach. First, traditional force-directed algorithms [15][16][17] can have high complexity and may not find the optimal solution. Second, and more importantly, it is not proven that human aesthetic preference can be modeled well by physical systems in all cases.

With the advent of deep learning, there is growing interest in leveraging the power of neural networks for network visualization (e.g., [18][19]). In particular, it is now possible to use deep learning to expand the horizon of traditional graph visualization in a number of directions. First, it was demonstrated that a graph neural network (GNN) can be trained to optimize arbitrary differentiable objective functions. Once trained, such a model can be applied to graphs never seen in the training data and create visualizations that optimize the objective function even better than traditional force-directed algorithms [20][21][22][23]. Second, it was shown that, using a Generative Adversarial Network (GAN), we can even train neural networks to optimize non-smooth objective functions [24]. In fact, this approach does not require access to the objective function at all; it only requires a comparator that, given two visualizations, chooses the "better" one (based on a hidden objective function unknown to the GAN).

This opens avenues for future investigation: instead of optimizing the energy of a physical system, or other hand-picked objectives such as edge crossings, we should be able to model the human sense of aesthetics directly. For example, if reducing edge crossings is what matters most to humans, Tiezzi et al. [21] demonstrated that a fully connected network can be used to classify whether two edges cross; Wang et al. [24] showed that the discriminator in a GAN setup can be trained to choose drawings with fewer edge crossings, "taught" only with pairs of good and bad drawing examples. The holy grail is to train a model on a collection of example visualizations to learn human visual preference. We believe such a system is possible, but the challenges are to scale GNN-, GAN-, or other deep learning-based models to very large unstructured data, and to collect large amounts of human preference examples. We have already witnessed some progress on scalability, e.g., through the multilevel paradigm [25][26][27]. Further work is needed to curate large amounts of training data, to make deep learning-based network visualization adapt to human preferences and incorporate human-specified constraints, and to make it run even faster than traditional network layout algorithms.
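To ground the discussion, here is a minimal sketch of the kind of differentiable layout objective such neural approaches optimize: stress, minimized here by plain gradient descent over node positions (NumPy and NetworkX). The graph, step size, and iteration count are illustrative only; a trained GNN would replace this per-graph optimization loop with a learned model.

```python
import numpy as np
import networkx as nx

# Stress = sum over node pairs of w_ij * (||x_i - x_j|| - d_ij)^2, where d_ij is
# the graph-theoretic distance and w_ij = 1/d_ij^2 (a classic layout objective).
G = nx.cycle_graph(10)
n = G.number_of_nodes()
D = np.array([[nx.shortest_path_length(G, i, j) for j in range(n)] for i in range(n)])
W = np.where(D > 0, 1.0 / np.maximum(D, 1) ** 2, 0.0)  # zero weight on the diagonal

rng = np.random.default_rng(0)
X = rng.standard_normal((n, 2))  # random initial 2-D node positions

for _ in range(500):  # plain gradient descent on the stress objective
    diff = X[:, None, :] - X[None, :, :]              # pairwise displacement vectors
    dist = np.linalg.norm(diff, axis=-1) + np.eye(n)  # +eye avoids divide-by-zero
    coef = W * (dist - D) / dist                      # per-pair gradient coefficient
    X -= 0.05 * 2 * (coef[:, :, None] * diff).sum(axis=1)

final = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print("stress:", (W * (final - D) ** 2).sum() / 2)
```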
Generate not just the visualization but also the story behind it
Another future direction for research involves not just producing the visualization itself, but also crafting a narrative, or even combining it with animations/videos that elucidate the patterns depicted within the visualization, thus directing the user's attention to the most crucial aspects of the generated visualization. Accomplishing this task will necessitate leveraging a multimodal large language model (LLM) trained extensively on a diverse corpus consisting of source data (e.g., CSV files, unstructured data), their visualizations, and the corresponding captions extracted from past visualization and other scientific literature. If we can solve the above challenges, we can then apply such a model to any data to be analyzed and automatically generate sample visualizations, animations, and narratives that are not only aesthetically pleasing but also informative, with a well-told story about the visualization.
Dominik Moritz, Carnegie Mellon University & Apple, USA
Machine learning development has shifted from being primarily model-centric to primarily data-centric [28]. In the early days of ML, engineers often trained models on hundreds or thousands of data points and meticulously tweaked the modeling function so as not to overfit the data. With the popularization of deep learning, however, data became increasingly important, and many models were derived from pre-trained models. Today, foundation models share similar transformer architectures, and big quality differences stem from the data they are trained or fine-tuned on. The data management community is well-equipped to tackle many of the quality and quantity challenges [29]. Yet, the full AI/ML lifecycle of requirement elicitation [30], data preparation, monitoring, tuning, and evaluation requires oversight by people who have to slice and dice millions or billions of data records. Since models synthesize patterns, a single record rarely defines the behavior of a model; ML engineers therefore need to understand patterns and trends in relevant subsets [31]. Relevant subsets are hard to predict and can rarely be precomputed, and the patterns ML engineers need to grasp in large, information-rich datasets are often subtle. As Herbert A. Simon put it: "What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently." This need to efficiently allocate attention is where data visualization becomes crucial.
Data visualization leverages our powerful perceptual systems to let us see patterns and trends in data. In the AI/ML lifecycle, it gives us a way to overview and design datasets for training, fine-tuning, and evaluation, and it enables serendipitous discovery of data patterns and issues. At the core of the necessary interactive analysis are interfaces that are fast and fast to use. On the one hand, well-designed mixed-initiative systems [32] for analysis make people faster [33][34]; we should continue to invest in good tooling that eliminates barriers to effective analysis. On the other hand, delays in interactive systems lead to fewer observations being made [35] and could steer analysts towards convenient and fast data, along with all the resulting biases. Fast databases, approximation [36][37], prefetching [38], and indexing techniques [39][40] can help developers build fast interactive systems. However, they add complexity for tool builders that may prevent them from quickly adapting interfaces to new needs. Modern data architectures like Mosaic [41] abstract away these low-level optimizations and are being adopted rapidly. Yet, many opportunities remain for deeper integration with databases, which are often optimized for order-of-second responses rather than real-time order-of-millisecond responses, or may not support many of the encoding transformations used in effective data visualizations, such as cartographic projections. A lot of exciting work remains to explore new ways to effectively present billion-record datasets at the speed of human thought to facilitate effective analysis and communication.
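As a small illustration of the pre-aggregation idea behind architectures like Mosaic, the sketch below materializes binned counts once in DuckDB so that each subsequent interaction touches only the small aggregate table rather than the full dataset. The table schema and filter are hypothetical.

```python
import duckdb

con = duckdb.connect()

# Hypothetical one-million-row table of events with an hour and a numeric value.
con.execute("""
    CREATE TABLE events AS
    SELECT floor(random() * 24)::INT AS hour, random() * 100 AS value
    FROM range(1000000)
""")

# Pay the full scan once to materialize binned counts ...
con.execute("""
    CREATE TABLE value_bins AS
    SELECT hour, floor(value / 10)::INT AS bin, count(*) AS n
    FROM events GROUP BY hour, bin
""")

# ... then each interaction (e.g., brushing hours 8-10) reads only the tiny
# aggregate table, keeping responses in the order-of-millisecond range.
histogram = con.execute("""
    SELECT bin, sum(n) AS n FROM value_bins
    WHERE hour BETWEEN 8 AND 10 GROUP BY bin ORDER BY bin
""").fetchall()
print(histogram)
```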
Aditya Parameswaran, University of California, Berkeley, USA
Recent advances in Generative AI, specifically Large Language Models (LLMs), promise to fully democratize data work. We have within our grasp the ability to allow anyone, even without a coding background, to do data work at scale, spanning data exploration, transformation, and insight discovery. LLMs can help bridge the gap between users and their data: they can communicate with users in natural language, and they can better understand the data itself, spanning both structured and unstructured formats.
Despite their promise, however, LLMs are unfortunately too brittle for data work. They often make mistakes, disregard instructions, and are prone to “hallucinating” incorrect facts. Users find it difficult to both understand the reasoning process of the LLM and recover from LLM failures. Users struggle to even understand how to ask the LLM to perform a specific task: small changes in wording can lead to drastically different outcomes. So, to democratize data work with LLMs, we will need to leverage ideas from multiple disciplines–including databases and human-computer interaction–by tackling the following research questions:
How can we make LLM-powered data tools more robust? The main difficulty with LLMs is that they make mistakes in an unpredictable fashion. Thus, for any data task at scale, it is virtually guaranteed that there will be LLM mistakes. So, to enhance the robustness of LLMs for data tasks, there are several unanswered questions. How do we best (and automatically) decompose tasks into smaller “tightly scoped” ones that are less difficult for LLMs? How do we ensure that these smaller tasks have a certain degree of reliability, perhaps by combining the results of one LLM with those of others with different capabilities? How do we deal with LLM inconsistencies (e.g., LLMs say A>B, B>C and C>A)? Can we catch LLM mistakes before they occur, perhaps by having LLMs themselves synthesize assertions and verify each other’s work [44][45]? This work is closely related to the data management community’s work on crowdsourcing, which dealt with error-prone humans, much like error-prone LLMs [43].
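As a rough sketch of the decompose-and-verify direction, the code below scopes an LLM to a single, tightly scoped extraction subtask and accepts its answer only when a second LLM call verifies it. Here `call_llm` is a hypothetical stand-in for any chat-completion client, not a specific API.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call (any chat-completion client)."""
    raise NotImplementedError

def extract_total(invoice_text: str) -> str:
    # Tightly scoped subtask: one field, one document.
    return call_llm(f"Return only the total amount in this invoice:\n{invoice_text}")

def verify(invoice_text: str, answer: str) -> bool:
    # A second model (or the same model in a checking role) renders a verdict.
    verdict = call_llm(
        f"Invoice:\n{invoice_text}\n\nIs '{answer}' the total amount? Answer yes or no."
    )
    return verdict.strip().lower().startswith("yes")

def robust_extract(invoice_text: str, attempts: int = 3) -> str | None:
    # Sample several answers; accept one only if the verifier agrees.
    answers = [extract_total(invoice_text) for _ in range(attempts)]
    for answer in answers:
        if verify(invoice_text, answer):
            return answer
    return None  # escalate to a human instead of guessing
```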
How can LLM-powered data tools help analyze unstructured data? The vast majority of data exists in unstructured, difficult-to-analyze formats, including PDFs, Word files, images, and videos. With the power of LLMs, we may now be able to empower users to make sense of such datasets via intuitive interfaces [42]. For example, journalists investigating police misconduct may want to extract officer names from internal affairs reports, or search for specific activities in camera feeds, all without code. This leads to several questions, such as: How do we design lightweight interaction modalities that allow users to "guide" the system by navigating, searching, and highlighting points of interest in these media? How can we best leverage LLMs to automatically craft and suggest transformations based on the content, user guidance, and historical data? How do we design presentation techniques to help users "decide" which among a list of ranked suggestions is the one they would like to pursue?
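To illustrate the "guidance" question on the officer-name example, here is a hypothetical sketch in which a span the user highlighted steers an LLM-based extractor over a corpus of reports. As above, `call_llm` stands in for any LLM client, and all names and paths are illustrative.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call (same as the sketch above)."""
    raise NotImplementedError

def extract_officer_names(report_text: str, user_example: str) -> list[str]:
    """Extract officer names, steered by a span the user highlighted as an example."""
    prompt = (
        "Extract all police officer names from the report below.\n"
        f"The user highlighted this example of what they want: '{user_example}'\n"
        f"Report:\n{report_text}\n"
        "Return one name per line."
    )
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

# Applied across a corpus of internal affairs reports (paths hypothetical):
# records = {path: extract_officer_names(open(path).read(), "Sgt. J. Doe")
#            for path in report_paths}
```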
How should LLM-powered data tools interact with users? Present-day LLMs employ a chat-based natural language interface. While this interface is flexible, it is a poor fit for data work, as it violates many fundamental human-centered design principles, including giving users a place to start, helping them recover from errors, and showcasing the capabilities of the underlying system. Moreover, it sits separate from existing popular data tools such as spreadsheets or computational notebooks, where data exploration and visualization actually happen [47]. So, we ask: what should next-generation data work interfaces look like? What are the strengths and weaknesses of various forms of user input, including traditional code, natural language, form-based interfaces, demonstration, examples, and direct manipulation [46], for the purpose of data work? If we do support forms of input beyond natural language, how do we translate such input to and from natural language so that it can be best interpreted by an LLM? Finally, data work is rarely a solo activity: how do we best support collaboration and handing off work across individuals?
Overall, I believe this ambitious but risky vision of democratizing big data work, by leveraging the power of LLMs to better understand both users and data, has the potential for tremendous impact across a wide variety of domains.
Copyright © 2024, Sihem Amer-Yahia, Leilani Battle, Yifan Hu, Dominik Moritz, Aditya Parameswaran, Nikos Bikakis, Panos K. Chrysanthis, Guoliang Li, George Papastefanatos, Lingyun Yu. All rights reserved.