DOI: 10.1109/TVCG.2023.3327367
Published 8 August 2023 in IEEE Transactions on Visualization and Computer Graphics

Dead or Alive: Continuous Data Profiling for Interactive Data Science

Adam Perer, Vaishnavi Gorantla, Will Epperson + 1 more author

Abstract

Profiling data by plotting distributions and analyzing summary statistics is a critical step throughout data analysis. Currently, this process is manual and tedious since analysts must write extra code to examine their data after every transformation. This inefficiency may lead data scientists to profile their data infrequently, rather than after each transformation, making it easy for them to miss important errors or insights. We propose continuous data profiling as a process that allows analysts to immediately see interactive visual summaries of their data throughout their data analysis to facilitate fast and thorough analysis. Our system, AutoProfiler, presents three ways to support continuous data profiling: (1) it automatically displays data distributions and summary statistics to facilitate data comprehension; (2) it is live, so visualizations are always accessible and update automatically as the data updates; (3) it supports follow-up analysis and documentation by authoring code for the user in the notebook. In a user study with 16 participants, we evaluate two versions of our system that integrate different levels of automation: both automatically show data profiles and facilitate code authoring; however, one version updates reactively (“live”) and the other updates only on demand (“dead”). We find that both tools, dead or alive, facilitate insight discovery, with 91% of user-generated insights originating from the tools rather than from manual profiling code written by users. Participants found live updates intuitive and felt they helped them verify their transformations, while those with on-demand profiles liked the ability to look at past visualizations. We also present a longitudinal case study on how AutoProfiler helped domain scientists find serendipitous insights about their data through automatic, live data profiles. Our results have implications for the design of future tools that offer automated data analysis support.
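As a rough illustration of the manual profiling step the abstract describes, the sketch below summarizes each column of a small table and re-profiles it after a transformation. It is not AutoProfiler's implementation: the table layout (a dict of column lists) and the names `profile_column`, `profile_table`, `transform_and_profile`, and `drop_missing_ages` are all hypothetical, and only the Python standard library is assumed.

```python
from collections import Counter
from statistics import mean, median


def profile_column(values):
    """Summarize one column: row count, missing count, and basic statistics."""
    present = [v for v in values if v is not None]
    summary = {"count": len(values), "missing": len(values) - len(present)}
    is_numeric = present and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        # Numeric column: report range and central tendency.
        summary.update(min=min(present), max=max(present),
                       mean=mean(present), median=median(present))
    else:
        # Non-numeric (or empty) column: report distinct and most common values.
        counts = Counter(present)
        summary.update(unique=len(counts), top=counts.most_common(3))
    return summary


def profile_table(table):
    """Profile every column of a table stored as {column_name: list_of_values}."""
    return {name: profile_column(values) for name, values in table.items()}


def transform_and_profile(table, transform):
    """Apply a transformation and immediately re-profile the result."""
    new_table = transform(table)
    return new_table, profile_table(new_table)


def drop_missing_ages(table):
    """Hypothetical transformation: keep only rows whose age is present."""
    keep = [i for i, age in enumerate(table["age"]) if age is not None]
    return {name: [values[i] for i in keep] for name, values in table.items()}


# Made-up example data, used only to exercise the functions above.
data = {"age": [34, None, 29, 41], "city": ["Oslo", "Lima", "Oslo", None]}
before = profile_table(data)
cleaned, after = transform_and_profile(data, drop_missing_ages)
print(before["age"])
print(after["age"])
```

Comparing `before` and `after` by hand is the step an analyst would otherwise repeat after every transformation; a live tool would refresh the profile automatically instead.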

Related Scholarly Articles

AutoDS: Towards Human-Centered Automation of Data Science

Dakuo Wang, Erick Oduor, Justin D. Weisz + 2 others

13 January 2021

Data science (DS) projects often follow a lifecycle that consists of laborious tasks for data scientists and domain experts (e.g., data exploration, model training, etc.). Only recently have machine learning (ML) researchers developed promising automation techniques to aid data workers in these tasks. This paper introduces AutoDS, an automated machine learning (AutoML) system that aims to leverage the latest ML automation techniques to support data science projects. Data workers only need to upload their dataset; the system can then automatically suggest ML configurations, preprocess data, select algorithms, and train the model. These suggestions are presented to the user via a web-based graphical user interface and a notebook-based programming user interface. Our goal is to offer a systematic investigation of user interaction and perceptions of using an AutoDS system in solving a data science task. We studied AutoDS with 30 professional data scientists, where one group used AutoDS, and the other did not, to complete a data science project. As expected, AutoDS improves productivity; yet surprisingly, we find that the models produced by the AutoDS group have higher quality and fewer errors, but lower human confidence scores. We reflect on the findings by presenting design implications for incorporating automation techniques into human work in the data science lifecycle.

Cheat Sheets for Data Visualization Techniques

Zezhong Wang, Lovisa Sundin, Dave Murray-Rust + 1 other

18 January 2020

This paper introduces the concept of 'cheat sheets' for data visualization techniques, a set of concise graphical explanations and textual annotations inspired by infographics, data comics, and cheat sheets in other domains. Cheat sheets aim to address the increasing need for accessible material that supports a wide audience in understanding data visualization techniques, their use, their fallacies and so forth. We have carried out an iterative design process with practitioners, teachers and students of data science and visualization, resulting in six types of cheat sheet (anatomy, construction, visual patterns, pitfalls, false-friends and well-known relatives) for six types of visualization, and formats for presentation. We assess these with a qualitative user study with 11 participants that demonstrates the readability and usefulness of our cheat sheets.

Interactive Data Visualization in Jupyter Notebooks

J. Freire, J. Comba, K. Gaither + 2 others

1 March 2021

Interactive visualizations are at the core of the exploratory data analysis process, enabling users to directly manipulate and gain insights from data. In this article, we present three different ways in which interactive visualizations can be included in Jupyter Notebooks: 1) matplotlib callbacks; 2) visualization toolkits; and 3) embedding HTML visualizations. We hope that this article will help developers to select the best tools to build their interactive charts in Jupyter Notebooks.

Variability in data visualization: a software product line approach

J. Horcas, J. Galindo, David Benavides

12 September 2022

Data visualization aims to effectively communicate quantitative information by understanding which techniques and displays work better for different circumstances and why. There are a variety of software solutions capable of generating a multitude of different visualizations of the same dataset. However, data visualization exposes a large space of visual configurations depending on the type of data to be visualized, the different displays (e.g., scatter plots, line graphs, pie charts), the visual components to encode the data (e.g., lines, dots, bars), or the specific visual attributes of those components (e.g., color, shape, size, length). Researchers and developers are not usually aware of best practices in data visualization, and they are required to learn about both the design practices that make communication effective and the low-level details of the specific software tool used to generate the visualization. This paper proposes a software product line approach to model and materialize the variability of the visualization design process, guided by feature models. We encode the visualization knowledge regarding the best design practices, resolve the variability following a step-wise configuration approach, and then evaluate our proposal for a specific software visualization tool. Our solution helps researchers and developers communicate their quantitative results effectively by assisting them in the selection and generation of the visualizations that work best for each case. We open a new window of research where data visualization and variability meet each other.

How do Data Science Workers Collaborate? Roles, Workflows, and Tools

Michael J. Muller, Dakuo Wang, Amy X. Zhang

18 January 2020

Today, the prominence of data science within organizations has given rise to teams of data science workers collaborating on extracting insights from data, as opposed to individual data scientists working alone. However, we still lack a deep understanding of how data science workers collaborate in practice. In this work, we conducted an online survey with 183 participants who work in various aspects of data science. We focused on their reported interactions with each other (e.g., managers with engineers) and with different tools (e.g., Jupyter Notebook). We found that data science teams are extremely collaborative and work with a variety of stakeholders and tools during the six common steps of a data science workflow (e.g., clean data and train model). We also found that the collaborative practices workers employ, such as documentation, vary according to the kinds of tools they use. Based on these findings, we discuss design implications for supporting data science team collaborations and future research directions.
