DOI: 10.7717/peerj-cs.1361
Published on 18 May 2023 in PeerJ Computer Science

What issues are data scientists talking about? Identification of current data science issues using semantic content analysis of Q&A communities

Fatih Gurcan

Abstract

Background: Because of the growing involvement of communities from various disciplines, data science is constantly evolving and gaining popularity. The growing interest in data science-based services and applications presents numerous challenges for their development. Therefore, data scientists frequently turn to various forums, particularly domain-specific Q&A websites, to solve difficulties. These websites evolve into data science knowledge repositories over time. Analysis of such repositories can provide valuable insights into the applications, topics, trends, and challenges of data science.

Methods: In this article, we investigated what data scientists are asking by analyzing all posts to date on DSSE, a data science-focused Q&A website. To discover main topics embedded in data science discussions, we used latent Dirichlet allocation (LDA), a probabilistic approach for topic modeling.

Results: As a result of this analysis, 18 main topics were identified that demonstrate the current interests and issues in data science. We then examined the topics’ popularity and difficulty. In addition, we identified the most commonly used tasks, techniques, and tools in data science. As a result, “Model Training”, “Machine Learning”, and “Neural Networks” emerged as the most prominent topics. Also, “Data Manipulation”, “Coding Errors”, and “Tools” were identified as the most viewed (most popular) topics. On the other hand, the most difficult topics were identified as “Time Series”, “Computer Vision”, and “Recommendation Systems”. Our findings have significant implications for many data science stakeholders who are striving to advance data-driven architectures, concepts, tools, and techniques.
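The topic-modeling step named in the abstract can be illustrated with a small, self-contained sketch. The abstract does not say which LDA implementation, inference method, preprocessing, or hyperparameters were used, so everything below is an assumption made for illustration: the function name `fit_lda`, the collapsed Gibbs sampler, the values of `alpha` and `beta`, and the four toy posts are all hypothetical and are not data or code from the study.

```python
import random


def fit_lda(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    Hypothetical helper for illustration only. Returns (theta, topic_word, vocab):
    theta[d][k] is the estimated share of topic k in document d, and
    topic_word[k][v] counts how often vocab[v] was assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    topics = range(num_topics)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic from the collapsed conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in topics
                ]
                k = rng.choices(topics, weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions.
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word, vocab


def top_words(topic_word, vocab, n=3):
    """Return the n most frequently assigned words for each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in topic_word
    ]


# Toy, made-up posts (not DSSE data), already lowercased and tokenized.
posts = [
    "train model loss epoch neural network".split(),
    "neural network layer loss train".split(),
    "pandas dataframe column merge rows".split(),
    "dataframe rows filter column pandas".split(),
]
theta, topic_word, vocab = fit_lda(posts, num_topics=2)
for words in top_words(topic_word, vocab):
    print(words)
```

In a study of this kind, each post would typically be labeled with its highest-weight topic in `theta` before per-topic statistics are aggregated. That labeling step is also an assumption here; the abstract does not describe how posts were assigned to topics or how popularity and difficulty were computed.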

Related Scientific Articles

AutoDS: Towards Human-Centered Automation of Data Science

Dakuo Wang, Erick Oduor, Justin D. Weisz + 2 others

13 January 2021

Data science (DS) projects often follow a lifecycle that consists of laborious tasks for data scientists and domain experts (e.g., data exploration and model training). Only recently have machine learning (ML) researchers developed promising automation techniques to aid data workers in these tasks. This paper introduces AutoDS, an automated machine learning (AutoML) system that aims to leverage the latest ML automation techniques to support data science projects. Data workers only need to upload their dataset; the system can then automatically suggest ML configurations, preprocess data, select algorithms, and train the model. These suggestions are presented to the user via a web-based graphical user interface and a notebook-based programming user interface. Our goal is to offer a systematic investigation of user interaction and perceptions of using an AutoDS system in solving a data science task. We studied AutoDS with 30 professional data scientists, where one group used AutoDS, and the other did not, to complete a data science project. As expected, AutoDS improves productivity; yet surprisingly, we find that the models produced by the AutoDS group have higher quality and fewer errors, but lower human confidence scores. We reflect on the findings by presenting design implications for incorporating automation techniques into human work in the data science lifecycle.

On the Variability of Software Engineering Needs for Deep Learning: Stages, Trends, and Application Types

A. Mockus, Minghui Zhou, Zhixing Wang + 1 other

1 February 2023

The wide use of Deep learning (DL) has not been followed by the corresponding advances in software engineering (SE) for DL. Research shows that developers writing DL software have specific development stages (i.e., SE4DL stages) and face new DL-specific problems. Despite substantial research, it is unclear how DL developers’ SE needs for DL vary over stages, application types, or if they change over time. To help focus research and development efforts on DL-development challenges, we analyze 92,830 Stack Overflow (SO) questions and 227,756 READMEs of public repositories related to DL. Latent Dirichlet Allocation (LDA) reveals 27 topics for the SO questions where 19 (70.4%) topics mainly relate to a single SE4DL stage, and eight topics span multiple stages. Most questions concern Data Preparation and Model Setup stages. The relative rates of questions for 11 topics have increased, for eight topics decreased over time. Questions for the former 11 topics had a lower percentage of accepting an answer than the remaining questions. LDA on README files reveals 16 distinct application types for the 227k repositories. We apply the LDA model fitted on READMEs to the 92,830 SO questions and find that 27% of the questions are related to the 16 DL application types. The most asked question topic varies across application types, with half primarily relating to the second and third stages. Specifically, developers ask the most questions about topics primarily relating to Data Preparation (2nd) stage for four mature application types such as Image Segmentation, and topics primarily relating to Model Setup (3rd) stage for four application types concerning emerging methods such as Transfer Learning. Based on our findings, we distill several actionable insights for SE4DL research, practice, and education, such as better support for using trained models, application-type specific tools, and teaching materials.

What is Data Science?

Michael L. Brodie

20 January 2023

The Communications website, https://cacm.acm.org, features more than a dozen bloggers in the BLOG@CACM community. In each issue of Communications, we'll publish selected posts or excerpts. Koby Mike and Orit Hazzan consider why multiple definitions are needed to pin down data science.

Automating data science

C. Williams, Padhraic Smyth, Luc de Raedt + 3 others

12 May 2021

Given the complexity of data science projects and related demand for human expertise, automation has the potential to transform the data science process.

Data science: a game changer for science and innovation

P. Pagano, Valerio Grossi, D. Pedreschi + 3 others

19 April 2021

This paper shows data science’s potential for disruptive innovation in science, industry, policy, and people’s lives. We describe how data science will affect science and society at large in the coming years, including the ethical problems of managing human behavior data, and we consider quantitative expectations of its economic impact. We introduce concepts such as open science and e-infrastructure as useful tools for supporting ethical data science and training new generations of data scientists. Finally, this work outlines the SoBigData Research Infrastructure as an easy-to-access platform for executing complex data science processes. The services proposed by SoBigData are aimed at using data science to understand the complexity of our contemporary, globally interconnected society.
