DOI: 10.1109/tkde.2019.2962680
Terbit pada 1 Agustus 2021 Pada IEEE Transactions on Knowledge and Data Engineering

CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories

Lidia Contreras-Ochando M. J. R. Quintana C. Ferri + 5 penulis

Abstrak

CRISP-DM(CRoss-Industry Standard Process for Data Mining) has its origins in the second half of the nineties and is thus about two decades old. According to many surveys and user polls it is still the de facto standard for developing data mining and knowledge discovery projects. However, undoubtedly the field has moved on considerably in twenty years, with data science now the leading term being favoured over data mining. In this paper we investigate whether, and in what contexts, CRISP-DM is still fit for purpose for data science projects. We argue that if the project is goal-directed and process-driven the process model view still largely holds. On the other hand, when data science projects become more exploratory the paths that the project can take become more varied, and a more flexible model is called for. We suggest what the outlines of such a trajectory-based model might look like and how it can be used to categorise data science projects (goal-directed, exploratory or data management). We examine seven real-life exemplars where exploratory activities play an important role and compare them against 51 use cases extracted from the NIST Big Data Public Working Group. We anticipate this categorisation can help project planning in terms of time and cost characteristics.

Artikel Ilmiah Terkait

CRISP-DM for Data Science: Strengths, Weaknesses and Potential Next Steps

J. Saltz

15 Desember 2021

This paper explores the strengths and weaknesses of CRISP-DM when used for data science projects. The paper then explores what key actions data science teams using CRISP-DM should consider that addresses CRISP-DM’s weaknesses. In brief, CRISP-DM, which is the most popular framework teams use to execute data science projects, provides an easy to understand description of the data science project workflow (i.e., the data science life cycle). However, CRISP-DM’s project phases miss some key aspects of the data science project life cycle. In addition, CRISP-DM’s task-focused approach fails to address how a team should prioritize tasks, and in general, collaborate and communicate. Hence, this paper also describes how CRISP-DM could be combined with a team coordination framework, such as Scrum or Data Driven Scrum, which is a newer collaboration framework developed to address the unique data science coordination challenges.

Current approaches for executing big data science projects—a systematic literature review

I. Krasteva J. Saltz

21 Februari 2022

There is an increasing number of big data science projects aiming to create value for organizations by improving decision making, streamlining costs or enhancing business processes. However, many of these projects fail to deliver the expected value. It has been observed that a key reason many data science projects don’t succeed is not technical in nature, but rather, the process aspect of the project. The lack of established and mature methodologies for executing data science projects has been frequently noted as a reason for these project failures. To help move the field forward, this study presents a systematic review of research focused on the adoption of big data science process frameworks. The goal of the review was to identify (1) the key themes, with respect to current research on how teams execute data science projects, (2) the most common approaches regarding how data science projects are organized, managed and coordinated, (3) the activities involved in a data science projects life cycle, and (4) the implications for future research in this field. In short, the review identified 68 primary studies thematically classified in six categories. Two of the themes (workflow and agility) accounted for approximately 80% of the identified studies. The findings regarding workflow approaches consist mainly of adaptations to CRISP-DM (vs entirely new proposed methodologies). With respect to agile approaches, most of the studies only explored the conceptual benefits of using an agile approach in a data science project (vs actually evaluating an agile framework being used in a data science context). Hence, one finding from this research is that future research should explore how to best achieve the theorized benefits of agility. Another finding is the need to explore how to efficiently combine workflow and agile frameworks within a data science context to achieve a more comprehensive approach for project execution.

The Art and Practice of Data Science Pipelines: A Comprehensive Study of Data Science Pipelines In Theory, In-The-Small, and In-The-Large

Mohammad Wardat Sumon Biswas Hridesh Rajan

2 Desember 2021

Increasingly larger number of software systems today are including data science components for descriptive, predictive, and prescriptive analytics. The collection of data science stages from acquisition, to cleaning/curation, to modeling, and so on are referred to as data science pipelines. To facilitate research and practice on data science pipelines, it is essential to understand their nature. What are the typical stages of a data science pipeline? How are they connected? Do the pipelines differ in the theoretical representations and that in the practice? Today we do not fully understand these architectural characteristics of data science pipelines. In this work, we present a three-pronged comprehensive study to answer this for the state-of-the-art, data science in-the-small, and data science in-the-large, Our study analyzes three datasets: a collection of 71 proposals for data science pipelines and related concepts in theory, a collection of over 105 implementations of curated data science pipelines from Kaggle competitions to understand data science in-the-small, and a collection of 21 mature data science projects from GitHub to understand data science in-the-large. Our study has led to three representations of data science pipelines that capture the essence of our subjects in theory, in-the-small, and in-the-large.

Data Mining preparation: Process, Techniques and Major Issues in Data Analysis

Mustafa Abdalrassual Jassim S. Abdulwahid

1 Maret 2021

Data preparation is an essential stage in data analysis. Many institutions or companies are interested in converting data into pure forms that can be used for scientific and profit purposes. It helps you set goals regarding system capabilities and features or the benefits your company expects from its investment. This purpose creates an immediate need to review and prepare the data to clean the raw data. In this paper, we highlight the importance of data preparation in data analysis and data extraction techniques, in addition to an integrated overview of relevant recent studies dealing with mining methodology, data types diversity, user interaction, and data mining. Finally, we suggest some potential suggestions for future research and development.

In the Backrooms of Data Science

P. Almklov Thomas Østerlie Elena Parmiggiani

2022

Much information systems research on data science treats data as preexisting objects and focuses on how these objects are analyzed. Such a view, however, overlooks the work involved in finding and preparing the data in the first place, such that they are available to be analyzed. In this paper, we draw on a longitudinal study of data management in the oil and gas industry to shed light on this backroom data work. We find that this type of work is qualitatively different from the front-stage data analytics in the realm of data science but is also deeply interwoven with it. We show that this work is unstable and bidirectional. That is, the work practices are constantly changing and must simultaneously take into account what data might be possible to access as well as the potential future uses of the data. It is also a collaborative endeavor involving cross-disciplinary expertise that seeks to establish control over data and is shaped by the epistemological orientation of the oil and gas domain.

Daftar Referensi

0 referensi

Tidak ada referensi ditemukan.

Artikel yang Mensitasi

0 sitasi

Tidak ada artikel yang mensitasi.