DOI: 10.1145/3491102.3501998
Published on March 8, 2022 at the International Conference on Human Factors in Computing Systems

Model Positionality and Computational Reflexivity: Promoting Reflexivity in Data Science

S. Cambo, Darren Gergle

Abstract

Data science and machine learning provide indispensable techniques for understanding phenomena at scale, but the discretionary choices made when doing this work are often not recognized. Drawing from qualitative research practices, we describe how the concepts of positionality and reflexivity can be adapted to provide a framework for understanding, discussing, and disclosing the discretionary choices and subjectivity inherent to data science work. We first introduce the concepts of model positionality and computational reflexivity that can help data scientists to reflect on and communicate the social and cultural context of a model’s development and use, the data annotators and their annotations, and the data scientists themselves. We then describe the unique challenges of adapting these concepts for data science work and offer annotator fingerprinting and position mining as promising solutions. Finally, we demonstrate these techniques in a case study of the development of classifiers for toxic commenting in online communities.
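The abstract names annotator fingerprinting as one technique but does not spell out its mechanics. As a hedged illustration of one plausible reading, the sketch below (not the authors' implementation; all names and numbers are hypothetical) represents each annotator by a vector summarizing their labeling behavior and clusters annotators whose behavior, and hence positionality, is similar:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical "annotator fingerprint": each row is one annotator, each
# column the fraction of comments in some content category that the
# annotator labeled toxic. These values are invented for illustration.
fingerprints = np.array([
    [0.90, 0.80, 0.70],  # annotator A: labels most comments toxic
    [0.85, 0.75, 0.65],  # annotator B: similar tendency to A
    [0.20, 0.10, 0.15],  # annotator C: rarely labels comments toxic
    [0.25, 0.20, 0.10],  # annotator D: similar tendency to C
])

# Group annotators with similar fingerprints; each cluster can then be
# examined as a distinct annotator "position" within the dataset.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(fingerprints)
print(kmeans.labels_)  # cluster ids are arbitrary; similar annotators share one
```

Under this reading, a model trained on majority-vote labels can then be described in terms of which annotator cluster its predictions most resemble.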

Related Scientific Articles

Do Datasets Have Politics? Disciplinary Values in Computer Vision Dataset Development

A. Hanna M. Scheuerman Emily L. Denton

August 9, 2021

Data is a crucial component of machine learning. The field is reliant on data to train, validate, and test models. With increased technical capabilities, machine learning research has boomed in both academic and industry settings, and one major focus has been on computer vision. Computer vision is a popular domain of machine learning increasingly pertinent to real-world applications, from facial recognition in policing to object detection for autonomous vehicles. Given computer vision's propensity to shape machine learning research and impact human life, we seek to understand disciplinary practices around dataset documentation - how data is collected, curated, annotated, and packaged into datasets for computer vision researchers and practitioners to use for model tuning and development. Specifically, we examine what dataset documentation communicates about the underlying values of vision data and the larger practices and goals of computer vision as a field. To conduct this study, we collected a corpus of about 500 computer vision datasets, from which we sampled 114 dataset publications across different vision tasks. Through both a structured and thematic content analysis, we document a number of values around accepted data practices, what makes desirable data, and the treatment of humans in the dataset construction process. We discuss how computer vision dataset authors value efficiency at the expense of care; universality at the expense of contextuality; impartiality at the expense of positionality; and model work at the expense of data work. Many of the silenced values we identify sit in opposition with social computing practices. We conclude with suggestions on how to better incorporate silenced values into the dataset creation and curation process.

Jury Learning: Integrating Dissenting Voices into Machine Learning Models

Mitchell L. Gordon Michael S. Bernstein J. Park + 4 others

February 7, 2022

Whose labels should a machine learning (ML) algorithm learn to emulate? For ML tasks ranging from online comment toxicity to misinformation detection to medical diagnosis, different groups in society may have irreconcilable disagreements about ground truth labels. Supervised ML today resolves these label disagreements implicitly using majority vote, which overrides minority groups’ labels. We introduce jury learning, a supervised ML approach that resolves these disagreements explicitly through the metaphor of a jury: defining which people or groups, in what proportion, determine the classifier’s prediction. For example, a jury learning model for online toxicity might centrally feature women and Black jurors, who are commonly targets of online harassment. To enable jury learning, we contribute a deep learning architecture that models every annotator in a dataset, samples from annotators’ models to populate the jury, then runs inference to classify. Our architecture enables juries that dynamically adapt their composition, explore counterfactuals, and visualize dissent. A field evaluation finds that practitioners construct diverse juries that alter 14% of classification outcomes.
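The inference step described above (one model per annotator, a jury drawn from chosen annotators, a verdict aggregated from jurors' predictions) can be sketched minimally. This is not the paper's deep-learning architecture; the annotator "models" below are stubbed as fixed toxicity scores, and all names and values are invented for illustration:

```python
from statistics import mean

# Stand-ins for per-annotator predictive models: each maps a comment to a
# toxicity score in [0, 1]. In the real system these are learned models.
annotator_models = {
    "ann_1": lambda text: 0.9,  # tends to rate comments as toxic
    "ann_2": lambda text: 0.8,
    "ann_3": lambda text: 0.2,  # tends to rate comments as non-toxic
    "ann_4": lambda text: 0.1,
}

def jury_predict(models, jury_ids, text, threshold=0.5):
    """Run each juror's model on the text and classify by the jury's mean score."""
    scores = [models[a](text) for a in jury_ids]
    return mean(scores) >= threshold

comment = "example online comment"
# Two juries with different compositions can reach different verdicts,
# which is the point: jury composition is an explicit modeling choice.
verdict_a = jury_predict(annotator_models, ["ann_1", "ann_2", "ann_3"], comment)  # True
verdict_b = jury_predict(annotator_models, ["ann_2", "ann_3", "ann_4"], comment)  # False
```

Making the jury a named parameter, rather than an implicit majority vote over all annotators, is what lets practitioners deliberately center groups such as women and Black jurors.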

An Exploration of Intersectionality in Software Development and Use

Alicia E. Boyd Hana Winchester Brittany Johnson

May 1, 2022

The growing ubiquity of machine learning technologies has led to concern and concentrated efforts at improving data-centric research and practice. While much work has been done on addressing equity concerns with respect to unary identities (e.g., race or gender), little to no work in Software Engineering has studied intersectionality to determine how we can provide equitable outcomes for complex, overlapping social identities in data-driven tech. To this end, we designed a survey to learn the landscape of intersectional identities in tech, where these populations contribute data, and how marginalized populations feel about the impact technology has on their day-to-day lives. Our data thus far, collected from 12 respondents and composed mostly of white and male identities, further highlights the lack of representation in modern data sets and the need for contributions that explicitly explore how to support data-driven research and development. ACM Reference Format: Hana Winchester, Alicia E. Boyd, and Brittany Johnson. 2022. An Exploration of Intersectionality in Software Development and Use. In Third Workshop on Gender Equality, Diversity, and Inclusion in Software Engineering (GE@ICSE'22), May 20, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3524501.3527605

What issues are data scientists talking about? Identification of current data science issues using semantic content analysis of Q&A communities

Fatih Gurcan

May 18, 2023

Background: Because of the growing involvement of communities from various disciplines, data science is constantly evolving and gaining popularity. The growing interest in data science-based services and applications presents numerous challenges for their development. Therefore, data scientists frequently turn to various forums, particularly domain-specific Q&A websites, to solve difficulties. These websites evolve into data science knowledge repositories over time. Analysis of such repositories can provide valuable insights into the applications, topics, trends, and challenges of data science.

Methods: In this article, we investigated what data scientists are asking by analyzing all posts to date on DSSE, a data science-focused Q&A website. To discover main topics embedded in data science discussions, we used latent Dirichlet allocation (LDA), a probabilistic approach for topic modeling.

Results: As a result of this analysis, 18 main topics were identified that demonstrate the current interests and issues in data science. We then examined the topics' popularity and difficulty. In addition, we identified the most commonly used tasks, techniques, and tools in data science. As a result, "Model Training", "Machine Learning", and "Neural Networks" emerged as the most prominent topics. Also, "Data Manipulation", "Coding Errors", and "Tools" were identified as the most viewed (most popular) topics. On the other hand, the most difficult topics were identified as "Time Series", "Computer Vision", and "Recommendation Systems". Our findings have significant implications for many data science stakeholders who are striving to advance data-driven architectures, concepts, tools, and techniques.
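The LDA workflow this study applies to Q&A posts can be sketched in a few lines. The sketch below is a minimal toy stand-in, not the study's pipeline: the posts are invented, and it fits only 2 topics where the study extracted 18 from the full Data Science Stack Exchange corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for Q&A post titles (invented for illustration).
posts = [
    "error training neural network model loss not decreasing",
    "neural network model training accuracy improves slowly",
    "pandas dataframe merge columns key error",
    "merge two dataframes pandas index error",
]

# Convert posts to a document-term matrix of word counts.
counts = CountVectorizer(stop_words="english").fit_transform(posts)

# Fit LDA: each topic is a distribution over words, and each post is a
# distribution over topics; dominant topics summarize what posts discuss.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
doc_topics = lda.transform(counts)  # shape: (4 posts, 2 topics), rows sum to 1
```

In the study, the learned topic-word distributions were inspected and labeled by hand (e.g., "Model Training", "Time Series"), and per-topic view and answer statistics were used to rate popularity and difficulty.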

Taking a critical look at the critical turn in data science: From “data feminism” to transnational feminist data science

Zhasmina Tacheva

July 1, 2022

Through a critical analysis of recent developments in the theory and practice of data science, including nascent feminist approaches to data collection and analysis, this commentary aims to signal the need for a transnational feminist orientation towards data science. I argue that while much needed in the context of persistent algorithmic oppression, a Western feminist lens limits the scope of problems, and thus solutions, that critical data scholars and scientists can consider. A resolutely transnational feminist approach, on the other hand, can provide data theorists and practitioners with the hermeneutic tools necessary to identify and disrupt instances of injustice in a more inclusive and comprehensive manner. A transnational feminist orientation to data science can pay particular attention to the communities rendered most vulnerable by algorithmic oppression, such as women of color and populations in non-Western countries. I present five ways in which transnational feminism can be leveraged as an intervention into the current data science canon.
