DOI: 10.1145/3476058
Published August 9, 2021 in Proc. ACM Hum. Comput. Interact.

Do Datasets Have Politics? Disciplinary Values in Computer Vision Dataset Development

A. Hanna, M. Scheuerman, Emily L. Denton

Abstract

Data is a crucial component of machine learning. The field relies on data to train, validate, and test models. With increased technical capabilities, machine learning research has boomed in both academic and industry settings, and one major focus has been computer vision. Computer vision is a popular domain of machine learning increasingly pertinent to real-world applications, from facial recognition in policing to object detection for autonomous vehicles. Given computer vision's propensity to shape machine learning research and impact human life, we seek to understand disciplinary practices around dataset documentation: how data is collected, curated, annotated, and packaged into datasets for computer vision researchers and practitioners to use for model tuning and development. Specifically, we examine what dataset documentation communicates about the underlying values of vision data and the larger practices and goals of computer vision as a field. To conduct this study, we collected a corpus of about 500 computer vision datasets, from which we sampled 114 dataset publications across different vision tasks. Through both a structured and a thematic content analysis, we document a number of values around accepted data practices, what makes desirable data, and the treatment of humans in the dataset construction process. We discuss how computer vision dataset authors value efficiency at the expense of care; universality at the expense of contextuality; impartiality at the expense of positionality; and model work at the expense of data work. Many of the silenced values we identify sit in opposition to social computing practices. We conclude with suggestions on how to better incorporate silenced values into the dataset creation and curation process.
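
As a concrete, purely illustrative sketch of what the tallying stage of such a structured content analysis might look like, the snippet below counts value codes across coded publications. The code book, file name, and column names are assumptions for illustration, not the authors' published instrument.

```python
# Minimal sketch of tallying value codes from a structured content analysis.
# The coding scheme, CSV layout, and file name are hypothetical assumptions.
import csv
from collections import Counter

# Hypothetical code book drawn from the value tensions named in the abstract.
VALUE_CODES = {"efficiency", "care", "universality", "contextuality",
               "impartiality", "positionality", "model_work", "data_work"}

def tally_codes(annotations_csv: str) -> Counter:
    """Count how often each value code was applied across publications.

    Expects one row per applied code, with columns: publication_id, code.
    """
    counts = Counter()
    with open(annotations_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            code = row["code"].strip().lower()
            if code in VALUE_CODES:  # ignore codes outside the code book
                counts[code] += 1
    return counts

if __name__ == "__main__":
    # Assumes a hypothetical coded_publications.csv produced by the coders.
    print(tally_codes("coded_publications.csv").most_common())
```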

Related Scientific Articles

Documenting Computer Vision Datasets: An Invitation to Reflexive Data Practices

Tianling Yang, A. Hanna, Diana Serbanescu + 3 others

March 3, 2021

In industrial computer vision, discretionary decisions surrounding the production of image training data remain widely undocumented. Recent research taking issue with such opacity has proposed standardized processes for dataset documentation. In this paper, we expand this space of inquiry through fieldwork at two data processing companies and thirty interviews with data workers and computer vision practitioners. We identify four key issues that hinder the documentation of image datasets and the effective retrieval of production contexts. Finally, we propose reflexivity, understood as a collective consideration of social and intellectual factors that lead to praxis, as a necessary precondition for documentation. Reflexive documentation can help to expose the contexts, relations, routines, and power structures that shape data.

Large image datasets: A pyrrhic win for computer vision?

Abeba Birhane, Vinay Uday Prabhu

June 24, 2020

In this paper we investigate problematic practices and consequences of large-scale vision datasets (LSVDs). We examine broad issues such as the question of consent and justice as well as specific concerns such as the inclusion of verifiably pornographic images in datasets. Taking the ImageNet-ILSVRC-2012 dataset as an example, we perform a cross-sectional model-based quantitative census covering factors such as age, gender, NSFW content scoring, class-wise accuracy, human-cardinality analysis, and the semanticity of the image class information in order to statistically investigate the extent and subtleties of ethical transgressions. We then use the census to help hand-curate a look-up table of images in the ImageNet-ILSVRC-2012 dataset that fall into the categories of verifiably pornographic: shot in a non-consensual setting (up-skirt), beach voyeuristic, and exposed private parts. We survey the landscape of harm and threats that both society at large and individuals face due to uncritical and ill-considered dataset curation practices. We then propose possible courses of correction and critique their pros and cons. We have duly open-sourced all of the code and the census meta-datasets generated in this endeavor for the computer vision community to build on. By unveiling the severity of the threats, our hope is to motivate the constitution of mandatory Institutional Review Boards (IRBs) for large-scale dataset curation.
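
For a sense of what such a model-based census involves mechanically, the following is a minimal Python sketch that walks an ImageNet-style directory layout and aggregates per-class flag rates. The scoring function is a stub standing in for a real NSFW classifier, and the threshold and directory layout are assumptions for illustration, not the paper's actual pipeline.

```python
# Illustrative sketch of a class-wise census over an image dataset.
# The threshold, layout, and scorer are assumptions, not the paper's method.
from collections import defaultdict
from pathlib import Path

NSFW_THRESHOLD = 0.8  # assumed cut-off for flagging an image

def nsfw_score(image_path: Path) -> float:
    """Stub: replace with a real NSFW classifier's probability output."""
    return 0.0  # placeholder score; a real model goes here

def census(root: Path) -> dict[str, dict[str, float]]:
    """Aggregate per-class image counts and flag rates for an
    ImageNet-style layout (root/<class_id>/<image>.JPEG)."""
    stats: dict[str, dict[str, float]] = defaultdict(
        lambda: {"n": 0, "flagged": 0})
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for img in class_dir.glob("*.JPEG"):
            s = stats[class_dir.name]
            s["n"] += 1
            if nsfw_score(img) >= NSFW_THRESHOLD:
                s["flagged"] += 1
    # Attach a per-class flag rate, skipping empty directories.
    return {cls: {**v, "rate": v["flagged"] / v["n"]}
            for cls, v in stats.items() if v["n"]}
```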

How We've Taught Algorithms to See Identity: Constructing Race and Gender in Image Databases for Facial Analysis

M. Scheuerman, Kandrea Wade, Caitlin Lustig + 1 other

May 28, 2020

Race and gender have long sociopolitical histories of classification in technical infrastructures, from the passport to social media. Facial analysis technologies are particularly pertinent to understanding how identity is operationalized in new technical systems. What facial analysis technologies can do is determined by the data available to train and evaluate them with. In this study, we specifically focus on this data by examining how race and gender are defined and annotated in image databases used for facial analysis. We found that the majority of image databases rarely contain underlying source material for how those identities are defined. Further, when they are annotated with race and gender information, database authors rarely describe the process of annotation. Instead, classifications of race and gender are portrayed as insignificant, indisputable, and apolitical. We discuss the limitations of these approaches given the sociohistorical nature of race and gender. We posit that the lack of critical engagement with this nature renders databases opaque and less trustworthy. We conclude by encouraging database authors to address both the histories of classification inherently embedded into race and gender, as well as their positionality in embedding such classifications.

Model Positionality and Computational Reflexivity: Promoting Reflexivity in Data Science

Darren Gergle, S. Cambo

March 8, 2022

Data science and machine learning provide indispensable techniques for understanding phenomena at scale, but the discretionary choices made when doing this work are often not recognized. Drawing from qualitative research practices, we describe how the concepts of positionality and reflexivity can be adapted to provide a framework for understanding, discussing, and disclosing the discretionary choices and subjectivity inherent to data science work. We first introduce the concepts of model positionality and computational reflexivity that can help data scientists to reflect on and communicate the social and cultural context of a model’s development and use, the data annotators and their annotations, and the data scientists themselves. We then describe the unique challenges of adapting these concepts for data science work and offer annotator fingerprinting and position mining as promising solutions. Finally, we demonstrate these techniques in a case study of the development of classifiers for toxic commenting in online communities.
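
As a rough illustration of what annotator fingerprinting could look like in practice, here is a minimal Python sketch that represents each annotator as a vector of their labels and compares annotators by pairwise agreement. The data layout and function names are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of "annotator fingerprinting": one label vector per
# annotator over a shared item set, compared by pairwise agreement.
import numpy as np

def fingerprints(labels: dict[str, dict[str, int]],
                 items: list[str]) -> dict[str, np.ndarray]:
    """Build one vector per annotator over a shared item set
    (items an annotator did not label are encoded as -1)."""
    return {ann: np.array([ann_labels.get(i, -1) for i in items])
            for ann, ann_labels in labels.items()}

def agreement(u: np.ndarray, v: np.ndarray) -> float:
    """Fraction of co-annotated items on which two annotators agree."""
    both = (u != -1) & (v != -1)  # items labeled by both annotators
    return float((u[both] == v[both]).mean()) if both.any() else float("nan")

# Example: two annotators diverging on item "c" -- a starting point for
# surfacing systematic positional differences, not a finished method.
labels = {"ann1": {"a": 1, "b": 0, "c": 1},
          "ann2": {"a": 1, "b": 0, "c": 0}}
fp = fingerprints(labels, ["a", "b", "c"])
print(agreement(fp["ann1"], fp["ann2"]))  # 0.666...
```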

Data Science at the Singularity

David Donoho

October 2, 2023

A purported "AI Singularity" has been in the public eye recently. Mass media and US national political attention have focused on "AI Doom" narratives hawked by social media influencers. The European Commission is announcing initiatives to forestall "AI Extinction". In my opinion, "AI Singularity" is the wrong narrative for what's happening now; recent happenings signal something else entirely. Something fundamental to computation-based research really changed in the last ten years. In certain fields, progress is dramatically more rapid than previously, as the fields undergo a transition to frictionless reproducibility (FR). This transition markedly changes the rate of spread of ideas and practices, affects mindsets, and erases memories of much that came before. The emergence of frictionless reproducibility follows from the maturation of three data science principles in the last decade. Those principles involve data sharing, code sharing, and competitive challenges, however implemented in the particularly strong form of frictionless open services. Empirical Machine Learning (EML) is today's leading adherent field, and its consequent rapid changes are responsible for the AI progress we see. Still, other fields can and do benefit when they adhere to the same principles. Many rapid changes from this maturation are misidentified. The advent of FR in EML generates a steady flow of innovations; this flow stimulates outsider intuitions that there's an emergent superpower somewhere in AI. This opens the way for PR to push worrying narratives: not only "AI Extinction", but also the supposed monopoly of big tech on AI research. The helpful narrative observes that the superpower of EML is adherence to frictionless reproducibility practices; these practices are responsible for the striking progress in AI that we see everywhere.
