The State of the ML-universe: 10 Years of Artificial Intelligence & Machine Learning Software Development on GitHub
Abstrak
In the last few years, artificial intelligence (AI) and machine learning (ML) have become ubiquitous terms. These powerful techniques have escaped obscurity in academic communities with the recent onslaught of AI & ML tools, frameworks, and libraries that make these techniques accessible to a wider audience of developers. As a result, applying AI & ML to solve existing and emergent problems is an increasingly popular practice. However, little is known about this domain from the software engineering perspective. Many AI & ML tools and applications are open source, hosted on platforms such as GitHub that provide rich tools for large-scale distributed software development. Despite widespread use and popularity, these repositories have never been examined as a community to identify unique properties, development patterns, and trends. In this paper, we conducted a large-scale empirical study of AI & ML Tool (700) and Application (4,524) repositories hosted on GitHub to develop such a characterization. While not the only platform hosting AI & ML development, GitHub facilitates collecting a rich data set for each repository with high traceability between issues, commits, pull requests and users. To compare the AI & ML community to the wider population of repositories, we also analyzed a set of 4,101 unrelated repositories. We enhance this characterization with an elaborate study of developer workflow that measures collaboration and autonomy within a repository. We've captured key insights of this community's 10 year history such as it's primary language (Python) and most popular repositories (Tensorflow, Tesseract). Our findings show the AI & ML community has unique characteristics that should be accounted for in future research.
Artikel Ilmiah Terkait
Alexandre Decan T. Mens Pooya Rostami Mazrae + 1 lainnya
1 Oktober 2022
GitHub Actions was introduced in 2019 and constitutes an integrated alternative to CI/CD services for GitHub repositories. The deep integration with GitHub allows repositories to easily automate software development workflows. This paper empirically studies the use of GitHub Actions on a dataset comprising 68K repositories on GitHub, of which 43.9% are using GitHub Actions workflows. We analyse which workflows are automated and identify the most frequent automation practices. We show that reuse of actions is a common practice, even if this reuse is concentrated in a limited number of actions. We study which actions are most frequently used and how workflows refer to them. Furthermore, we discuss the related security and versioning aspects. As such, we provide an overview of the use of GitHub Actions, constituting a necessary first step towards a better understanding of this emerging ecosystem and its implications on collaborative software development in the GitHub social coding platform.
A. Mockus Minghui Zhou Zhixing Wang + 1 lainnya
1 Februari 2023
The wide use of Deep learning (DL) has not been followed by the corresponding advances in software engineering (SE) for DL. Research shows that developers writing DL software have specific development stages (i.e., SE4DL stages) and face new DL-specific problems. Despite substantial research, it is unclear how DL developers’ SE needs for DL vary over stages, application types, or if they change over time. To help focus research and development efforts on DL-development challenges, we analyze 92,830 Stack Overflow (SO) questions and 227,756 READMEs of public repositories related to DL. Latent Dirichlet Allocation (LDA) reveals 27 topics for the SO questions where 19 (70.4%) topics mainly relate to a single SE4DL stage, and eight topics span multiple stages. Most questions concern <italic>Data Preparation</italic> and <italic>Model Setup</italic> stages. The relative rates of questions for 11 topics have increased, for eight topics decreased over time. Questions for the former 11 topics had a lower percentage of accepting an answer than the remaining questions. LDA on README files reveals 16 distinct application types for the 227k repositories. We apply the LDA model fitted on READMEs to the 92,830 SO questions and find that 27% of the questions are related to the 16 DL application types. The most asked question topic varies across application types, with half primarily relating to the second and third stages. Specifically, developers ask the most questions about topics primarily relating to <italic>Data Preparation</italic> (2nd) stage for four mature application types such as <inline-formula><tex-math notation="LaTeX">${{\sf Image\ Segmentation}}$</tex-math><alternatives><mml:math><mml:mrow><mml:mi mathvariant="sans-serif">Image</mml:mi><mml:mspace width="4pt"/><mml:mi mathvariant="sans-serif">Segmentation</mml:mi></mml:mrow></mml:math><inline-graphic xlink:href="zhou-ieq1-3163576.gif"/></alternatives></inline-formula>, and topics primarily relating to <italic>Model Setup</italic> (3rd) stage for four application types concerning emerging methods such as <inline-formula><tex-math notation="LaTeX">${{\sf Transfer\ Learning}}$</tex-math><alternatives><mml:math><mml:mrow><mml:mi mathvariant="sans-serif">Transfer</mml:mi><mml:mspace width="4pt"/><mml:mi mathvariant="sans-serif">Learning</mml:mi></mml:mrow></mml:math><inline-graphic xlink:href="zhou-ieq2-3163576.gif"/></alternatives></inline-formula>. Based on our findings, we distill several actionable insights for SE4DL research, practice, and education, such as better support for using trained models, application-type specific tools, and teaching materials.
Dhivyabharathi Ramasamy Cristina Sarasua Abraham Bernstein + 1 lainnya
19 November 2022
Despite the ubiquity of data science, we are far from rigorously understanding how coding in data science is performed. Even though the scientific literature has hinted at the iterative and explorative nature of data science coding, we need further empirical evidence to understand this practice and its workflows in detail. Such understanding is critical to recognise the needs of data scientists and, for instance, inform tooling support. To obtain a deeper understanding of the iterative and explorative nature of data science coding, we analysed 470 Jupyter notebooks publicly available in GitHub repositories. We focused on the extent to which data scientists transition between different types of data science activities, or steps (such as data preprocessing and modelling), as well as the frequency and co-occurrence of such transitions. For our analysis, we developed a dataset with the help of five data science experts, who manually annotated the data science steps for each code cell within the aforementioned 470 notebooks. Using the first-order Markov chain model, we extracted the transitions and analysed the transition probabilities between the different steps. In addition to providing deeper insights into the implementation practices of data science coding, our results provide evidence that the steps in a data science workflow are indeed iterative and reveal specific patterns. We also evaluated the use of the annotated dataset to train machine-learning classifiers to predict the data science step(s) of a given code cell. We investigate the representativeness of the classification by comparing the workflow analysis applied to (a) the predicted data set and (b) the data set labelled by experts, finding an F1-score of about 71% for the 10-class data science step prediction problem.
Natarajan Chidambaram Eleni Constantinou Alexandre Decan + 2 lainnya
1 September 2022
Using popular open source projects on GitHub, we provide evidence that bots are regularly among the most active contributors, even though GitHub does not explicitly acknowledge their presence. This poses a problem for techniques that analyze human contributor activity.
Kartik Rawool Bowen Xu Deep Mehta + 1 lainnya
20 Desember 2023
Automating software development processes through the orchestration of GitHub Action workflows has revolutionized the efficiency and agility of software delivery pipelines. This paper presents a detailed investigation into the use of Large Language Models (LLMs) specifically, GPT 3.5 and GPT 4 to generate and evaluate GitHub Action workflows for DevOps tasks. Our methodology involves data collection from public GitHub repositories, prompt engineering for LLM utilization, and evaluation metrics encompassing exact match scores, BLEU scores, and a novel DevOps Aware score. The research scrutinizes the proficiency of GPT 3.5 and GPT 4 in generating GitHub workflows, while assessing the influence of various prompt elements in constructing the most efficient pipeline. Results indicate substantial advancements in GPT 4, particularly in DevOps awareness and syntax correctness. The research introduces a GitHub App built on Probot, empowering users to automate workflow generation within GitHub ecosystem. This study contributes insights into the evolving landscape of AI-driven automation in DevOps practices.
Daftar Referensi
0 referensiTidak ada referensi ditemukan.
Artikel yang Mensitasi
0 sitasiTidak ada artikel yang mensitasi.