DOI: 10.1145/3501385.3543957
Published 3 June 2022 in the International Computing Education Research Workshop

Automatic Generation of Programming Exercises and Code Explanations Using Large Language Models

Paul Denny, Juho Leinonen, Arto Hellas + 1 author

Abstract

This article explores the natural language generation capabilities of large language models with application to the production of two types of learning resources common in programming courses. Using OpenAI Codex as the large language model, we create programming exercises (including sample solutions and test cases) and code explanations, assessing these qualitatively and quantitatively. Our results suggest that the majority of the automatically generated content is both novel and sensible, and in some cases ready to use as is. When creating exercises we find that it is remarkably easy to influence both the programming concepts and the contextual themes they contain, simply by supplying keywords as input to the model. Our analysis suggests that there is significant value in massive generative machine learning models as a tool for instructors, although there remains a need for some oversight to ensure the quality of the generated content before it is delivered to students. We further discuss the implications of OpenAI Codex and similar tools for introductory programming education and highlight future research streams that have the potential to improve the quality of the educational experience for both teachers and students alike.
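
As a concrete illustration of the generation workflow the abstract describes, the sketch below shows how a programming-concept keyword and a contextual-theme keyword might be passed to an OpenAI completion model. It assumes the pre-1.0 openai Python client; the model name, keywords, and prompt wording are illustrative stand-ins, not the exact priming setup used in the paper (the Codex models have since been deprecated).

    # Sketch: keyword-driven exercise generation with an OpenAI completion model.
    # Assumes the pre-1.0 openai Python client (pip install "openai<1.0");
    # model, keywords, and prompt wording are illustrative, not the paper's setup.
    import openai

    openai.api_key = "YOUR_API_KEY"  # placeholder; use your own key

    concept = "list comprehensions"  # programming-concept keyword
    theme = "ice hockey"             # contextual-theme keyword

    prompt = (
        f"Write a Python programming exercise themed around {theme} that "
        f"practices {concept}. Include a problem statement, a sample "
        "solution, and unit tests.\n\nExercise:"
    )

    response = openai.Completion.create(
        model="code-davinci-002",  # Codex-era model; now deprecated
        prompt=prompt,
        max_tokens=512,
        temperature=0.7,
    )
    print(response["choices"][0]["text"])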

Related Scientific Articles

Experiences from Using Code Explanations Generated by Large Language Models in a Web Software Development E-Book

Arto Hellas, Paul Denny, S. Macneil + 5 others

4 November 2022

Advances in natural language processing have resulted in large language models (LLMs) that can generate code and code explanations. In this paper, we report on our experiences generating multiple code explanation types using LLMs and integrating them into an interactive e-book on web software development. Three different types of explanations -- a line-by-line explanation, a list of important concepts, and a high-level summary of the code -- were created. Students could view explanations by clicking a button next to code snippets, which showed the explanation and asked about its utility. Our results show that all explanation types were viewed by students and that the majority of students perceived the code explanations as helpful to them. However, student engagement varied by code snippet complexity, explanation type, and code snippet length. Drawing on our experiences, we discuss future directions for integrating explanations generated by LLMs into CS classrooms.
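
To make the three explanation types concrete, here is a minimal sketch of prompt templates one might use to request each type from an LLM; the template wording is hypothetical, not the prompts used in the e-book study.

    # Sketch: prompt templates for the three explanation types described above.
    # The wording is hypothetical, not the prompts used in the e-book study.
    EXPLANATION_PROMPTS = {
        "line_by_line": "Explain the following code line by line:\n\n{code}",
        "concepts": "List the important programming concepts used in the "
                    "following code:\n\n{code}",
        "summary": "Give a short high-level summary of what the following "
                   "code does:\n\n{code}",
    }

    def build_prompt(explanation_type: str, code: str) -> str:
        """Fill the chosen template with a code snippet."""
        return EXPLANATION_PROMPTS[explanation_type].format(code=code)

    print(build_prompt("summary", "for i in range(3):\n    print(i)"))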

Need a Programming Exercise Generated in Your Native Language? ChatGPT's Got Your Back: Automatic Generation of Non-English Programming Exercises Using OpenAI GPT-3.5

Kevin Ly, Mollie Jordan, Adalbert Gerald Soosai Raj

7 March 2024

Large language models (LLMs) like ChatGPT are changing computing education and may create additional barriers to those already faced by non-native English speakers (NNES) learning computing. We investigate an opportunity for a positive impact of LLMs on NNES through multilingual programming exercise generation. Following previous work with LLM exercise generation in English, we prompt OpenAI GPT-3.5 in 4 natural languages (English, Tamil, Spanish, and Vietnamese) to create introductory programming problems, sample solutions, and test cases. We evaluate these problems on their sensibility, readability, translation, sample solution accuracy, topicality, and cultural relevance. We find that problems generated in English, Spanish, and Vietnamese are largely sensible, easily understood, and accurate in their sample solutions. However, Tamil problems are mostly non-sensible and have a much lower passing test rate, indicating that the abilities of LLMs for problem generation are not generalizable across languages. Our analysis suggests that these problems could not be given verbatim to students, but with minimal effort, most errors can be fixed. We further discuss the benefits of these problems despite their flaws, and their opportunities to provide personalized and culturally relevant resources for students in their native languages.
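
A minimal sketch of the multilingual generation loop the abstract describes, assuming the openai>=1.0 Python client and the gpt-3.5-turbo chat model; the prompt wording is illustrative rather than the authors' exact instructions.

    # Sketch: requesting the same exercise brief in four natural languages.
    # Assumes the openai>=1.0 Python client; prompt wording is illustrative.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    for language in ["English", "Tamil", "Spanish", "Vietnamese"]:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": (
                    f"In {language}, write an introductory Python programming "
                    "problem with a sample solution and test cases."
                ),
            }],
        )
        print(f"--- {language} ---")
        print(response.choices[0].message.content)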

ChatGPT and Software Testing Education: Promises & Perils

Sajed Jalil, Suzzana Rafi, Kevin Moran + 2 others

7 February 2023

Over the past decade, predictive language modeling for code has proven to be a valuable tool for enabling new forms of automation for developers. More recently, we have seen the advent of general-purpose "large language models", based on neural transformer architectures, that have been trained on massive datasets of human-written text, which includes code and natural language. However, despite the demonstrated representational power of such models, interacting with them has historically been constrained to specific task settings, limiting their general applicability. Many of these limitations were recently overcome with the introduction of ChatGPT, a language model created by OpenAI and trained to operate as a conversational agent, enabling it to answer questions and respond to a wide variety of commands from end users. The introduction of models, such as ChatGPT, has already spurred fervent discussion from educators, ranging from fear that students could use these AI tools to circumvent learning, to excitement about the new types of learning opportunities that they might unlock. However, given the nascent nature of these tools, we currently lack fundamental knowledge related to how well they perform in different educational settings, and the potential promise (or danger) that they might pose to traditional forms of instruction. As such, in this paper, we examine how well ChatGPT performs when tasked with answering common questions in a popular software testing curriculum. We found that given its current capabilities, ChatGPT is able to respond to 77.5% of the questions we examined and that, of these questions, it is able to provide correct or partially correct answers in 55.6% of cases, provide correct or partially correct explanations of answers in 53.0% of cases, and that prompting the tool in a shared question context leads to a marginally higher rate of correct answers and explanations. Based on these findings, we discuss the potential promises and perils related to the use of ChatGPT by students and instructors.

Exploring the Potential of Large Language Models to Generate Formative Programming Feedback

Dominic Lohr, H. Keuning, Natalie Kiesler

31 August 2023

Ever since the emergence of large language models (LLMs) and related applications, such as ChatGPT, their performance and error analysis for programming tasks have been subject to research. In this work-in-progress paper, we explore the potential of such LLMs for computing educators and learners, as we analyze the feedback they generate for a given input containing program code. In particular, we aim at (1) exploring how an LLM like ChatGPT responds to students seeking help with their introductory programming tasks, and (2) identifying feedback types in its responses. To achieve these goals, we used students' programming sequences from a dataset gathered within a CS1 course as input for ChatGPT, along with questions required to elicit feedback and correct solutions. The results show that ChatGPT performs reasonably well for some of the introductory programming tasks and student errors, which means that students can potentially benefit. However, educators should provide guidance on how to use the provided feedback, as it can contain misleading information for novices.
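
To illustrate how a student's code and a help-seeking question might be combined into a single feedback-eliciting prompt, here is a hedged sketch; the function name and wording are hypothetical, not the prompts used in the study.

    # Sketch: combining a student's code with a help-seeking question into a
    # single feedback-eliciting prompt; names and wording are hypothetical.
    def feedback_prompt(task_description: str, student_code: str) -> str:
        return (
            "I am a student in an introductory programming course.\n"
            f"Task: {task_description}\n"
            f"My code so far:\n{student_code}\n"
            "What is wrong with my code, and how can I fix it?"
        )

    print(feedback_prompt(
        "Print the numbers 1 to 10.",
        "for i in range(10):\n    print(i)",  # off-by-one: prints 0..9
    ))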

Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses

M. Sakr, Marshall An, C. Bogart + 2 others

15 June 2023

This paper studies recent developments in large language models’ (LLM) abilities to pass assessments in introductory and intermediate Python programming courses at the postsecondary level. The emergence of ChatGPT resulted in heated debates over its potential uses (e.g., exercise generation, code explanation) as well as misuses in programming classes (e.g., cheating). Recent studies show that while the technology performs surprisingly well on diverse sets of assessment instruments employed in typical programming classes, the performance is usually not sufficient to pass the courses. The release of GPT-4 largely emphasized notable improvements in the capabilities related to handling assessments originally designed for human test-takers. This study is the necessary analysis in the context of this ongoing transition towards mature generative AI systems. Specifically, we report the performance of GPT-4, comparing it to the previous generations of GPT models, on three Python courses with assessments ranging from simple multiple-choice questions (no code involved) to complex programming projects with code bases distributed into multiple files (599 exercises overall). Additionally, we analyze the assessments that were not handled well by GPT-4 to understand the current limitations of the model, as well as its capabilities to leverage feedback provided by an auto-grader. We found that the GPT models evolved from completely failing the typical programming class’ assessments (the original GPT-3) to confidently passing the courses with no human involvement (GPT-4). While we identified certain limitations in GPT-4’s handling of MCQs and coding exercises, the rate of improvement across the recent generations of GPT models strongly suggests their potential to handle almost any type of assessment widely used in higher education programming courses. These findings could be leveraged by educators and institutions to adapt the design of programming assessments as well as to fuel the necessary discussions about how programming classes should be updated to reflect the recent technological developments. This study provides evidence that programming instructors need to prepare for a world in which there is an easy-to-use, widely accessible technology that can be utilized by learners to collect passing scores, with no effort whatsoever, on what today counts as viable programming knowledge and skills assessments.
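
The auto-grader feedback loop mentioned in the abstract can be sketched as a simple generate-grade-retry cycle. The code below assumes the openai>=1.0 Python client; the toy grader and retry bound are illustrative stand-ins, not the study's actual pipeline.

    # Sketch: a generate-grade-retry loop in which auto-grader feedback is fed
    # back to the model. Assumes the openai>=1.0 Python client; the toy grader
    # and retry bound are illustrative, not the study's pipeline.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask(prompt: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    def run_autograder(code: str) -> tuple[bool, str]:
        """Toy auto-grader: execute the submission and run one unit test."""
        namespace = {}
        try:
            exec(code, namespace)
            assert namespace["median"]([1, 3, 2]) == 2
            return True, "All tests passed."
        except Exception as exc:
            return False, f"Test failed: {exc!r}"

    task = ("Write a Python function median(xs) that returns the median of a "
            "list of numbers. Reply with code only, no prose.")
    code = ask(task)
    for _ in range(3):  # bounded number of repair attempts
        passed, feedback = run_autograder(code)
        if passed:
            break
        code = ask(f"{task}\nYour previous attempt:\n{code}\n"
                   f"Auto-grader feedback: {feedback}\nPlease fix the code.")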
