AI-Based Automated Assessment Tools for Code Quality
Aims
As university cohort sizes increase, so does the grading workload for instructors, reducing the time they can spend supporting struggling students. One way to mitigate the increased workload is to have multiple teaching assistants grade the assignments. However, utilising multiple graders for a single assignment can introduce inconsistencies in the grades awarded and in the quality of the feedback; it also increases cost and is therefore not always practical. These issues are especially pronounced for subjective aspects of assessment, such as essay writing or quality aspects of technical artefacts like the code quality of computer programs. Code quality is comparable to writing quality in essays: readability, sound structure and ease of comprehension are critical to both.
In this study, we aim to develop a generative AI-based automatic assessment tool for code quality and to investigate how the quality of its automatic assessments compares with that of human graders. We will ask teaching assistants (typically postgraduate research students) to grade and provide feedback on several historical programming assignment submissions, to complete grading diaries, and to discuss their overall grading experience in a post-study semi-structured interview.
We will then use the average of the human-provided grades as benchmark and ground-truth data to train and develop a generative AI-based automatic assessment tool that provides grades and feedback; the human marking is also essential for evaluating the AI tool's accuracy.
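For illustration, the consensus benchmark could be assembled along the following lines; the file name and column names (submission_id, grader_id, grade) are placeholders rather than the project's actual data schema.

```python
# Minimal sketch (assumed schema): average the grades awarded by the different
# teaching assistants so each submission has a single consensus ground-truth score.
import pandas as pd

ratings = pd.read_csv("ta_grades.csv")  # one row per (submission, grader) pair

benchmark = (
    ratings.groupby("submission_id")["grade"]
    .agg(["mean", "std", "count"])          # std/count help flag low-agreement items
    .rename(columns={"mean": "consensus_grade"})
)

benchmark.to_csv("benchmark_grades.csv")
```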
We will fine-tune existing generative AI models, such as GPT and CodeBERT, to assess the code quality of programming assignments, including the readability, maintainability and documentation quality of the submissions.
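As a rough sketch of what the CodeBERT fine-tuning could look like using the Hugging Face transformers library, assuming the consensus grades have been joined with the submission source code: the data file, column names and hyperparameters below are illustrative assumptions, not the study's final design.

```python
# Illustrative only: a possible fine-tuning set-up for a CodeBERT-based grade
# predictor. "benchmark_with_code.csv" and its columns ("code", "consensus_grade")
# are hypothetical placeholders.
import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=1, problem_type="regression")


class SubmissionDataset(Dataset):
    """Pairs each submission's source code with its consensus human grade."""

    def __init__(self, frame):
        self.encodings = tokenizer(list(frame["code"]), truncation=True,
                                   padding="max_length", max_length=512)
        self.labels = frame["consensus_grade"].astype("float32").tolist()

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item


data = pd.read_csv("benchmark_with_code.csv")
train_frame = data.sample(frac=0.8, random_state=0)
val_frame = data.drop(train_frame.index)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="codebert-quality",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=SubmissionDataset(train_frame),
    eval_dataset=SubmissionDataset(val_frame),
)
trainer.train()
print(trainer.evaluate())  # mean-squared error on the held-out split
```

In practice we would also explore prompting and fine-tuning GPT-style models for generating the feedback text; the regression set-up above only illustrates the grade-prediction component.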
After developing the AI tool, we will evaluate it in two steps: first by comparing the grades and feedback it awards against our benchmark dataset of human-graded assignments, and then by surveying students and instructors on the tool's usefulness.
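The first evaluation step could, for example, report standard agreement measures between the AI-awarded grades and the human benchmark, as sketched below; file and column names are again placeholders.

```python
# Sketch of grade-agreement metrics between the AI tool and the human benchmark.
import pandas as pd
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import mean_absolute_error

human = pd.read_csv("benchmark_grades.csv")   # submission_id, consensus_grade
ai = pd.read_csv("ai_grades.csv")             # submission_id, ai_grade (hypothetical)
merged = human.merge(ai, on="submission_id")

mae = mean_absolute_error(merged["consensus_grade"], merged["ai_grade"])
pearson_r, _ = pearsonr(merged["consensus_grade"], merged["ai_grade"])
spearman_rho, _ = spearmanr(merged["consensus_grade"], merged["ai_grade"])

print(f"MAE: {mae:.2f}  Pearson r: {pearson_r:.2f}  Spearman rho: {spearman_rho:.2f}")
```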
We expect that such a tool, if successful, will be usable at a large number of universities and schools around the world, and across disciplines including computer science, mathematics and engineering. Automated assessment of code quality addresses a common and well-recognised problem in programming education.
In addition, the public availability of an anonymised and graded assessment dataset is in itself a valuable contribution to the international computing education research community. It can be used by other research groups to test and compare the effectiveness and accuracy of automated assessment and feedback tools.