dc.description.abstract |
Generating unit tests is a crucial undertaking in software development, demanding substantial time and effort from programmers. The advent of Large Language Models (LLMs) introduces a novel avenue for unit test script generation. This research experimentally investigates the effectiveness of LLMs, exemplified by ChatGPT, in generating unit test scripts for Python programs, and compares the generated test cases with those produced by an existing generator (Pynguin). For the experiments, we consider three types of code units: 1) procedural scripts, 2) function-based modular code, and 3) class-based code. The generated test cases are evaluated on criteria such as coverage, correctness, and readability. Through our experiments, we observed that the assertions generated by ChatGPT were not always correct, sometimes caused compilation errors, and at times did not comprehensively test the core logic. For small code units (approximately 100 lines of code (LOC)), ChatGPT-produced tests exhibit coverage on par with Pynguin. For larger units of 100 to 300 LOC, ChatGPT's ability to generate tests is superior to that of Pynguin, which was sometimes unable to generate test cases at all. The minimal overlap observed in the statements missed by ChatGPT and Pynguin suggests that a synergistic combination of both tools could enhance unit test generation performance. We also study how the performance of ChatGPT can be improved by prompt engineering, that is, by repeatedly asking it to improve the test cases. We observed that iteratively prompting ChatGPT improves coverage, which saturates after about four iterations. |
en_US |