IIIT-Delhi Institutional Repository

Unit test generation using LLMs: a comparative performance analysis of autogeneration tools


dc.contributor.author Gandhi, Tarushi
dc.contributor.author Jalote, Pankaj (Advisor)
dc.date.accessioned 2024-05-20T07:17:35Z
dc.date.available 2024-05-20T07:17:35Z
dc.date.issued 2023-11-29
dc.identifier.uri http://repository.iiitd.edu.in/xmlui/handle/123456789/1531
dc.description.abstract Generating unit tests is a crucial undertaking in software development, demanding substantial time and effort from programmers. The advent of Large Language Models (LLMs) introduces a novel avenue for unit test script generation. This research experimentally investigates the effectiveness of LLMs, exemplified by ChatGPT, at generating unit test scripts for Python programs, and compares the generated test cases with those produced by an existing test generator (Pynguin). For the experiments, we consider three types of code units: 1) procedural scripts, 2) function-based modular code, and 3) class-based code. The generated test cases are evaluated on criteria such as coverage, correctness, and readability. Through our experiments, we observed that the assertions generated by ChatGPT were not always correct, exhibited issues such as compilation errors, and sometimes did not comprehensively test the core logic. For small code units (approximately 100 lines of code (LOC)), ChatGPT-produced tests exhibit coverage on par with Pynguin's. For larger units of 100 to 300 LOC, ChatGPT's ability to generate tests is superior to Pynguin's, as the latter sometimes failed to generate test cases at all. The minimal overlap we observed in the statements missed by ChatGPT and Pynguin suggests that a synergistic combination of both tools could enhance unit test generation performance. We also study how the performance of ChatGPT can be improved by prompt engineering, i.e., by repeatedly asking it to improve the test cases. We observed that iteratively prompting ChatGPT improves coverage, which saturates after about four iterations. en_US
dc.language.iso en_US en_US
dc.publisher IIIT-Delhi en_US
dc.subject Large Language Models en_US
dc.subject ChatGPT en_US
dc.subject Unit Test Generation en_US
dc.subject Coverage en_US
dc.title Unit test generation using LLMs: a comparative performance analysis of autogeneration tools en_US
dc.type Other en_US
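
Illustrative sketch. To make the kinds of code units and generated test scripts described in the abstract concrete, the following is a minimal, hypothetical example: a small function-based code unit and a pytest-style test script of the sort an LLM such as ChatGPT might produce for it. The function apply_discount and the tests are illustrative assumptions, not material from the thesis.

    # example.py -- a small function-based code unit (hypothetical)
    def apply_discount(price: float, percent: float) -> float:
        """Return the price after applying a percentage discount."""
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        return price * (1 - percent / 100)

    # test_example.py -- pytest-style tests of the kind an LLM might generate
    import pytest
    from example import apply_discount

    def test_apply_discount_basic():
        # Core logic: a 10% discount on 100.0 should yield 90.0.
        assert apply_discount(100.0, 10) == pytest.approx(90.0)

    def test_apply_discount_boundaries():
        # Boundary values: 0% and 100% discounts.
        assert apply_discount(50.0, 0) == pytest.approx(50.0)
        assert apply_discount(50.0, 100) == pytest.approx(0.0)

    def test_apply_discount_invalid_percent():
        # Error path: out-of-range percentages should raise.
        with pytest.raises(ValueError):
            apply_discount(100.0, 150)

Statement coverage of such a test script can be measured with coverage.py (e.g., coverage run -m pytest followed by coverage report), and Pynguin can be pointed at the same module for comparison, roughly: pynguin --project-path . --module-name example --output-path tests (exact flags depend on the Pynguin version).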

