Abstract:
This thesis presents ConcurBench, a novel benchmark framework designed to evaluate the capabilities of Large Language Models (LLMs) in generating concurrent code. Concurrent programming remains one of the most challenging domains in software development, requiring careful attention to thread safety, synchronization, and race conditions. As LLMs increasingly become part of software development workflows, understanding their ability to generate correct concurrent code is crucial.

ConcurBench addresses this need by providing a comprehensive evaluation framework that extracts high-quality concurrent functions from popular open-source repositories, annotates them with natural language requirements, and tests LLMs’ ability to regenerate these functions with varying levels of context. The framework implements a multi-level context evaluation approach, testing LLMs with no context (function signature only), local context (surrounding functions/imports), and full context (entire file context).

The thesis details the design and implementation of ConcurBench’s pipeline architecture, including repository discovery and collection, function extraction, test discovery, LLM annotation, function generation, and evaluation. Key innovations include a dynamic test harness generation system that can compile and test LLM-generated code against original implementations without modification, and an orchestration wrapper script that enables scalable, automated evaluation across multiple functions and LLMs.

Experimental results demonstrate that context significantly impacts LLMs’ ability to generate correct concurrent code, with full context providing substantial improvements in functional correctness. The benchmark provides valuable insights into the strengths and limitations of current LLMs in handling concurrent programming tasks and establishes a methodology for evaluating future advancements in this domain.