Posted on 2025-08-01. Authored by Lorenzo Gallone.
This thesis presents a test-driven framework for enhancing the reliability of code generated
by Large Language Models (LLMs), focusing on real-world applicability and minimal developer
assistance. The system is designed to simulate a realistic development environment where
no ground-truth implementations are available to the model, relying exclusively on textual
artifacts such as documentation, docstrings, and test outcomes. This constraint ensures that
every generated function is derived from semantic understanding rather than replication or
pattern-matching.
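To make that constraint concrete, the per-function context supplied to the model can be pictured as a small record like the sketch below; the field names and structure are illustrative assumptions for exposition, not the framework's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class FunctionContext:
    """Textual artifacts the model may see; no reference implementation is ever included."""
    signature: str                                    # e.g. a def line with type hints
    docstring: str                                    # natural-language description of the behavior
    module_documentation: str                         # surrounding README or module-level docs
    test_feedback: list[str] = field(default_factory=list)   # error signals from prior attempts
```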
A core innovation of this work is the integration of an iterative refinement loop, which
introduces structured feedback into the code generation process. After producing an initial
function from a natural language prompt, the model’s output is immediately tested. If failures
occur, relevant error signals are extracted and used to update the prompt, allowing the model
to revise its solution. This loop continues until the implementation passes all associated tests
or a retry limit is reached. The system thus mirrors a human-like workflow of test-driven
development and debugging.
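A minimal sketch of such a generate-test-refine loop might look as follows, assuming hypothetical generate, run_tests, and build_repair_prompt helpers; the thesis's actual interfaces may differ.

```python
from typing import Callable, NamedTuple

class TestResult(NamedTuple):
    passed: bool
    failures: str                                    # concatenated error signals (tracebacks, assertion messages)

def refine_until_passing(
    initial_prompt: str,
    generate: Callable[[str], str],                  # LLM call: prompt -> candidate source code
    run_tests: Callable[[str], TestResult],          # test harness for the target function
    build_repair_prompt: Callable[[str, str, str], str],
    max_retries: int = 5,
) -> tuple[str, bool]:
    """Generate, test, and iteratively repair a function until it passes or retries run out."""
    prompt = initial_prompt
    for _ in range(max_retries + 1):
        candidate = generate(prompt)                 # produce (or revise) an implementation
        result = run_tests(candidate)                # execute the associated tests
        if result.passed:
            return candidate, True                   # all tests pass: accept this candidate
        # Extract error signals from the failures and fold them into an updated prompt.
        prompt = build_repair_prompt(initial_prompt, candidate, result.failures)
    return candidate, False                          # retry limit reached without success
```

The retry limit bounds computational cost while still allowing several repair attempts, mirroring the retry budget described above.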
To assess the contribution of this iterative process, the same framework is also evaluated in
a non-iterative configuration, where each function is generated only once based on its prompt
and tested without revision.
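Under the same hypothetical interface, the non-iterative baseline reduces to a single generation and a single test run: setting the retry budget to zero yields exactly one attempt, and any repair prompt built on failure is simply discarded.

```python
# Single-shot baseline: generate once, test once, never revise.
candidate, passed = refine_until_passing(
    initial_prompt, generate, run_tests, build_repair_prompt, max_retries=0
)
```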
The evaluation is conducted on entire Python repositories—not isolated functions—making
the task significantly more complex. Functions are embedded in larger software structures,
depend on shared state or class behavior, and are often indirectly tested through multi-layered
scenarios. The system parses these repositories to extract structural metadata, resolve function-
to-test mappings, and build context-aware prompts that support both initial generation and
iterative correction.
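One plausible way to obtain such structural metadata and function-to-test mappings is to walk the repository with Python's ast module and pair each function with the test files that mention its name; the heuristic below is an illustrative sketch, not the resolution strategy used in the thesis.

```python
import ast
from pathlib import Path

def index_repository(repo_root: str) -> dict[str, dict]:
    """Collect per-function metadata (docstring, enclosing file) from a repository."""
    functions: dict[str, dict] = {}
    for path in Path(repo_root).rglob("*.py"):
        if path.name.startswith("test_"):
            continue                                  # index production code here, tests below
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                functions[node.name] = {
                    "file": str(path),
                    "docstring": ast.get_docstring(node) or "",
                    "tests": [],                      # filled in by the mapping pass below
                }
    # Naive function-to-test mapping: a test file "covers" a function
    # if it mentions the function's name anywhere in its source.
    for test_path in Path(repo_root).rglob("test_*.py"):
        source = test_path.read_text(encoding="utf-8")
        for name, meta in functions.items():
            if name in source:
                meta["tests"].append(str(test_path))
    return functions
```

A name-based match is deliberately coarse; the actual system could instead rely on import analysis or coverage data, which this sketch does not attempt.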
The results demonstrate that embedding LLMs into a feedback-rich environment substantially increases their capacity to produce robust, test-passing code. Despite the added computational cost, the iterative approach leads to higher success rates across a diverse range of codebases, showing that language models, when guided by empirical signals and properly contextualized, can evolve from static generators into adaptive agents capable of producing functionally correct and maintainable code.