🥳 [2025-01-22]: ConvCodeWorld has been accepted to ICLR 2025! (OpenReview 🔗)
🌎 [2024-10-02]: We have open-sourced ConvCodeWorld and ConvCodeBench.
Large language models (LLMs) have proven invaluable for code generation, particularly in interactive settings. However, existing code generation benchmarks fail to capture the diverse feedback encountered in multi-turn interactions, limiting our ability to evaluate LLMs in these contexts. To address this gap, we present a set of novel benchmarks that explicitly model the quality of feedback provided to code generation LLMs. Our contributions are threefold: First, we introduce ConvCodeWorld, a novel and reproducible environment for benchmarking interactive code generation. ConvCodeWorld simulates 9 distinct interactive code generation scenarios by systematically combining three types of feedback: (a) compilation feedback; (b) execution feedback with varying test coverage; (c) verbal feedback generated by GPT-4o with different levels of expertise. Second, we introduce ConvCodeBench, a fast, static version of the benchmark that uses pre-generated feedback logs, eliminating the need for costly dynamic verbal feedback generation while maintaining strong Spearman's rank correlations (0.82 to 0.99) with ConvCodeWorld. Third, extensive evaluations of both closed-source and open-source LLMs on ConvCodeWorld reveal key insights: (a) LLM performance varies significantly based on the feedback provided; (b) weaker LLMs, given sufficient feedback, can outperform the single-turn results of state-of-the-art LLMs without feedback; (c) training on a specific feedback combination can limit an LLM's ability to utilize unseen combinations; (d) LLMs that solve problems in fewer turns (high MRR) may not solve as many problems overall (high Recall), and vice versa.
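To illustrate the MRR/Recall trade-off in (d), here is a minimal sketch, assuming MRR is computed from the reciprocal of the first turn at which each problem is solved and Recall is the fraction of problems solved within the turn budget. The function name and data layout are illustrative, not the repository's evaluation code.

```python
# Illustrative sketch (not the released evaluation code): given, for each problem,
# the 1-indexed turn at which it was first solved (or None if never solved within
# the turn budget), compute MRR and Recall.
from typing import Optional, Sequence

def mrr_and_recall(first_solved_turns: Sequence[Optional[int]]) -> tuple[float, float]:
    n = len(first_solved_turns)
    # MRR: average of 1/turn over all problems; unsolved problems contribute 0.
    mrr = sum(1.0 / t for t in first_solved_turns if t is not None) / n
    # Recall: fraction of problems solved at any turn.
    recall = sum(1 for t in first_solved_turns if t is not None) / n
    return mrr, recall

# Example: model A solves fewer problems overall, but in earlier turns, than model B.
model_a = [1, 1, None, None]   # MRR = 0.50, Recall = 0.50
model_b = [3, 4, 5, 5]         # MRR ≈ 0.25, Recall = 1.00
print(mrr_and_recall(model_a), mrr_and_recall(model_b))
```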
ConvCodeWorld provides novel, reproducible environments designed to assess the multi-turn code generation capabilities of LLMs. These environments incorporate a comprehensive categorization of feedback types that reflect diverse real-world programming scenarios.
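To make the feedback categorization concrete, below is a minimal sketch of a single interaction turn that collects (a) compilation and (b) execution feedback and delegates (c) verbal feedback to a simulator (e.g., a GPT-4o call). The function and dataclass names are illustrative assumptions, not the repository's API.

```python
# Minimal sketch of one ConvCodeWorld-style turn (illustrative names only).
import subprocess
import sys
import tempfile
from dataclasses import dataclass

@dataclass
class Feedback:
    compilation: str   # (a) compiler/interpreter errors, if any
    execution: str     # (b) failing-test output under the chosen test coverage
    verbal: str        # (c) simulated natural-language feedback (e.g., from GPT-4o)

def run_turn(code: str, tests: str, verbal_feedback_fn=None) -> Feedback:
    # (a) Compilation feedback: try to byte-compile the candidate code.
    try:
        compile(code, "<candidate>", "exec")
        compilation = "OK"
    except SyntaxError as e:
        compilation = f"SyntaxError: {e}"

    # (b) Execution feedback: run the provided tests against the candidate.
    execution = "skipped (compilation failed)"
    if compilation == "OK":
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code + "\n" + tests)
        proc = subprocess.run([sys.executable, f.name],
                              capture_output=True, text=True, timeout=30)
        execution = "all tests passed" if proc.returncode == 0 else proc.stderr

    # (c) Verbal feedback: delegated to a simulator (an LLM call), stubbed out here.
    verbal = verbal_feedback_fn(code, execution) if verbal_feedback_fn else ""
    return Feedback(compilation, execution, verbal)
```

The 9 benchmark scenarios then correspond to which of these feedback channels (and at what test coverage or expertise level) are passed back to the model at each turn.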
ConvCodeBench is a cost-effective static benchmark that correlates strongly with ConvCodeWorld. It uses interaction logs generated on ConvCodeWorld by a reference LLM (CodeLlama-7b-Instruct-hf), along with the corresponding simulated verbal feedback, to assess a target LLM's ability to refine code at each turn while keeping the previous interactions frozen. Because it eliminates the need to re-generate verbal feedback at each turn, ConvCodeBench is cheaper, faster, and more reproducible than live evaluation.
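A rough sketch of this static replay idea, assuming a simple JSON-lines log format: at each turn the target LLM is shown the frozen reference-LLM history plus the pre-generated feedback, and only its new refinement is scored. The field names, loader, and scoring hook are assumptions for illustration, not the released log schema.

```python
# Illustrative sketch of ConvCodeBench-style static evaluation.
# The log schema (problem, turns, feedback fields) is assumed for this example.
import json

def evaluate_static(log_path: str, target_llm, score_fn) -> float:
    """Replay frozen reference-LLM interactions and score the target LLM's refinements."""
    scores = []
    with open(log_path) as log_file:
        for line in log_file:
            record = json.loads(line)          # one logged multi-turn session
            history = [record["problem"]]      # prompt shown to the target LLM
            for turn in record["turns"]:
                # Frozen context: the reference LLM's code and the pre-generated feedback.
                history.append(turn["reference_code"])
                history.append(turn["feedback"])
                # The target LLM refines from this context; its output is scored,
                # but the frozen history is NOT updated with it.
                refinement = target_llm("\n\n".join(history))
                scores.append(score_fn(record["problem"], refinement))
    return sum(scores) / len(scores)
```

Keeping the history frozen is what removes the need for fresh verbal-feedback generation and makes runs reproducible across target LLMs.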
@inproceedings{han2025convcodeworld,
  title={ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments},
  author={Hojae Han and Seung-won Hwang and Rajhans Samdani and Yuxiong He},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=rpouyo09V0}
}