Code Debug Environment

An Atropos RL environment for training LLMs to debug and fix buggy Python code.

Overview

This environment uses the HumanEvalPack dataset (Python subset, HumanEvalFix task), which contains 164 buggy Python functions with associated test suites. The model receives a buggy function and must output the corrected version inside \boxed{}. Scoring is done by executing the fixed code against the original test cases.
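The README does not show how the `\boxed{}` answer is parsed; as a minimal sketch (the actual helper in code_debug_env.py may differ), extraction needs to balance braces so that dict literals and f-strings inside the fix do not cut the answer short:

```python
def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} span, balancing nested braces.

    Hypothetical sketch of the answer-extraction step; returns None if no
    well-formed \\boxed{} span is found.
    """
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth = 1
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # unbalanced braces: treat as no answer
```

Brace counting (rather than a greedy regex) matters here because a fixed function routinely contains `{}` pairs of its own.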

Architecture

code_debug_env.py    # Main env (extends BaseEnv)
code_executor.py     # Safe subprocess execution with timeout
test_code_debug.py   # Unit tests
README.md            # This file
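The executor's core pattern — run the candidate fix plus the original test suite in a fresh interpreter with a hard timeout — can be sketched as follows. This is a hypothetical outline, not the real code_executor.py, which may add sandboxing and resource limits:

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(code: str, tests: str, timeout: float = 10.0):
    """Execute candidate code plus its test suite in a subprocess.

    Returns (passed, stderr). Sketch only: the real executor may restrict
    imports, memory, and filesystem access beyond what is shown here.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.returncode == 0, proc.stderr
    except subprocess.TimeoutExpired:
        return False, "timeout"
    finally:
        os.unlink(path)
```

A nonzero exit code (failed assertion, syntax error, or crash) counts as a failed verification, and the timeout caps runaway fixes such as infinite loops.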

Reward Design

Outcome                          Score         Description
All tests pass                   1.0           Perfect fix
Partial improvement              -0.5 to 0.9   More tests pass than with the buggy version
No improvement                   -0.5          Code runs but doesn't fix anything
Compilation error / regression   -1.0          Fix is worse than the original

When all rollouts in a group score 1.0, a length penalty is applied to encourage concise solutions (same pattern as sql_query_env).
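One way to express that shaping rule (a sketch with assumed constants; the exact scaling in code_debug_env.py and sql_query_env may differ) is to leave mixed groups untouched and, only when every rollout is already perfect, rescale rewards by response length:

```python
def apply_length_penalty(scores, token_lengths, max_tokens=2048):
    """If every rollout in the group scores 1.0, prefer shorter solutions.

    Hypothetical sketch: the 0.5 scale and max_tokens cap are assumptions,
    not values from the environment.
    """
    if all(s >= 1.0 for s in scores):
        return [1.0 - 0.5 * min(n / max_tokens, 1.0) for n in token_lengths]
    return scores  # mixed groups keep the raw outcome scores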

Setup

# Install dependencies (datasets is the only extra)
pip install datasets

# Run tests
cd environments/community/code_debug_env
python -m pytest test_code_debug.py -v

Usage

# Process mode (offline data generation)
python code_debug_env.py process \
    --env.data_path_to_save_groups data/code_debug.jsonl \
    --env.group_size 8 \
    --openai.base_url http://localhost:8000/v1 \
    --openai.model_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview"

# Serve mode (online RL training)
python code_debug_env.py serve \
    --openai.base_url http://localhost:9001/v1 \
    --openai.model_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview"

# Evaluate mode
python code_debug_env.py evaluate \
    --openai.base_url http://localhost:8000/v1 \
    --openai.model_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview"

WandB Metrics

Metric                   Description
train/percent_correct    Fraction of rollouts that pass all tests
train/avg_score          Average reward across rollouts
train/partial_fix_rate   Fraction of rollouts that partially fix the code
eval/percent_correct     Eval-set accuracy
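As a rough sketch of how per-rollout scores map onto these metrics (the metric names follow the table; the thresholds are assumptions derived from the reward design, not the environment's exact logic):

```python
def batch_metrics(scores):
    """Aggregate per-rollout reward scores into logged training metrics.

    Hypothetical sketch: assumes score >= 1.0 means all tests passed and a
    score strictly between -0.5 and 1.0 means a partial fix.
    """
    n = len(scores)
    return {
        "train/percent_correct": sum(s >= 1.0 for s in scores) / n,
        "train/avg_score": sum(scores) / n,
        "train/partial_fix_rate": sum(-0.5 < s < 1.0 for s in scores) / n,
    }
```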

Dataset

  • Source: bigcode/humanevalpack (Python subset)
  • License: Apache 2.0
  • Size: 164 problems
  • Split: 80% train / 20% test
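A deterministic 80/20 split over the 164 problems can be sketched as below. The split helper is hypothetical (the env's actual split logic and seed may differ); the `load_dataset` call uses the real `bigcode/humanevalpack` config name from the Dataset section:

```python
import random

def train_test_split(items, test_frac=0.2, seed=0):
    """Deterministic shuffle-and-split. Sketch only; seed and fraction
    handling in the actual environment may differ."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(items) * test_frac)
    test_idx = set(idx[:n_test])
    train = [x for i, x in enumerate(items) if i not in test_idx]
    test = [x for i, x in enumerate(items) if i in test_idx]
    return train, test

# Loading the dataset (requires network access):
# from datasets import load_dataset
# ds = load_dataset("bigcode/humanevalpack", "python", split="test")
# train, test = train_test_split(list(ds))
```

With 164 problems this yields 132 training and 32 test problems; a fixed seed keeps the split stable across runs.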

Compute Footprint

  • RAM: < 1 GB (dataset is small, execution is in subprocess)
  • CPU: < 5s per verification (subprocess with 10s timeout)
  • GPU: Only needed for the inference server