mirror of https://github.com/NousResearch/atropos.git synced 2026-04-19 12:57:58 +00:00

RUFFY-369 8cd30c3703 refactor:final production-ready audit; remove debug artifacts and non-ASCII characters		2026-03-28 00:21:49 +05:30
Code Debug Environment

An Atropos RL environment for training LLMs to debug and fix buggy Python code.

Overview

This environment uses the HumanEvalPack dataset (Python subset, HumanEvalFix task), which contains 164 buggy Python functions with associated test suites. The model receives a buggy function and must return the corrected version inside \boxed{}. Each fix is scored by executing it against the problem's original test cases.
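To illustrate how a fix wrapped in \boxed{} might be pulled out of a model response, here is a minimal sketch. The helper name `extract_boxed` is hypothetical and is not taken from code_debug_env.py. It assumes the last \boxed{} in the response is the answer and matches braces by counting, since Python code often contains its own `{` and `}`.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None.

    Hypothetical helper: the real environment may parse responses differently.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None  # the model never produced a boxed answer

    body_start = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    for i in range(body_start, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[body_start:i]
    return None  # braces never balanced, treat as no answer
```

Brace counting is only a heuristic: a fix containing an unbalanced brace inside a string literal would be cut short or rejected, so a stricter parser may be preferable in practice.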

Architecture

code_debug_env.py    # Main env (extends BaseEnv)
code_executor.py     # Safe subprocess execution with timeout
test_code_debug.py   # Unit tests
README.md            # This file
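As a rough picture of what "subprocess execution with timeout" can look like, the sketch below writes candidate code to a temporary file and runs it in a fresh interpreter. The function name `run_with_timeout` and its return shape are assumptions for illustration, not the interface of code_executor.py; the 10-second default mirrors the timeout quoted under Compute Footprint.

```python
import os
import subprocess
import sys
import tempfile
from typing import Tuple


def run_with_timeout(code: str, timeout_s: float = 10.0) -> Tuple[bool, str]:
    """Run code in a separate Python process.

    Returns (succeeded, detail), where detail is the captured stderr or
    the string "timeout". Hypothetical helper, not the project's API.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.py")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(code)
        try:
            proc = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                text=True,
                timeout=timeout_s,  # the child is killed when this expires
                cwd=tmp,            # keep stray files out of the repo
            )
        except subprocess.TimeoutExpired:
            return False, "timeout"
    # Exit code 0 means every assertion in the script passed.
    return proc.returncode == 0, proc.stderr
```

A timeout alone does not sandbox untrusted code; it only bounds run time. Restricting filesystem and network access would need additional isolation.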

Reward Design

Outcome	Score	Description
All tests pass	1.0	Perfect fix
Partial improvement	-0.5 to 0.9	More tests pass than with the buggy version, but not all
No improvement	-0.5	Code runs but passes no more tests than the buggy version
Compilation error / regression	-1.0	Code fails to compile, or the fix is worse than the original
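The table can be read as a function of how many tests pass before and after the fix. The sketch below is one possible reading, not the environment's actual scoring code: the name `score_fix` and its parameters are hypothetical, and the linear interpolation between -0.5 and 0.9 for partial improvement is an assumption, since the table only gives the range.

```python
def score_fix(passed: int, baseline_passed: int, total: int,
              compiled: bool = True) -> float:
    """Map test results to a reward following the table above.

    passed          -- tests passed by the model's fix
    baseline_passed -- tests passed by the original buggy code
    total           -- number of tests for the problem
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not compiled or passed < baseline_passed:
        return -1.0  # compilation error or regression
    if passed == total:
        return 1.0   # perfect fix
    if passed == baseline_passed:
        return -0.5  # runs, but no improvement
    # Partial improvement. ASSUMPTION: scale linearly with the share of
    # previously failing tests that now pass; stays strictly below 0.9.
    gain = (passed - baseline_passed) / (total - baseline_passed)
    return -0.5 + 1.4 * gain
```

For example, with 10 tests of which the buggy code passes 4, a fix passing 7 recovers half of the failing tests and lands near 0.2 under this assumed interpolation.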

When all rollouts in a group score 1.0, a length penalty is applied to encourage concise solutions (same pattern as sql_query_env).
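A length penalty of this kind could look roughly like the following. This is a sketch under stated assumptions, not the code shared with sql_query_env: the function name, the `max_penalty` parameter, and the linear scaling between the shortest and longest response are all illustrative choices.

```python
from typing import List


def apply_length_penalty(scores: List[float], lengths: List[int],
                         max_penalty: float = 0.5) -> List[float]:
    """Penalize longer responses, but only when every rollout scored 1.0.

    lengths holds one size per rollout (for example, token counts).
    """
    if len(scores) != len(lengths):
        raise ValueError("scores and lengths must have the same size")
    if not scores or any(s != 1.0 for s in scores):
        return list(scores)  # mixed group: scores already carry signal
    shortest, longest = min(lengths), max(lengths)
    if shortest == longest:
        return list(scores)  # identical lengths: nothing to distinguish
    span = longest - shortest
    # ASSUMPTION: the shortest rollout keeps 1.0 and the longest loses
    # max_penalty, with linear scaling in between.
    return [1.0 - max_penalty * (n - shortest) / span for n in lengths]
```

The point of the pattern is that a group of identical scores gives the trainer no preference signal, so length is used as a tiebreaker among equally correct fixes.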

Setup

# Install dependencies (datasets is the only extra)
pip install datasets

# Run tests
cd environments/community/code_debug_env
python -m pytest test_code_debug.py -v

Usage

# Process mode (offline data generation)
python code_debug_env.py process \
    --env.data_path_to_save_groups data/code_debug.jsonl \
    --env.group_size 8 \
    --openai.base_url http://localhost:8000/v1 \
    --openai.model_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview"

# Serve mode (online RL training)
python code_debug_env.py serve \
    --openai.base_url http://localhost:9001/v1 \
    --openai.model_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview"

# Evaluate mode
python code_debug_env.py evaluate \
    --openai.base_url http://localhost:8000/v1 \
    --openai.model_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview"

WandB Metrics

Metric	Description
`train/percent_correct`	Fraction of rollouts that pass all tests
`train/avg_score`	Average reward across rollouts
`train/partial_fix_rate`	Fraction of rollouts that partially fix the code
`eval/percent_correct`	Eval set accuracy
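The training metrics follow directly from per-rollout rewards. A minimal sketch of that aggregation is shown below; the function name `summarize_scores` is hypothetical, and treating any score strictly between -0.5 and 1.0 as a partial fix is an assumption based on the Reward Design table.

```python
from typing import Dict, List


def summarize_scores(scores: List[float]) -> Dict[str, float]:
    """Aggregate raw per-rollout rewards into the training metrics."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    return {
        # Rollouts that passed every test.
        "train/percent_correct": sum(s == 1.0 for s in scores) / n,
        # Mean reward over all rollouts.
        "train/avg_score": sum(scores) / n,
        # ASSUMPTION: partial fixes are scores strictly between the
        # no-improvement value (-0.5) and a perfect fix (1.0).
        "train/partial_fix_rate": sum(-0.5 < s < 1.0 for s in scores) / n,
    }
```

These counts only make sense on scores taken before any length penalty is applied, since the penalty pushes fully correct rollouts below 1.0.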

Dataset

Source: bigcode/humanevalpack (Python subset)
License: Apache 2.0
Size: 164 problems
Split: 80% train / 20% test
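With 164 problems, an 80/20 split gives 131 training and 33 test problems when the cut point is rounded down. The sketch below shows one reproducible way to build such a split over problem indices. The function name, the fixed seed, and shuffling before the cut are assumptions; the README does not say how the environment partitions the data.

```python
import random
from typing import List, Tuple


def split_indices(n_items: int = 164, train_frac: float = 0.8,
                  seed: int = 42) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) for a seeded shuffle split."""
    indices = list(range(n_items))
    # A private Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(indices)
    cut = int(n_items * train_frac)  # rounds down: 131 for 164 items
    return indices[:cut], indices[cut:]


train_idx, test_idx = split_indices()
print(len(train_idx), len(test_idx))
```

The index lists could then be used to select rows from the loaded dataset, for instance with `Dataset.select` from the Hugging Face `datasets` library.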

Compute Footprint

RAM: < 1 GB (the dataset is small and candidate code runs in a subprocess)
CPU: < 5 s per verification (each subprocess has a 10 s timeout)
GPU: Only needed for the inference server