mirror of
https://github.com/NousResearch/atropos.git
synced 2026-04-19 12:57:58 +00:00
Code Debug Environment
An Atropos RL environment for training LLMs to debug and fix buggy Python code.
Overview
This environment uses the HumanEvalPack dataset (Python subset, HumanEvalFix task), which contains 164 buggy Python functions with associated test suites. The model receives a buggy function and must output the corrected version inside \boxed{}. Each submission is scored by executing the corrected code against the problem's original test cases.
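Because a Python function body can itself contain braces (dict or set literals, f-strings), pulling the answer out of \boxed{} needs brace matching rather than a simple regex. The following is a minimal sketch of one way to do that; the helper name extract_boxed is hypothetical and is not taken from code_debug_env.py, which may parse the response differently.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None.

    Hypothetical helper: tracks brace depth so that braces inside the
    answer (e.g. a dict literal) do not end the match early.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None  # the model produced no boxed answer

    body_start = start + len(marker)
    depth = 1
    for i in range(body_start, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[body_start:i]
    return None  # braces never balanced, treat as unparseable


response = "Here is the fix: \\boxed{def f(x):\n    return {x: x + 1}}"
print(extract_boxed(response))
```

One limitation of this sketch: an unbalanced brace inside a string literal in the answer (for example "}" on its own) would throw off the depth count.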
Architecture
code_debug_env.py # Main env (extends BaseEnv)
code_executor.py # Safe subprocess execution with timeout
test_code_debug.py # Unit tests
README.md # This file
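As an illustration of the subprocess-with-timeout approach that code_executor.py is described as using, here is a minimal sketch built only on the standard library. The function name run_with_timeout and its return shape are assumptions for this example, not the module's actual interface.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Tuple


def run_with_timeout(code: str, timeout_s: float = 10.0) -> Tuple[bool, str]:
    """Run code in a fresh Python subprocess.

    Returns (succeeded, detail), where detail is the captured stderr,
    or the string "timeout" if the time limit was hit.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,  # subprocess.run kills the child on expiry
                cwd=tmp,            # keep stray file writes inside the temp dir
            )
        except subprocess.TimeoutExpired:
            return False, "timeout"
    return proc.returncode == 0, proc.stderr


# A candidate fix followed by its test assertions, as a single script.
ok, detail = run_with_timeout("def add(a, b):\n    return a + b\n\nassert add(2, 3) == 5\n")
print(ok)
```

A separate process with a timeout protects the trainer from infinite loops and crashes, but it is not a security sandbox; untrusted code can still touch the filesystem and network unless further isolation is added.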
Reward Design
| Outcome | Score | Description |
|---|---|---|
| All tests pass | 1.0 | Perfect fix |
| Partial improvement | -0.5 to 0.9 | More tests pass than with the buggy version, but not all |
| No improvement | -0.5 | Code runs but passes no more tests than the buggy version |
| Compilation error / regression | -1.0 | Code fails to compile, or passes fewer tests than the buggy version |
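The table can be read as a small scoring function over test counts. The sketch below is one possible mapping: the function name score_fix and the linear interpolation used for the partial-improvement band are assumptions, since the table gives only the range -0.5 to 0.9 and not the formula.

```python
def score_fix(passed: int, total: int, baseline_passed: int, ran: bool = True) -> float:
    """Map test results to a reward, following the table above.

    passed:          tests passed by the model's fix
    total:           total number of tests
    baseline_passed: tests passed by the original buggy code
    ran:             False if the fix failed to compile or execute
    """
    if not ran or passed < baseline_passed:
        return -1.0  # compilation error or regression
    if passed == total:
        return 1.0   # perfect fix
    if passed == baseline_passed:
        return -0.5  # runs, but no improvement
    # Partial improvement. ASSUMPTION: scale linearly between -0.5 and 0.9
    # by the fraction of previously failing tests that now pass.
    gained = passed - baseline_passed
    possible = total - baseline_passed
    return -0.5 + 1.4 * gained / possible


# Buggy version passed 4 of 10 tests; the fix passes 7 of 10.
print(round(score_fix(passed=7, total=10, baseline_passed=4), 3))
```

Keeping the partial band strictly below 1.0 means a group only reaches the all-1.0 case when every rollout passes every test.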
When all rollouts in a group score 1.0, a length penalty is applied to encourage concise solutions (same pattern as sql_query_env).
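A length penalty of this kind can be sketched as follows. The exact rule is not stated here, so the 50% threshold, the linear ramp down to 0.0, and the name apply_length_penalty are all assumptions chosen for illustration; consult the environment source for the real behaviour.

```python
from typing import List


def apply_length_penalty(
    scores: List[float],
    token_lengths: List[int],
    max_token_length: int,
    threshold_frac: float = 0.5,  # ASSUMPTION: penalty starts at 50% of the limit
) -> List[float]:
    """Rescale scores by response length, but only when every score is 1.0."""
    if not scores or any(s != 1.0 for s in scores):
        return list(scores)  # mixed group: leave scores untouched

    threshold = max_token_length * threshold_frac
    span = max_token_length - threshold
    penalized = []
    for length in token_lengths:
        if length <= threshold or span <= 0:
            penalized.append(1.0)  # short enough, no penalty
        else:
            # Fraction of the way from the threshold to the hard limit.
            over = min((length - threshold) / span, 1.0)
            penalized.append(1.0 - over)
    return penalized


print(apply_length_penalty([1.0, 1.0], [100, 768], max_token_length=1024))
```

Without some such penalty, a group in which every rollout scores 1.0 carries no relative signal; differentiating by length gives the trainer something to prefer among equally correct fixes.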
Setup
# Install dependencies (datasets is the only extra)
pip install datasets
# Run tests
cd environments/community/code_debug_env
python -m pytest test_code_debug.py -v
Usage
# Process mode (offline data generation)
python code_debug_env.py process \
--env.data_path_to_save_groups data/code_debug.jsonl \
--env.group_size 8 \
--openai.base_url http://localhost:8000/v1 \
--openai.model_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
# Serve mode (online RL training)
python code_debug_env.py serve \
--openai.base_url http://localhost:9001/v1 \
--openai.model_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
# Evaluate mode
python code_debug_env.py evaluate \
--openai.base_url http://localhost:8000/v1 \
--openai.model_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
WandB Metrics
| Metric | Description |
|---|---|
| train/percent_correct | Fraction of rollouts that pass all tests |
| train/avg_score | Average reward across rollouts |
| train/partial_fix_rate | Fraction of rollouts that partially fix the code |
| eval/percent_correct | Eval set accuracy |
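To make the training metrics concrete, the sketch below derives them from a list of rollout scores. The function name summarize_scores is hypothetical, and treating any score strictly between -0.5 and 1.0 as a partial fix is an assumption based on the reward table, not a definition taken from the environment code.

```python
from typing import Dict, List


def summarize_scores(scores: List[float]) -> Dict[str, float]:
    """Aggregate per-rollout rewards into the training metrics listed above."""
    n = len(scores)
    if n == 0:
        return {
            "train/percent_correct": 0.0,
            "train/avg_score": 0.0,
            "train/partial_fix_rate": 0.0,
        }
    return {
        # Rollouts that passed every test.
        "train/percent_correct": sum(s == 1.0 for s in scores) / n,
        "train/avg_score": sum(scores) / n,
        # ASSUMPTION: "partial fix" means a score strictly inside (-0.5, 1.0).
        "train/partial_fix_rate": sum(-0.5 < s < 1.0 for s in scores) / n,
    }


print(summarize_scores([1.0, 1.0, 0.2, -0.5]))
```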
Dataset
- Source: bigcode/humanevalpack (Python subset)
- License: Apache 2.0
- Size: 164 problems
- Split: 80% train / 20% test
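With 164 problems, an 80/20 split yields roughly 131 training and 33 test problems. The sketch below shows one reproducible way to produce such a split over problem indices; the seed, the shuffling step, and the name split_indices are assumptions, and the environment may partition the data differently.

```python
import random
from typing import List, Tuple


def split_indices(
    n: int = 164,
    train_frac: float = 0.8,
    seed: int = 42,  # ASSUMPTION: any fixed seed gives a reproducible split
) -> Tuple[List[int], List[int]]:
    """Shuffle problem indices deterministically and cut them into train/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(n * train_frac)  # 131 for n=164, train_frac=0.8
    return indices[:cut], indices[cut:]


train_idx, test_idx = split_indices()
print(len(train_idx), len(test_idx))
```

Using a dedicated random.Random instance rather than the module-level functions keeps the split stable even if other code reseeds the global generator.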
Compute Footprint
- RAM: < 1 GB (dataset is small, execution is in subprocess)
- CPU: < 5s per verification (subprocess with 10s timeout)
- GPU: Only needed for the inference server