feat: add code_debug community environment

RUFFY-369 2026-03-24 13:05:15 +05:30
parent c421582b6f
commit 590e8a1ef2
4 changed files with 930 additions and 0 deletions


# Code Debug Environment
An Atropos RL environment for training LLMs to debug and fix buggy Python code.
## Overview
This environment uses the [HumanEvalFix](https://huggingface.co/datasets/bigcode/humanevalfix-python) dataset, which contains 164 buggy Python functions with associated test suites. The model receives a buggy function and must output the corrected version inside `\boxed{}`. Scoring is done by executing the fixed code against the original test cases.
## Architecture
```
code_debug_env.py # Main env (extends BaseEnv)
code_executor.py # Safe subprocess execution with timeout
test_code_debug.py # Unit tests
README.md # This file
```
## Reward Design
| Outcome | Score | Description |
|---------|-------|-------------|
| All tests pass | **1.0** | Perfect fix |
| Partial improvement | **-0.5 to 0.9** | More tests pass than with the original buggy version |
| No improvement | **-0.5** | Code runs but doesn't fix anything |
| Compilation error / regression | **-1.0** | Fix is worse than the original |
When all rollouts in a group score 1.0, a **length penalty** is applied to encourage concise solutions (same pattern as `sql_query_env`).
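The table above can be sketched as a scoring function. This is an illustrative reconstruction, not the environment's actual code: the linear interpolation over the -0.5 to 0.9 band for partial fixes is an assumption, and `score_rollout` is a hypothetical name.

```python
def score_rollout(passed: int, total: int, baseline_passed: int, compiled: bool) -> float:
    """Map test results to a reward per the table above (illustrative sketch).

    baseline_passed = number of tests the original buggy function already passed.
    """
    if not compiled or passed < baseline_passed:
        return -1.0  # compilation error or regression: worse than the original
    if passed == total:
        return 1.0   # perfect fix
    if passed > baseline_passed:
        # Partial improvement: interpolate from -0.5 up toward 0.9 by the
        # fraction of previously-failing tests the fix now passes (assumed).
        frac = (passed - baseline_passed) / (total - baseline_passed)
        return -0.5 + 1.4 * frac
    return -0.5      # runs, but no improvement over the buggy version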
## Setup
```bash
# Install dependencies (datasets is the only extra)
pip install datasets
# Run tests
cd environments/community/code_debug_env
python -m pytest test_code_debug.py -v
```
## Usage
```bash
# Process mode (offline data generation)
python code_debug_env.py process \
--env.data_path_to_save_groups data/code_debug.jsonl \
--env.group_size 8 \
--openai.base_url http://localhost:8000/v1 \
--openai.model_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
# Serve mode (online RL training)
python code_debug_env.py serve \
--openai.base_url http://localhost:9001/v1 \
--openai.model_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
# Evaluate mode
python code_debug_env.py evaluate \
--openai.base_url http://localhost:8000/v1 \
--openai.model_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
```
## WandB Metrics
| Metric | Description |
|--------|-------------|
| `train/percent_correct` | Fraction of rollouts that pass all tests |
| `train/avg_score` | Average reward across rollouts |
| `train/partial_fix_rate` | Fraction of rollouts that partially fix the code |
| `eval/percent_correct` | Eval set accuracy |
## Dataset
- **Source**: [bigcode/humanevalfix-python](https://huggingface.co/datasets/bigcode/humanevalfix-python)
- **License**: Apache 2.0
- **Size**: 164 problems
- **Split**: 80% train / 20% test
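With only 164 problems, a seeded shuffle-and-slice is enough for the 80/20 split. A minimal sketch, assuming a plain list of problem records and a fixed seed (the environment's actual split logic may differ):

```python
import random

def split_dataset(problems: list, test_frac: float = 0.2, seed: int = 42):
    """Deterministic train/test split (illustrative; seed value is an assumption)."""
    idx = list(range(len(problems)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(problems) * test_frac)
    test_idx = set(idx[:n_test])
    train = [p for i, p in enumerate(problems) if i not in test_idx]
    test = [p for i, p in enumerate(problems) if i in test_idx]
    return train, test
```

For 164 problems this yields 132 train / 32 test examples; fixing the seed keeps the eval set stable across runs.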
## Compute Footprint
- **RAM**: < 1 GB (the dataset is small; execution runs in a subprocess)
- **CPU**: < 5 s per verification (subprocess with a 10 s timeout)
- **GPU**: Only needed for the inference server
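The subprocess verification described above can be sketched as follows. This is a simplified stand-in for `code_executor.py`, not its actual contents: the candidate fix and the test suite are written to a temporary file and run under a hard timeout. For untrusted model output in production you would also want a real sandbox (container, resource limits), which this sketch omits.

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(code: str, tests: str, timeout: float = 10.0) -> bool:
    """Run candidate code plus its test suite in a subprocess; True iff all pass."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,  # keep stdout/stderr out of the training logs
            timeout=timeout,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # infinite loops count as failures
    finally:
        os.unlink(path)
```

Because the tests are plain `assert` statements, a nonzero exit code (raised `AssertionError`, syntax error, or crash) marks the rollout as failing.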