feat: add code_debug community environment

2026-04-19 12:57:58 +00:00 · 2026-03-24 13:05:15 +05:30 · 590e8a1ef2
commit 590e8a1ef2
parent c421582b6f
4 changed files with 930 additions and 0 deletions
--- a/environments/community/code_debug_env/README.md
+++ b/environments/community/code_debug_env/README.md
@@ -0,0 +1,81 @@
+# Code Debug Environment
+
+An Atropos RL environment for training LLMs to debug and fix buggy Python code.
+
+## Overview
+
+This environment uses the [HumanEvalFix](https://huggingface.co/datasets/bigcode/humanevalfix-python) dataset, which contains 164 buggy Python functions with associated test suites. The model receives a buggy function and must output the corrected version inside `\boxed{}`. Each rollout is scored by executing the proposed fix against the problem's original test cases.
+
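A minimal sketch of how the corrected function might be pulled out of a `\boxed{}` answer. Python code often contains its own braces (dict literals, f-strings), so this version counts brace depth rather than using a non-greedy regex. The helper name `extract_boxed` is hypothetical and not taken from `code_debug_env.py`; it also assumes the last `\boxed{` in the completion is the final answer.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    r"""Return the contents of the last \boxed{...} in text, or None.

    Hypothetical helper: the real environment may parse answers differently.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None  # the model never produced a boxed answer

    body_start = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    for pos in range(body_start, len(text)):
        ch = text[pos]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close found; drop surrounding whitespace/newlines.
                return text[body_start:pos].strip()
    return None  # braces never balanced, treat as a malformed answer


completion = "Fix:\n\\boxed{\ndef f(x):\n    return {'a': x}\n}"
fixed_code = extract_boxed(completion)
```

Brace counting can still be fooled by an unmatched brace inside a string literal, so a production parser would likely need a fallback for that case.
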
+## Architecture
+
+```
+code_debug_env.py    # Main env (extends BaseEnv)
+code_executor.py     # Safe subprocess execution with timeout
+test_code_debug.py   # Unit tests
+README.md            # This file
+```
+
+## Reward Design
+
+| Outcome | Score | Description |
+|---------|-------|-------------|
+| All tests pass | **1.0** | Perfect fix |
+| Partial improvement | **-0.5 to 0.9** | More tests pass than with the buggy version, but not all |
+| No improvement | **-0.5** | Code runs but passes no more tests than the buggy version |
+| Compilation error / regression | **-1.0** | Code fails to compile, or the fix is worse than the original |
+
+When all rollouts in a group score 1.0, a **length penalty** is applied to encourage concise solutions (same pattern as `sql_query_env`).
+
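The tiers in the table above can be sketched as a single scoring function. The README does not say how the partial range is filled in, so the linear interpolation below is an assumption, as are the function name `score_fix` and its parameters; the length penalty is left out because its formula is not given here.

```python
def score_fix(passed: int, baseline_passed: int, total: int,
              compiled: bool = True) -> float:
    """Map test results to a reward following the README's tiers.

    passed          -- tests passed by the model's fix
    baseline_passed -- tests passed by the original buggy code
    total           -- number of tests for the problem
    All names are illustrative, not the environment's actual API.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not compiled or passed < baseline_passed:
        return -1.0  # compilation error or regression
    if passed == total:
        return 1.0   # perfect fix
    if passed == baseline_passed:
        return -0.5  # runs, but nothing was fixed
    # Partial improvement. ASSUMPTION: interpolate linearly from -0.5
    # toward 0.9 by the share of previously failing tests now passing.
    gained = passed - baseline_passed
    fixable = total - baseline_passed
    return -0.5 + 1.4 * gained / fixable


partial = score_fix(passed=7, baseline_passed=4, total=10)
```

Under this assumption a partial fix always scores strictly between -0.5 and 0.9, which keeps the gap between "almost fixed" and "all tests pass".
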
+## Setup
+
+```bash
+# Install dependencies (datasets is the only extra)
+pip install datasets
+
+# Run tests
+cd environments/community/code_debug_env
+python -m pytest test_code_debug.py -v
+```
+
+## Usage
+
+```bash
+# Process mode (offline data generation)
+python code_debug_env.py process \
+    --env.data_path_to_save_groups data/code_debug.jsonl \
+    --env.group_size 8 \
+    --openai.base_url http://localhost:8000/v1 \
+    --openai.model_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
+
+# Serve mode (online RL training)
+python code_debug_env.py serve \
+    --openai.base_url http://localhost:9001/v1 \
+    --openai.model_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
+
+# Evaluate mode
+python code_debug_env.py evaluate \
+    --openai.base_url http://localhost:8000/v1 \
+    --openai.model_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
+```
+
+## WandB Metrics
+
+| Metric | Description |
+|--------|-------------|
+| `train/percent_correct` | Fraction of rollouts that pass all tests |
+| `train/avg_score` | Average reward across rollouts |
+| `train/partial_fix_rate` | Fraction of rollouts that partially fix the code |
+| `eval/percent_correct` | Eval set accuracy |
+
+## Dataset
+
+- **Source**: [bigcode/humanevalfix-python](https://huggingface.co/datasets/bigcode/humanevalfix-python)
+- **License**: Apache 2.0
+- **Size**: 164 problems
+- **Split**: 80% train / 20% test
+
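For illustration, an 80/20 split over 164 problems could be produced as follows. The seed, the shuffling, and the truncating rounding are assumptions; the environment may instead use the `datasets` library's own splitting utilities or a different seed.

```python
import random
from typing import List, Tuple


def split_indices(n: int, train_frac: float = 0.8,
                  seed: int = 42) -> Tuple[List[int], List[int]]:
    """Shuffle problem indices deterministically and cut at train_frac."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split stable
    cut = int(n * train_frac)             # truncates: 164 * 0.8 -> 131
    return indices[:cut], indices[cut:]


train_idx, test_idx = split_indices(164)
```

With truncating rounding this yields 131 training and 33 test problems.
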
+## Compute Footprint
+
+- **RAM**: < 1 GB (dataset is small, execution is in subprocess)
+- **CPU**: typically < 5 s per verification (each run is a subprocess capped by a 10 s timeout)
+- **GPU**: Only needed for the inference server
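
A minimal sketch of subprocess execution with a timeout, in the spirit of what `code_executor.py` is described as doing. The function name `run_with_timeout` and its return shape are hypothetical. Note that a timeout only bounds runtime: it does not sandbox file or network access.

```python
import subprocess
import sys
from typing import Tuple


def run_with_timeout(code: str, timeout_s: float = 10.0) -> Tuple[bool, str]:
    """Run code in a fresh Python interpreter; return (succeeded, stderr)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],  # same interpreter as the caller
            capture_output=True,
            text=True,
            timeout=timeout_s,             # child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    # A failing assert exits non-zero and leaves its traceback on stderr.
    return proc.returncode == 0, proc.stderr


ok_pass, _ = run_with_timeout("assert 1 + 1 == 2")
ok_fail, err_fail = run_with_timeout("assert 1 + 1 == 3")
ok_loop, err_loop = run_with_timeout("while True: pass", timeout_s=1.0)
```

Running the candidate fix together with its tests in a child process keeps infinite loops and crashes from taking down the environment worker.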