# Code Debug Environment

An Atropos RL environment for training LLMs to debug and fix buggy Python code.

## Overview

This environment uses the [HumanEvalPack](https://huggingface.co/datasets/bigcode/humanevalpack) dataset (Python subset, HumanEvalFix task), which contains 164 buggy Python functions with associated test suites. The model receives a buggy function and must output the corrected version inside `\boxed{}`. Scoring is done by executing the fixed code against the original test cases.

## Architecture

```
code_debug_env.py      # Main env (extends BaseEnv)
code_executor.py       # Safe subprocess execution with timeout
test_code_debug.py     # Unit tests
README.md              # This file
```

## Reward Design

| Outcome | Score | Description |
|---------|-------|-------------|
| All tests pass | **1.0** | Perfect fix |
| Partial improvement | **-0.5 to 0.9** | More tests pass than buggy version |
| No improvement | **-0.5** | Code runs but doesn't fix anything |
| Compilation error / regression | **-1.0** | Fix is worse than the original |

When all rollouts in a group score 1.0, a **length penalty** is applied to encourage concise solutions (same pattern as `sql_query_env`).
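The reward table can be read as a small piecewise function. The sketch below is illustrative only: the function name `score_fix`, its parameters, and the linear interpolation used for the partial-improvement band are assumptions, not the code in `code_debug_env.py`. It also leaves out the length penalty.

```python
def score_fix(passed_fixed: int, passed_buggy: int, total: int,
              compiled: bool = True) -> float:
    """Hypothetical mapping from test counts to the reward table above."""
    if total <= 0:
        raise ValueError("total must be a positive number of tests")
    # Compilation error or regression: the fix is worse than the original.
    if not compiled or passed_fixed < passed_buggy:
        return -1.0
    # Every test passes: perfect fix.
    if passed_fixed == total:
        return 1.0
    # Code runs but passes no more tests than the buggy version.
    if passed_fixed == passed_buggy:
        return -0.5
    # Partial improvement. ASSUMPTION: interpolate linearly between -0.5
    # and 0.9 by the fraction of previously failing tests that now pass.
    gain = (passed_fixed - passed_buggy) / (total - passed_buggy)
    return -0.5 + 1.4 * gain


print(score_fix(passed_fixed=7, passed_buggy=4, total=10))
```

Under this assumed interpolation, a partial fix always scores strictly between -0.5 and 0.9, so it never ties with either "no improvement" or a perfect fix.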
## Setup

```bash
# Install dependencies (datasets is the only extra)
pip install datasets

# Run tests
cd environments/community/code_debug_env
python -m pytest test_code_debug.py -v
```

## Usage

```bash
# Process mode (offline data generation)
python code_debug_env.py process \
    --env.data_path_to_save_groups data/code_debug.jsonl \
    --env.group_size 8 \
    --openai.base_url http://localhost:8000/v1 \
    --openai.model_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview"

# Serve mode (online RL training)
python code_debug_env.py serve \
    --openai.base_url http://localhost:9001/v1 \
    --openai.model_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview"

# Evaluate mode
python code_debug_env.py evaluate \
    --openai.base_url http://localhost:8000/v1 \
    --openai.model_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
```

## WandB Metrics

| Metric | Description |
|--------|-------------|
| `train/percent_correct` | Fraction of rollouts that pass all tests |
| `train/avg_score` | Average reward across rollouts |
| `train/partial_fix_rate` | Fraction of rollouts that partially fix the code |
| `eval/percent_correct` | Eval set accuracy |

## Dataset

- **Source**: [bigcode/humanevalpack](https://huggingface.co/datasets/bigcode/humanevalpack) (Python subset)
- **License**: Apache 2.0
- **Size**: 164 problems
- **Split**: 80% train / 20% test

## Compute Footprint

- **RAM**: < 1 GB (dataset is small, execution is in subprocess)
- **CPU**: < 5s per verification (subprocess with 10s timeout)
- **GPU**: Only needed for the inference server
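
Verification in a subprocess with a 10s timeout might look roughly like the sketch below. It is not the contents of `code_executor.py`: the helper name `run_with_timeout` and its return shape are hypothetical, and it relies only on the standard library's `subprocess.run`. A plain subprocess bounds runtime but is not a security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_timeout(code: str, test_code: str,
                     timeout_s: float = 10.0) -> tuple[bool, str]:
    """Hypothetical helper: run candidate code plus its tests in a child process.

    Returns (passed, detail), where detail is captured stderr or "timeout".
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        # Concatenate the candidate function and its assertions into one script.
        script.write_text(code + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,  # child is killed if it runs longer than this
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False, "timeout"
        # A non-zero exit code means an exception or a failed assertion.
        return proc.returncode == 0, proc.stderr


passed, detail = run_with_timeout(
    "def add(a, b):\n    return a + b",
    "assert add(1, 2) == 3",
)
print(passed)
```

Running each candidate in its own process keeps infinite loops and crashes in model-generated code from taking down the environment itself.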