mirror of
https://github.com/NousResearch/atropos.git
synced 2026-04-19 12:57:58 +00:00
feat: add code_debug community environment
parent
c421582b6f
commit
590e8a1ef2
4 changed files with 930 additions and 0 deletions
81
environments/community/code_debug_env/README.md
Normal file
@@ -0,0 +1,81 @@
# Code Debug Environment

An Atropos RL environment for training LLMs to debug and fix buggy Python code.

## Overview

This environment uses the [HumanEvalFix](https://huggingface.co/datasets/bigcode/humanevalfix-python) dataset, which contains 164 buggy Python functions with associated test suites. The model receives a buggy function and must return the corrected version inside `\boxed{}`. Each response is scored by executing the fixed code against the original test cases.
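
Because the fix is returned inside `\boxed{}` and Python code can itself contain braces (dict literals, f-strings), a plain regex is likely to cut the answer short. The following is a minimal sketch of brace-balanced extraction; `extract_boxed` is a hypothetical helper written for illustration and is not taken from `code_debug_env.py`.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the body of the last boxed answer in `text`, or None.

    Hypothetical helper: the real environment may parse responses differently.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None  # the model never produced a boxed answer
    body_start = start + len(marker)
    depth = 1  # we are already inside the opening brace
    for pos in range(body_start, len(text)):
        ch = text[pos]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[body_start:pos].strip()
    return None  # braces never balanced, treat as no answer


response = "Here is the fix: \\boxed{def f(x):\n    return {x: 1}}"
print(extract_boxed(response))
```

One known limitation of this sketch: an unbalanced brace inside a string literal in the submitted code would confuse the depth counter.
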

## Architecture

```
code_debug_env.py       # Main env (extends BaseEnv)
code_executor.py        # Safe subprocess execution with timeout
test_code_debug.py      # Unit tests
README.md               # This file
```
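
To illustrate what "subprocess execution with timeout" can look like, here is a minimal sketch. The function name `run_with_timeout` and its return shape are assumptions made for this example and may not match `code_executor.py`; the 10-second default mirrors the timeout quoted in this README.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Tuple


def run_with_timeout(code: str, tests: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Run candidate code followed by its tests in a fresh interpreter.

    Hypothetical helper. Returns (passed, detail), where detail is the
    captured stderr or the string "timeout".
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code + "\n\n" + tests + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout,  # the child is killed if it runs too long
                cwd=tmp,          # keep stray files inside the temp directory
            )
        except subprocess.TimeoutExpired:
            return False, "timeout"
    # A zero exit code means every assertion in `tests` held.
    return proc.returncode == 0, proc.stderr


ok, detail = run_with_timeout("def add(a, b):\n    return a + b", "assert add(1, 2) == 3")
print(ok)
```

A separate process with a timeout protects the trainer from crashes and infinite loops, but it is not a sandbox on its own; untrusted code can still touch the filesystem or network unless further isolation is added.
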

## Reward Design

| Outcome | Score | Description |
|---------|-------|-------------|
| All tests pass | **1.0** | Perfect fix |
| Partial improvement | **-0.5 to 0.9** | More tests pass than buggy version |
| No improvement | **-0.5** | Code runs but doesn't fix anything |
| Compilation error / regression | **-1.0** | Fix is worse than the original |

When all rollouts in a group score 1.0, a **length penalty** is applied to encourage concise solutions (same pattern as `sql_query_env`).
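
The reward bands in the table can be sketched as a single function. The README does not say how a partial improvement is mapped into its range, so the linear interpolation below is purely an assumption, as are the name `score_fix` and its parameters. The length penalty is not modeled here.

```python
def score_fix(passed: int, total: int, baseline_passed: int, compiled: bool = True) -> float:
    """Map test results onto the reward bands from the table.

    passed:          tests passed by the model's fix
    total:           number of tests for the problem
    baseline_passed: tests passed by the original buggy code
    """
    if not compiled or passed < baseline_passed:
        return -1.0  # compilation error or regression
    if passed == total:
        return 1.0  # perfect fix
    if passed == baseline_passed:
        return -0.5  # runs, but fixes nothing
    # Partial improvement. ASSUMPTION: interpolate linearly from -0.5
    # towards 0.9 by the share of previously failing tests now passing.
    gained = passed - baseline_passed
    possible = total - baseline_passed
    return -0.5 + 1.4 * gained / possible


print(score_fix(passed=10, total=10, baseline_passed=4))  # 1.0
print(score_fix(passed=4, total=10, baseline_passed=4))   # -0.5
print(score_fix(passed=3, total=10, baseline_passed=4))   # -1.0
```
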

## Setup

```bash
# Install dependencies (datasets is the only extra)
pip install datasets

# Run tests
cd environments/community/code_debug_env
python -m pytest test_code_debug.py -v
```

## Usage

```bash
# Process mode (offline data generation)
python code_debug_env.py process \
    --env.data_path_to_save_groups data/code_debug.jsonl \
    --env.group_size 8 \
    --openai.base_url http://localhost:8000/v1 \
    --openai.model_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview"

# Serve mode (online RL training)
python code_debug_env.py serve \
    --openai.base_url http://localhost:9001/v1 \
    --openai.model_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview"

# Evaluate mode
python code_debug_env.py evaluate \
    --openai.base_url http://localhost:8000/v1 \
    --openai.model_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
```

## WandB Metrics

| Metric | Description |
|--------|-------------|
| `train/percent_correct` | Fraction of rollouts that pass all tests |
| `train/avg_score` | Average reward across rollouts |
| `train/partial_fix_rate` | Fraction of rollouts that partially fix the code |
| `eval/percent_correct` | Eval set accuracy |
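
As a rough illustration of how the training metrics relate to per-rollout rewards, the sketch below aggregates a list of raw scores (before any length penalty). The function `summarize_scores` is hypothetical, and treating any score strictly between -0.5 and 1.0 as a partial fix is an assumption based on the reward table, not on the environment's code.

```python
from typing import Dict, List


def summarize_scores(scores: List[float]) -> Dict[str, float]:
    """Aggregate raw rollout scores into the train/* metrics (hypothetical)."""
    n = len(scores)
    if n == 0:
        return {
            "train/percent_correct": 0.0,
            "train/avg_score": 0.0,
            "train/partial_fix_rate": 0.0,
        }
    correct = sum(1 for s in scores if s == 1.0)
    # ASSUMPTION: a partial fix is any score strictly between the
    # "no improvement" value (-0.5) and a perfect score (1.0).
    partial = sum(1 for s in scores if -0.5 < s < 1.0)
    return {
        "train/percent_correct": correct / n,
        "train/avg_score": sum(scores) / n,
        "train/partial_fix_rate": partial / n,
    }


metrics = summarize_scores([1.0, 1.0, 0.2, -0.5, -1.0, -1.0, 0.9, 1.0])
print(metrics)
```
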

## Dataset

- **Source**: [bigcode/humanevalfix-python](https://huggingface.co/datasets/bigcode/humanevalfix-python)
- **License**: Apache 2.0
- **Size**: 164 problems
- **Split**: 80% train / 20% test
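
For a sense of what the 80/20 split means with 164 problems, here is a small sketch that partitions problem indices. The shuffle, the seed value, and the name `split_indices` are assumptions for illustration; the environment may split the dataset differently (for example, without shuffling).

```python
import random
from typing import List, Tuple


def split_indices(n: int, train_fraction: float = 0.8, seed: int = 42) -> Tuple[List[int], List[int]]:
    """Shuffle 0..n-1 deterministically and cut it into train/test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    cut = int(n * train_fraction)
    return indices[:cut], indices[cut:]


train_idx, test_idx = split_indices(164)
print(len(train_idx), len(test_idx))  # 131 33
```

With 164 problems this leaves 131 for training and 33 for evaluation, so eval accuracy moves in steps of roughly 3 percentage points per problem.
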

## Compute Footprint

- **RAM**: < 1 GB (the dataset is small and code runs in a subprocess)
- **CPU**: < 5 s per verification (each run is a subprocess with a 10 s timeout)
- **GPU**: Only needed for the inference server