diff --git a/environments/community/mcp_tool_calling/GRPO_README.md b/environments/community/mcp_tool_calling/GRPO_README.md
index 0ccd7a3d..85b2fbc5 100644
--- a/environments/community/mcp_tool_calling/GRPO_README.md
+++ b/environments/community/mcp_tool_calling/GRPO_README.md
@@ -1,5 +1,108 @@
 # GRPO Example Trainer
+This guide explains how to run the `example_trainer` integration with Atropos using GRPO.
+
+The trainer is a reference implementation of the end-to-end wiring (`environment -> run-api -> rollout server -> optimizer`) and supports multiple weight-synchronization modes with vLLM.
+
+## Supported Modes
+
+- `shared_vllm`: single-copy training via CUDA IPC (the trainer updates shared vLLM tensors in place)
+- `lora_only`: LoRA adapter training with HTTP hot-swap (slow due to eager mode)
+- `lora_restart`: LoRA adapter training with periodic vLLM restarts (faster than `lora_only`)
+- `none`: legacy full-checkpoint flow with vLLM reloads
+
+## Prerequisites
+
+1. Python 3.10+
+2. A CUDA-capable PyTorch environment for GPU training
+3. An available Atropos API server (`run-api`)
+4. An environment process producing trajectories (for example, the GSM8K server)
+
+## Installation
+
+From the repository root:
+
+```bash
+pip install -e ".[example_trainer]"
+```
+
+Optional (all extras):
+
+```bash
+pip install -e ".[all]"
+```
+
+## CLI Entry Points
+
+After installation, you can use either module invocation or the script entry points:
+
+- `python -m example_trainer.grpo` or `atropos-grpo`
+- `python -m example_trainer.run` or `atropos-grpo-run`
+
+## Minimal End-to-End Startup
+
+### 1) Start the Atropos API
+
+```bash
+run-api --port 8002
+```
+
+### 2) Start an environment
+
+```bash
+python environments/gsm8k_server.py serve \
+  --env.rollout_server_url "http://localhost:8002" \
+  --openai.server_type vllm \
+  --openai.base_url "http://localhost:9001/v1" \
+  --openai.api_key "dummy"
+```
+
+### 3) Start the vLLM server (shared-weights example)
+
+```bash
+VLLM_ENABLE_SHARED_WEIGHTS=1 LOGDIR=/tmp/grpo_training \
+python -m example_trainer.vllm_api_server \
+  --model Qwen/Qwen3-1.7B-Base \
+  --port 9001 \
+  --gpu-memory-utilization 0.45 \
+  --enforce-eager
+```
+
+### 4) Start the trainer
+
+```bash
+atropos-grpo \
+  --model-name Qwen/Qwen3-1.7B-Base \
+  --weight-bridge-mode shared_vllm \
+  --vllm-port 9001 \
+  --vllm-config-path /tmp/grpo_training/vllm_bridge_config.json \
+  --atropos-url "http://localhost:8002" \
+  --batch-size 1 \
+  --gradient-accumulation-steps 64 \
+  --warmup-steps 5 \
+  --training-steps 30 \
+  --kl-coef 0.0 \
+  --clip-eps 0.2
+```
+
+## Objective Notes
+
+- GRPO uses rollout/inference logprobs (`pi_old`) for the importance-ratio computation.
+- The optional KL-like term is sampled-token regularization against the rollout policy's logprobs, not a KL against a separate frozen reference model.
+
+## Outputs
+
+- Trainer logs to stdout (and optionally to W&B if enabled)
+- Checkpoints under `--save-path`
+- Mode-specific logs/checkpoints when using the matrix/orchestration scripts
+
+## Troubleshooting
+
+- If vLLM health checks time out, inspect `vllm.log`, `trainer.log`, and `env.log`.
+- If targeted shared-layer runs lose gradients, ensure non-reentrant checkpointing is enabled in shared mode.
+- If environment workers time out at 600s, reduce env concurrency (`--env.max_num_workers_per_node`) and batch pressure.
+# GRPO Example Trainer
+
 This directory contains an example script (`grpo.py`) demonstrating how to integrate a custom training loop with the Atropos API for reinforcement learning using the GRPO (Group Relative Policy Optimization) algorithm.
 
 **Note: Example trainer does not support multimodal training out of the box. As other trainers add support for Atropos, we will list them in the main readme, some of which may support multimodal RL - please check the main repo readme for any updates.**
@@ -68,7 +171,7 @@ Once the prerequisites are met and configuration is set:
 
 ```bash
 # Install dependencies
-pip install -r example_trainer/requirements.txt
+pip install -e ".[example_trainer]"
 
 # Run the trainer directly (basic test)
 python example_trainer/grpo.py
diff --git a/example_trainer/README.md b/example_trainer/README.md
index 6dd5e867..ddb9abc7 100644
--- a/example_trainer/README.md
+++ b/example_trainer/README.md
@@ -107,7 +107,9 @@ The `lora_only` mode requires `--enforce-eager` which **disables CUDA graphs**,
 ## Quick Start: LoRA Training (Recommended)
 
 ### Step 1: Install Dependencies
-- They are listed in the requirements.txt file that you can see
+- Install from `pyproject.toml` extras:
+  - `pip install -e ".[example_trainer]"`
+  - or everything: `pip install -e ".[all]"`
 
 ### Step 2: Start All Components
 
diff --git a/example_trainer/requirements.txt b/example_trainer/requirements.txt
deleted file mode 100644
index 0313a941..00000000
--- a/example_trainer/requirements.txt
+++ /dev/null
@@ -1,8 +0,0 @@
-vllm
-torch
-transformers
-datasets
-accelerate
-peft
-requests
-wandb
diff --git a/llms.txt b/llms.txt
index ab8a2d10..b2e51132 100644
--- a/llms.txt
+++ b/llms.txt
@@ -299,7 +299,7 @@ The `example_trainer/` directory provides `grpo.py`, a script demonstrating inte
 
 1. Python 3.8+ (Python 3.10+ recommended for Atropos overall).
 2. Running Atropos API server (default: `http://localhost:8000`). Accessible via `run-api`.
-3. Required Python packages: `torch`, `transformers`, `vllm`, `pydantic`, `numpy`, `requests`, `tenacity`, `wandb` (optional). Install via `pip install -r example_trainer/requirements.txt` or `pip install -e .[examples]`.
+3. Required Python packages: `torch`, `transformers`, `vllm`, `pydantic`, `numpy`, `requests`, `tenacity`, `wandb` (optional). Install via `pip install -e ".[example_trainer]"` (or `pip install -e ".[all]"`).
 4. A running Atropos environment (e.g., `python environments/gsm8k_server.py serve --slurm False`).
 
 ### 9.2. Setup
diff --git a/pyproject.toml b/pyproject.toml
index eaecf363..d291021b 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -40,11 +40,18 @@ atropos-grpo-run = "example_trainer.run:main"
 [project.optional-dependencies]
 all = [
-    "atroposlib[dev,examples]"
+    "atroposlib[dev,examples,example_trainer]"
 ]
 rewardfns = [
     "torch"
 ]
+example_trainer = [
+    "atroposlib[rewardfns]",
+    "vllm",
+    "accelerate",
+    "peft",
+    "requests",
+]
 dev = [
     "pytest",
     "pytest-asyncio",
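As a rough illustration of the "Objective Notes" added by this diff (a minimal sketch, not code from `example_trainer` — the function name, tensor shapes, and the k3-style estimator are assumptions), a clipped-surrogate GRPO loss with a sampled-token KL-like term against the rollout policy could look like:

```python
import torch


def grpo_loss(logp_new, logp_old, advantages, mask, clip_eps=0.2, kl_coef=0.0):
    """Per-token clipped surrogate; `logp_old` are rollout/inference logprobs."""
    ratio = torch.exp(logp_new - logp_old)  # importance ratio pi / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    pg_loss = -torch.min(unclipped, clipped)
    # Optional KL-like term: sampled-token regularization toward the rollout
    # policy (k3-style estimator), not a KL against a frozen reference model.
    log_r = logp_old - logp_new
    kl_like = torch.exp(log_r) - log_r - 1.0
    per_token = pg_loss + kl_coef * kl_like
    return (per_token * mask).sum() / mask.sum().clamp_min(1)
```

With `--kl-coef 0.0`, as in the trainer invocation above, the regularizer vanishes and only the clipped policy-gradient term remains.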