atropos/environments/community/mcp_tool_calling/GRPO_README.md
Jai Suphavadeeprasit 585244559e more readme changes
2026-03-02 11:18:52 -05:00


# GRPO Example Trainer
This guide explains how to run the `example_trainer` integration with Atropos using GRPO.
The trainer is a reference implementation for end-to-end wiring (`environment -> run-api -> rollout server -> optimizer`) and supports multiple weight-synchronization modes with vLLM.
## Supported Modes
- `shared_vllm`: single-copy training via CUDA IPC (trainer updates shared vLLM tensors in place)
- `lora_only`: LoRA adapter training with HTTP hot-swap (slow due to eager mode)
- `lora_restart`: LoRA adapter training with periodic vLLM restart (faster than `lora_only`)
- `none`: legacy full-checkpoint flow with vLLM reloads
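The mode names above map onto distinct weight-sync strategies. As a rough mental model (this is an illustrative sketch, not the `example_trainer` codebase's real API; the function and table names are hypothetical), a trainer dispatching on the `--weight-bridge-mode` flag might look like:

```python
# Hypothetical sketch of dispatching on the weight-bridge mode flag.
# The mode strings match the README; everything else is illustrative.
SYNC_MODES = {
    "shared_vllm": "update shared vLLM tensors in place over CUDA IPC",
    "lora_only": "hot-swap LoRA adapters over HTTP",
    "lora_restart": "swap LoRA adapters, restarting vLLM periodically",
    "none": "write full checkpoints and trigger vLLM reloads",
}

def select_sync_mode(mode: str) -> str:
    """Validate a weight-bridge mode string and describe its strategy."""
    if mode not in SYNC_MODES:
        raise ValueError(f"unknown weight-bridge mode: {mode!r}")
    return SYNC_MODES[mode]
```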
## Prerequisites
1. Python 3.10+
2. CUDA-capable PyTorch environment for GPU training
3. Atropos API server available (`run-api`)
4. An environment process producing trajectories (for example, the GSM8K server)
## Installation
From repository root:
```bash
pip install -e ".[example_trainer]"
```
Optional (all extras):
```bash
pip install -e ".[all]"
```
## CLI Entry Points
After installation, you can use either module invocation or the installed script entry points:
- `python -m example_trainer.grpo` or `atropos-grpo`
- `python -m example_trainer.run` or `atropos-grpo-run`
## Minimal End-to-End Startup
### 1) Start Atropos API
```bash
run-api --port 8002
```
### 2) Start an environment
```bash
python environments/gsm8k_server.py serve \
  --env.rollout_server_url "http://localhost:8002" \
  --openai.server_type vllm \
  --openai.base_url "http://localhost:9001/v1" \
  --openai.api_key "dummy"
```
### 3) Start vLLM server (shared-weights example)
```bash
VLLM_ENABLE_SHARED_WEIGHTS=1 LOGDIR=/tmp/grpo_training \
  python -m example_trainer.vllm_api_server \
  --model Qwen/Qwen3-1.7B-Base \
  --port 9001 \
  --gpu-memory-utilization 0.45 \
  --enforce-eager
```
### 4) Start trainer
```bash
atropos-grpo \
  --model-name Qwen/Qwen3-1.7B-Base \
  --weight-bridge-mode shared_vllm \
  --vllm-port 9001 \
  --vllm-config-path /tmp/grpo_training/vllm_bridge_config.json \
  --atropos-url "http://localhost:8002" \
  --batch-size 1 \
  --gradient-accumulation-steps 64 \
  --warmup-steps 5 \
  --training-steps 30 \
  --clip-eps 0.2
```
## Objective Notes
- GRPO uses rollout `inference_logprobs` for importance-ratio computation.
- The trainer currently uses clipped importance-ratio updates without a separate frozen-reference-model KL term.
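The clipped importance-ratio update can be sketched as follows. This is a minimal stand-alone illustration of the objective shape, not the trainer's actual code: it works on plain per-token log-prob lists, and `clip_eps` corresponds to the `--clip-eps` flag above.

```python
import math

def grpo_clipped_loss(new_logprobs, inference_logprobs, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate loss over a flat list of tokens.

    new_logprobs:       token log-probs under the current policy
    inference_logprobs: token log-probs recorded at rollout time
    advantages:         per-token advantages
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logprobs, inference_logprobs, advantages):
        ratio = math.exp(lp_new - lp_old)  # importance ratio pi_new / pi_old
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        # Pessimistic (minimum) surrogate, negated to form a loss to minimize.
        total += -min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With identical log-probs the ratio is 1 and the loss reduces to the negated mean advantage; when the ratio drifts outside `[1 - clip_eps, 1 + clip_eps]`, clipping caps the incentive to move further.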
## Outputs
- Trainer logs to stdout (and optional W&B if enabled)
- Checkpoints under `--save-path`
- Mode-specific logs/checkpoints when using matrix/orchestration scripts
## Troubleshooting
- If vLLM health checks time out, inspect `vllm.log`, `trainer.log`, and `env.log`.
- If targeted shared-layer runs lose gradients, ensure non-reentrant checkpointing is enabled in shared mode.
- If environment workers time out at 600s, reduce env concurrency (`--env.max_num_workers_per_node`) and batch pressure.
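When debugging health-check timeouts, it can help to poll the server yourself rather than relying on the trainer's internal timeout. A minimal polling helper is sketched below; the `/health` path and port are assumptions based on the vLLM setup above, and the `probe` parameter exists only so the loop is testable without a live server.

```python
import time
import urllib.request
import urllib.error

def wait_for_health(url="http://localhost:9001/health", timeout=120.0,
                    interval=2.0, probe=None):
    """Poll a health URL until it answers HTTP 200 or `timeout` elapses.

    Returns True on success, False on timeout. `probe` is injectable
    for testing; by default it issues a real HTTP GET.
    """
    if probe is None:
        def probe(u):
            try:
                with urllib.request.urlopen(u, timeout=5) as resp:
                    return resp.status == 200
            except (urllib.error.URLError, OSError):
                return False
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval)
    return False
```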