mirror of
https://github.com/NousResearch/atropos.git
synced 2026-04-19 12:57:58 +00:00
2.8 KiB
2.8 KiB
GRPO Example Trainer
This guide explains how to run the example_trainer integration with Atropos using GRPO.
The trainer is a reference implementation for end-to-end wiring (environment -> run-api -> rollout server -> optimizer), with multiple synchronization modes with vLLM.
Supported Modes
shared_vllm: single-copy training via CUDA IPC (trainer updates shared vLLM tensors in place)lora_only: LoRA adapter training with HTTP hot-swap (slow due to eager mode)lora_restart: LoRA adapter training with periodic vLLM restart (faster thanlora_only)none: legacy full-checkpoint flow with vLLM reloads
Prerequisites
- Python 3.10+
- CUDA-capable PyTorch environment for GPU training
- Atropos API server available (
run-api) - An environment process producing trajectories (for example GSM8K server)
Installation
From repository root:
pip install -e ".[example_trainer]"
Optional (all extras):
pip install -e ".[all]"
CLI Entry Points
After install, you can use either module invocation or script entrypoints:
python -m example_trainer.grpooratropos-grpopython -m example_trainer.runoratropos-grpo-run
Minimal End-to-End Startup
1) Start Atropos API
run-api --port 8002
2) Start an environment
python environments/gsm8k_server.py serve \
--env.rollout_server_url "http://localhost:8002" \
--openai.server_type vllm \
--openai.base_url "http://localhost:9001/v1" \
--openai.api_key "dummy"
3) Start vLLM server (shared-weights example)
VLLM_ENABLE_SHARED_WEIGHTS=1 LOGDIR=/tmp/grpo_training \
python -m example_trainer.vllm_api_server \
--model Qwen/Qwen3-1.7B-Base \
--port 9001 \
--gpu-memory-utilization 0.45 \
--enforce-eager
4) Start trainer
atropos-grpo \
--model-name Qwen/Qwen3-1.7B-Base \
--weight-bridge-mode shared_vllm \
--vllm-port 9001 \
--vllm-config-path /tmp/grpo_training/vllm_bridge_config.json \
--atropos-url "http://localhost:8002" \
--batch-size 1 \
--gradient-accumulation-steps 64 \
--warmup-steps 5 \
--training-steps 30 \
--clip-eps 0.2
Objective Notes
- GRPO uses rollout
inference_logprobsfor importance-ratio computation. - The trainer currently uses clipped importance-ratio updates without a separate frozen-reference-model KL term.
Outputs
- Trainer logs to stdout (and optional W&B if enabled)
- Checkpoints under
--save-path - Mode-specific logs/checkpoints when using matrix/orchestration scripts
Troubleshooting
- If vLLM health checks time out, inspect
vllm.log,trainer.log, andenv.log. - If targeted shared-layer runs lose gradients, ensure non-reentrant checkpointing is enabled in shared mode.
- If environment workers time out at 600s, reduce env concurrency (
--env.max_num_workers_per_node) and batch pressure.