# GRPO Trainer A modular training framework for fine-tuning language models with **Group Relative Policy Optimization (GRPO)**, designed to work with the Atropos environment system. ## 📁 Module Structure ``` example_trainer/ ├── grpo.py # CLI entry point (dispatches to trainers) ├── config.py # TrainingConfig dataclass ├── api.py # Atropos API communication ├── data.py # Data fetching & preprocessing ├── model.py # Model loading & CUDA IPC shared memory ├── training.py # Loss computation & training step ├── checkpointing.py # Save models & LoRA adapters ├── vllm_manager.py # vLLM process management ├── trainers.py # Training mode implementations ├── cli.py # CLI argument parsing ├── vllm_api_server.py # Custom vLLM server with IPC support ├── vllm_patching/ # B200/Blackwell GPU patches │ └── patched_gpu_runner.py └── scripts/ # Helper scripts ├── run_comparison.sh ├── run_concurrent_tests.sh ├── test_lora_mode.sh └── test_single_copy_mode.sh ``` --- ## 🔄 Full System Architecture The Atropos training system consists of 4 components that must run together: ``` ┌─────────────────────────────────────────────────────────────────────────────┐ │ ATROPOS TRAINING SYSTEM │ └─────────────────────────────────────────────────────────────────────────────┘ ┌─────────────┐ ┌──────────────────┐ ┌─────────────────┐ │ vLLM │◄────►│ Environment │─────►│ run-api │ │ Server │ │ (gsm8k_server) │ │ (Trajectory │ │ (Inference)│ │ (Process Env) │ │ Handler API) │ └─────────────┘ └──────────────────┘ └────────┬────────┘ ▲ │ │ │ │ ┌───────────────────────────────────┘ │ │ │ ▼ │ ┌─────────────┐ └────────│ GRPO │ │ Trainer │ │ (grpo.py) │ └─────────────┘ Data Flow: 1. run-api : Central API that receives trajectories and serves batches 2. Environment : Generates prompts, calls vLLM, scores responses → sends to run-api 3. Trainer : Fetches batches from run-api → trains model → updates weights 4. vLLM : Serves inference for environment (and gets weight updates) ``` ### Components Explained | Component | Command | Port | Purpose | |-----------|---------|------|---------| | **run-api** | `run-api` | 8000 | Central trajectory handler API | | **Environment** | `gsm8k_server.py serve` | (internal) | Generates rollouts, scores them | | **vLLM** | `vllm_api_server.py` | 9001 | Model inference | | **Trainer** | `grpo.py` | (client) | Fetches batches, trains model | --- ## 🎯 Three Training Modes | Mode | Description | vLLM Setup | Best For | |------|-------------|------------|----------| | **Legacy** (`none`) | Trainer manages vLLM, restarts with new checkpoints | Auto-managed | Simple setup, different GPUs | | **Shared vLLM** (`shared_vllm`) | Single-copy mode via CUDA IPC - no model duplication! | External, `VLLM_ENABLE_SHARED_WEIGHTS=1` | Same GPU, max efficiency | | **LoRA** (`lora_only`) | Train adapters only, hot-swap in vLLM | External, `--enable-lora` | Fast training, small checkpoints | --- ## 🚀 Quick Start ### Prerequisites ```bash # Install dependencies pip install torch transformers peft vllm wandb requests tenacity pydantic # Set environment variables export LOGDIR=/tmp/atropos_test export MODEL=Qwen/Qwen2.5-3B-Instruct mkdir -p $LOGDIR ``` --- ## 📖 Detailed Usage for Each Mode ### Mode 1: Legacy (Checkpoint + Restart) The simplest mode. Trainer manages vLLM internally. ```bash # Terminal 1: Start the central API server (handles trajectories) run-api --port 8000 # Terminal 2: Start the environment server (generates rollouts) python -u environments/gsm8k_server.py serve \ --env.tokenizer_name $MODEL \ --env.use_wandb=False \ --openai.model_name $MODEL \ --openai.base_url http://localhost:9001/v1 \ --openai.server_type vllm # Terminal 3: Run training (trainer will launch its own vLLM) CUDA_VISIBLE_DEVICES=0 python -m example_trainer.grpo \ --model-name $MODEL \ --weight-bridge-mode none \ --vllm-port 9001 \ --atropos-url http://localhost:8000 \ --training-steps 20 \ --batch-size 2 \ --save-path $LOGDIR/checkpoints_legacy \ --benchmark ``` ### Mode 2: Shared vLLM (Single-Copy CUDA IPC) Zero model duplication - trainer and vLLM share the exact same GPU memory! ```bash # Terminal 1: Start the central API server run-api --port 8000 # Terminal 2: Start vLLM with shared weights enabled # IMPORTANT: --enforce-eager is REQUIRED to disable CUDA graphs # Without it, weight updates won't be visible to inference! VLLM_ENABLE_SHARED_WEIGHTS=1 LOGDIR=$LOGDIR \ CUDA_VISIBLE_DEVICES=0 python example_trainer/vllm_api_server.py \ --model $MODEL \ --port 9001 \ --gpu-memory-utilization 0.45 \ --enforce-eager # Terminal 3: Start the environment server python -u environments/gsm8k_server.py serve \ --env.tokenizer_name $MODEL \ --env.use_wandb=False \ --openai.model_name $MODEL \ --openai.base_url http://localhost:9001/v1 \ --openai.server_type vllm # Terminal 4: Run training (attaches to vLLM's tensors) CUDA_VISIBLE_DEVICES=0 python -m example_trainer.grpo \ --model-name $MODEL \ --weight-bridge-mode shared_vllm \ --vllm-port 9001 \ --vllm-config-path $LOGDIR/vllm_bridge_config.json \ --atropos-url http://localhost:8000 \ --training-steps 20 \ --batch-size 2 \ --save-path $LOGDIR/checkpoints_shared \ --benchmark ``` ### Mode 3: LoRA (Adapter Training) Fast training with hot-swappable adapters. ```bash # Terminal 1: Start the central API server run-api --port 8000 # Terminal 2: Start vLLM with LoRA support CUDA_VISIBLE_DEVICES=0 python example_trainer/vllm_api_server.py \ --model $MODEL \ --port 9001 \ --gpu-memory-utilization 0.45 \ --enable-lora \ --max-lora-rank 32 \ --enforce-eager # Terminal 3: Start the environment server python -u environments/gsm8k_server.py serve \ --env.tokenizer_name $MODEL \ --env.use_wandb=False \ --openai.model_name $MODEL \ --openai.base_url http://localhost:9001/v1 \ --openai.server_type vllm # Terminal 4: Run LoRA training CUDA_VISIBLE_DEVICES=1 python -m example_trainer.grpo \ --model-name $MODEL \ --weight-bridge-mode lora_only \ --vllm-port 9001 \ --atropos-url http://localhost:8000 \ --lora-r 16 \ --lora-alpha 32 \ --training-steps 20 \ --batch-size 2 \ --save-path $LOGDIR/checkpoints_lora \ --benchmark ``` --- ## 🔬 Run All 3 Modes in Parallel (8-GPU Comparison) Use this setup to compare training efficiency across all three modes on a single 8-GPU node. ### GPU & Port Allocation | Mode | GPUs | vLLM Port | API Port | Env Port | |------|------|-----------|----------|----------| | Legacy | 0-1 | 9001 | 8001 | (internal) | | Shared vLLM | 2-3 | 9002 | 8002 | (internal) | | LoRA | 4-5 | 9003 | 8003 | (internal) | ### Quick Start: Use the Comparison Script ```bash cd /path/to/atropos # Run comparison with default 50 steps ./example_trainer/scripts/run_comparison.sh # Or specify steps ./example_trainer/scripts/run_comparison.sh 100 ``` ### Manual Parallel Execution If you prefer to run each mode manually in separate terminal sessions: ```bash # Setup export MODEL="Qwen/Qwen2.5-3B-Instruct" export LOGDIR=/tmp/atropos_test mkdir -p $LOGDIR # ============================================= # LEGACY MODE (Terminals 1-3) # ============================================= # Terminal 1: API server for legacy run-api --port 8001 # Terminal 2: Environment server python -u environments/gsm8k_server.py serve \ --env.tokenizer_name $MODEL \ --env.use_wandb=False \ --env.rollout_server_url http://localhost:8001 \ --openai.model_name $MODEL \ --openai.base_url http://localhost:9001/v1 \ --openai.server_type vllm \ --slurm false # Terminal 3: Trainer (manages its own vLLM) CUDA_VISIBLE_DEVICES=0,1 python -m example_trainer.grpo \ --model-name $MODEL \ --weight-bridge-mode none \ --vllm-port 9001 \ --atropos-url http://localhost:8001 \ --training-steps 50 \ --save-path $LOGDIR/checkpoints_legacy \ --benchmark # ============================================= # SHARED VLLM MODE (Terminals 4-7) # ============================================= # Terminal 4: API server for shared run-api --port 8002 # Terminal 5: vLLM server with shared weights VLLM_ENABLE_SHARED_WEIGHTS=1 LOGDIR=$LOGDIR \ CUDA_VISIBLE_DEVICES=2 python example_trainer/vllm_api_server.py \ --model $MODEL --port 9002 --gpu-memory-utilization 0.45 # Terminal 6: Environment server python -u environments/gsm8k_server.py serve \ --env.tokenizer_name $MODEL \ --env.use_wandb=False \ --env.rollout_server_url http://localhost:8002 \ --openai.model_name $MODEL \ --openai.base_url http://localhost:9002/v1 \ --openai.server_type vllm \ --slurm false # Terminal 7: Trainer (attaches to vLLM) CUDA_VISIBLE_DEVICES=2 python -m example_trainer.grpo \ --model-name $MODEL \ --weight-bridge-mode shared_vllm \ --vllm-port 9002 \ --vllm-config-path $LOGDIR/vllm_bridge_config.json \ --atropos-url http://localhost:8002 \ --training-steps 50 \ --save-path $LOGDIR/checkpoints_shared \ --benchmark # ============================================= # LORA MODE (Terminals 8-11) # ============================================= # Terminal 8: API server for LoRA run-api --port 8003 # Terminal 9: vLLM server with LoRA CUDA_VISIBLE_DEVICES=4 python example_trainer/vllm_api_server.py \ --model $MODEL --port 9003 --gpu-memory-utilization 0.45 \ --enable-lora --max-lora-rank 32 --enforce-eager # Terminal 10: Environment server python -u environments/gsm8k_server.py serve \ --env.tokenizer_name $MODEL \ --env.use_wandb=False \ --env.rollout_server_url http://localhost:8003 \ --openai.model_name $MODEL \ --openai.base_url http://localhost:9003/v1 \ --openai.server_type vllm \ --slurm false # Terminal 11: Trainer CUDA_VISIBLE_DEVICES=5 python -m example_trainer.grpo \ --model-name $MODEL \ --weight-bridge-mode lora_only \ --vllm-port 9003 \ --atropos-url http://localhost:8003 \ --lora-r 16 --lora-alpha 32 \ --training-steps 50 \ --save-path $LOGDIR/checkpoints_lora \ --benchmark ``` --- ## 📊 Understanding the Benchmark Output Each trainer outputs a benchmark summary at the end: ``` ====================================================================== BENCHMARK SUMMARY (shared_vllm) ====================================================================== Total training time: 168.65s (2.81 min) Total steps: 50 TIMING BREAKDOWN: Avg step time: 11.95s Total step time: 59.76s Avg sync time: 0.00s (x0 syncs) <-- No syncs in shared mode! Total sync time: 0.00s Avg data fetch time: 10.90s Total data fetch time: 54.52s MEMORY: Peak GPU memory: 31.44 GB Avg GPU memory: 18.88 GB ====================================================================== ``` **Key metrics to compare:** | Metric | Legacy | Shared vLLM | LoRA | |--------|--------|-------------|------| | **Sync time** | High (restart vLLM) | 0 (in-place update) | Low (adapter swap) | | **GPU memory** | 2x model | 1x model | 1x + adapter | | **Step time** | ~10-15s | ~10-15s | ~5-10s | | **Checkpoint size** | ~6GB | ~6GB | ~50MB | --- ## 🛠 CLI Reference ```bash python -m example_trainer.grpo --help ``` ### Key Arguments | Argument | Default | Description | |----------|---------|-------------| | `--model-name` | (required) | HuggingFace model ID | | `--weight-bridge-mode` | `none` | `none`, `shared_vllm`, or `lora_only` | | `--training-steps` | 10 | Number of training steps | | `--batch-size` | 2 | Batch size | | `--vllm-port` | 9001 | vLLM server port | | `--atropos-url` | `http://localhost:8000` | Atropos API server URL | | `--save-path` | `trained_model_checkpoints` | Checkpoint directory | | `--benchmark` | false | Show timing stats | | `--debug-loading` | false | Verbose model loading output | ### LoRA-specific Arguments | Argument | Default | Description | |----------|---------|-------------| | `--lora-r` | 16 | LoRA rank | | `--lora-alpha` | 32 | LoRA alpha (scaling) | | `--lora-dropout` | 0.05 | LoRA dropout | | `--lora-target-modules` | `q_proj v_proj` | Modules to apply LoRA | ### Single-Copy Mode Arguments | Argument | Default | Description | |----------|---------|-------------| | `--single-copy` | false | Enable CUDA IPC mode | | `--vllm-config-path` | auto-detect | Path to `vllm_bridge_config.json` | --- ## 🐛 Troubleshooting ### "Atropos API not reachable" ```bash # Make sure run-api is running run-api --port 8000 ``` ### "vLLM server not running" (LoRA mode) ```bash # LoRA mode requires external vLLM with --enable-lora python example_trainer/vllm_api_server.py \ --model $MODEL --port 9001 --enable-lora --enforce-eager ``` ### "Could not find vllm_bridge_config.json" (Shared mode) ```bash # Make sure vLLM was started with VLLM_ENABLE_SHARED_WEIGHTS=1 and LOGDIR set VLLM_ENABLE_SHARED_WEIGHTS=1 LOGDIR=/tmp/atropos python example_trainer/vllm_api_server.py ... ``` ### "LogProb Alignment: MISMATCH!" in shared_vllm mode If you see `[MISMATCH!]` in the logprob alignment output, inference and training are seeing different weights. This is usually caused by **CUDA graphs**. **Symptom:** `inference_mean` stays constant while `training_mean` changes. The `diff` increases over time. **Fix:** Add `--enforce-eager` when starting vLLM: ```bash VLLM_ENABLE_SHARED_WEIGHTS=1 LOGDIR=$LOGDIR \ python example_trainer/vllm_api_server.py \ --model $MODEL --port 9001 --enforce-eager # <-- REQUIRED! ``` **Why:** CUDA graphs "bake" model weights into compiled graphs at startup. Updates to the underlying tensors are NOT reflected in inference. Using `--enforce-eager` disables CUDA graphs, so vLLM reads from the shared tensors on every forward pass. ### "Triton compilation error" on B200/Blackwell GPUs The patched vLLM server (`vllm_api_server.py`) automatically applies B200 fixes. If using standard vLLM, add `--enforce-eager`. ### Port already in use ```bash # Kill existing processes pkill -f "run-api" pkill -f "vllm_api_server.py" pkill -f "gsm8k_server.py" ``` ### No batches available / trainer hangs ```bash # Ensure the environment server is connected to the correct API and vLLM # Check that vLLM is running and environment can reach it curl http://localhost:9001/health curl http://localhost:8000/info ``` --- ## 📚 Module Documentation ### `config.py` Contains `TrainingConfig` - all training parameters as a Pydantic model. ### `api.py` - `check_atropos_api()` - Wait for run-api server - `register_trainer()` - Register with Atropos - `get_batch()` - Fetch training batch from run-api ### `data.py` - `pad_data_to_good_offset()` - Pad sequences to GPU-friendly lengths - `get_data()` - Fetch and preprocess batches ### `model.py` - `load_model_and_tokenizer()` - Load model based on mode - `_attach_to_vllm_shared_tensors()` - CUDA IPC attachment - `_create_vllm_to_hf_mapping()` - Handle QKV/Gate-Up fusion ### `training.py` - `compute_grpo_loss()` - GRPO loss computation - `run_training_step()` - Single step with gradient accumulation - `log_metrics()` - Console and WandB logging - `finalize_training()` - Cleanup and summary ### `checkpointing.py` - `save_checkpoint()` - Save full model - `save_lora_checkpoint()` - Save LoRA adapter only ### `vllm_manager.py` - `launch_vllm_server()` - Start vLLM process - `terminate_vllm_process()` - Stop vLLM - `hotswap_lora_adapter()` - Hot-swap LoRA in vLLM ### `trainers.py` - `train_legacy()` - Checkpoint + restart mode - `train_shared_vllm()` - Single-copy CUDA IPC mode - `train_lora()` - Adapter training mode ### `cli.py` - `parse_args()` - Argparse setup - `config_from_args()` - Convert args to TrainingConfig --- ## 📝 License MIT License