mirror of https://github.com/NousResearch/atropos.git (synced 2026-04-19 12:57:58 +00:00)

# GRPO Example Trainer

This directory contains an example script (`grpo.py`) demonstrating how to integrate a custom training loop with the Atropos API for reinforcement learning using the GRPO (Group Relative Policy Optimization) algorithm.

## Training Modes

The trainer supports three weight-synchronization modes:

| Mode | Description | Sync Latency | Best For |
|------|-------------|--------------|----------|
| **Legacy** (`none`) | Save checkpoints, restart vLLM | ~30-60 seconds | Simple setups, debugging |
| **Single-Copy** (`shared_vllm`) | Direct CUDA IPC, ONE model copy! | 0 ms | Production, memory efficiency |
| **LoRA** (`lora_only`) | Train adapters, hot-swap | ~1-5 seconds | Memory-constrained, fast iteration |

---

## Quick Start with GSM8k (Single-Copy Mode)

This is the **recommended** production setup for maximum training throughput and memory efficiency.

### Prerequisites

```bash
# Install dependencies
pip install -r example_trainer/requirements.txt

# Install GSM8k environment dependencies
pip install datasets latex2sympy2_extended math_verify
```

### Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                      SINGLE-COPY TRAINING ARCHITECTURE                      │
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────────────┐  │
│  │  GSM8k Env  │───▶│ Atropos API │◀───│          GRPO Trainer           │  │
│  │ (problems)  │    │ (batching)  │    │ - Attached to vLLM's tensors    │  │
│  └─────────────┘    └─────────────┘    │ - optimizer.step() updates both │  │
│         │                              └─────────────────────────────────┘  │
│         │                                               │                   │
│         │                                               │ CUDA IPC          │
│         │                                               │ (same memory!)    │
│         ▼                                               ▼                   │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                     vLLM Inference Server (GPU 0)                     │  │
│  │  - Model weights in GPU memory                                        │  │
│  │  - Trainer sees same tensors via IPC                                  │  │
│  │  - Generates rollouts for scoring                                     │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
```

### How Single-Copy Mode Works

```
┌────────────────────────────────────────────────────────────┐
│                        SAME GPU(s)                         │
│                                                            │
│      ┌──────────────────────────────────────────────┐      │
│      │            SHARED MODEL TENSORS              │      │
│      │       (only ONE copy in GPU memory!)         │      │
│      └──────────────────────────────────────────────┘      │
│              ▲                        ▲                    │
│              │ Reads/Writes           │ Reads              │
│      ┌───────┴────────┐      ┌────────┴───────┐            │
│      │    Trainer     │      │      vLLM      │            │
│      │  (gradients)   │      │  (inference)   │            │
│      └────────────────┘      └────────────────┘            │
│              │                                             │
│              │ optimizer.step()                            │
│              │ (updates shared tensors in-place)           │
│              ▼                                             │
│         vLLM immediately sees new weights!                 │
└────────────────────────────────────────────────────────────┘
```

- **Memory**: 1x model size (truly shared via CUDA IPC!)
- **Sync Latency**: 0 ms (same memory, no copy needed)
- **Requirement**: Trainer and vLLM on the SAME GPU(s)
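
The in-place visibility this mode relies on can be illustrated with a CPU analogy: two views of one buffer see the same bytes, so a write through one view is immediately visible through the other. This is only an analogy using Python's `memoryview`; the real mechanism shares CUDA storage across processes via IPC handles.

```python
from array import array

# One underlying buffer, analogous to the shared model tensors.
weights = array("f", [0.5, -1.0, 2.0])

# "Trainer" and "vLLM" both hold views of the SAME memory.
trainer_view = memoryview(weights)
vllm_view = memoryview(weights)

# The trainer updates a weight in-place (like optimizer.step()).
trainer_view[0] = 0.25

# The "inference" view sees the new value immediately - no copy, no sync.
print(vllm_view[0])  # 0.25
```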
---

### Step-by-Step Guide

**IMPORTANT: GPU Allocation**
- vLLM and the trainer run on the SAME GPU(s)
- Use `--tensor-parallel-size 1` for single-copy mode (TP > 1 is not yet supported)

---

#### Step 1: Kill Any Existing Processes

```bash
pkill -9 -u $USER -f "vllm|grpo|python|run-api" 2>/dev/null; sleep 3
```

Note: this pattern matches `python`, so it also kills any other Python processes you own.

#### Step 2: Setup Directory

```bash
cd ~/atropos_stuff/atropos
rm -f vllm_bridge_config.json vllm.log trainer.log api.log gsm8k.log
```

#### Step 3: Set Environment Variables

```bash
export VLLM_ENABLE_SHARED_WEIGHTS=1
export VLLM_SKIP_WEIGHT_DAEMON=1
export NUM_INFERENCE_NODES=0
export LOGDIR=.
```

#### Step 4: Start vLLM Server

```bash
CUDA_VISIBLE_DEVICES=0 python -u example_trainer/vllm_api_server.py \
    --model Qwen/Qwen2.5-14B-Instruct \
    --tensor-parallel-size 1 \
    --port 9001 \
    > vllm.log 2>&1 &
echo "vLLM starting on GPU 0..."
```

#### Step 5: Wait for vLLM to Load

```bash
tail -f vllm.log
```

Wait until you see: `Uvicorn running on http://0.0.0.0:9001`

Then press **Ctrl+C** to stop tailing.
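
Instead of watching the log by hand, you can poll the server until it answers HTTP requests. A minimal sketch (the `/health` path is an assumption; adjust the URL to whatever endpoint your server exposes once Uvicorn is up):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url: str, deadline_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll `url` until it responds to HTTP, or return False when the deadline passes."""
    deadline = time.monotonic() + deadline_s
    while time.monotonic() < deadline:
        try:
            urllib.request.urlopen(url, timeout=5)
            return True
        except urllib.error.HTTPError:
            return True  # the server is up, even if this particular path 404s
        except (urllib.error.URLError, OSError):
            time.sleep(interval_s)  # not listening yet; retry
    return False

# Usage after starting vLLM:
#   wait_for_server("http://localhost:9001/health")
```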
#### Step 6: Verify IPC Handles Exported

```bash
grep -E "IPC|Exported|single_copy" vllm.log
```

You should see:
```
[vLLM Patch] Exported X IPC handles for single-copy mode
[vLLM Patch] ✓ Exported 339 params to vllm_bridge_config.json
```

#### Step 7: Start GSM8K Environment

```bash
python environments/gsm8k_server.py serve \
    --slurm False \
    --openai.model_name Qwen/Qwen2.5-14B-Instruct \
    --openai.base_url http://localhost:9001/v1 \
    --openai.server_type vllm \
    --openai.api_key x \
    --env.tokenizer_name Qwen/Qwen2.5-14B-Instruct \
    --env.use_wandb False \
    > gsm8k.log 2>&1 &
echo "GSM8K environment started"
sleep 10
```

#### Step 8: Start Trainer (Same GPU as vLLM!)

```bash
CUDA_VISIBLE_DEVICES=0 LOGDIR=. python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-14B-Instruct \
    --weight-bridge-mode shared_vllm \
    --training-steps 100 \
    2>&1 | tee trainer.log
```

#### Step 9: Monitor Training

```bash
tail -f trainer.log
```

You should see:
```
[Setup] ✓ Attached 195 tensors to vLLM's shared memory
[Setup] ✓ Single-copy mode active - using vLLM's tensors directly!
[2/2] Starting training for 100 steps
Step 1/100
[SINGLE-COPY] Weights updated in-place - step 1
```

---

### Quick Copy-Paste (All-in-One)

```bash
# Kill everything and setup
pkill -9 -u $USER -f "vllm|grpo|python" 2>/dev/null; sleep 3
cd ~/atropos_stuff/atropos
rm -f vllm_bridge_config.json *.log

# Environment variables
export VLLM_ENABLE_SHARED_WEIGHTS=1 VLLM_SKIP_WEIGHT_DAEMON=1 NUM_INFERENCE_NODES=0 LOGDIR=.

# Start vLLM
CUDA_VISIBLE_DEVICES=0 python -u example_trainer/vllm_api_server.py \
    --model Qwen/Qwen2.5-14B-Instruct --tensor-parallel-size 1 --port 9001 > vllm.log 2>&1 &
echo "Waiting 90s for vLLM..."; sleep 90

# Start GSM8k environment
python environments/gsm8k_server.py serve --slurm False \
    --openai.model_name Qwen/Qwen2.5-14B-Instruct \
    --openai.base_url http://localhost:9001/v1 \
    --openai.server_type vllm --openai.api_key x \
    --env.tokenizer_name Qwen/Qwen2.5-14B-Instruct \
    --env.use_wandb False > gsm8k.log 2>&1 &
sleep 10

# Start trainer (same GPU!)
CUDA_VISIBLE_DEVICES=0 LOGDIR=. python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-14B-Instruct \
    --weight-bridge-mode shared_vllm \
    --training-steps 100 \
    2>&1 | tee trainer.log
```

---

## How Each Mode Works (Data Flow Diagrams)

### Single-Copy Mode (`--weight-bridge-mode shared_vllm`) ⭐ RECOMMENDED

**The Magic**: Trainer and vLLM share the EXACT SAME GPU memory via CUDA IPC.

```
┌─────────────────────────────────────────────────────────────────────────────────────┐
│                        SINGLE-COPY MODE - COMPLETE DATA FLOW                        │
│                                                                                     │
│  STEP 1: GSM8k sends problem                                                        │
│  ┌──────────────────┐                                                               │
│  │   GSM8k Server   │──── "What is 15 × 7?" ────▶┌──────────────────┐               │
│  │  (Environment)   │                            │   Atropos API    │               │
│  └──────────────────┘                            │    (Batching)    │               │
│                                                  └────────┬─────────┘               │
│                                                           │                         │
│  STEP 2: Atropos forwards to vLLM                         │                         │
│                                                           ▼                         │
│  ┌──────────────────────────────────────────────────────────────────────────────┐   │
│  │                                  GPU MEMORY                                  │   │
│  │                                                                              │   │
│  │  ┌────────────────────────────────────────────────────────────────────────┐  │   │
│  │  │                   MODEL WEIGHTS (ONE COPY - SHARED!)                   │  │   │
│  │  │                                                                        │  │   │
│  │  │      embed_tokens.weight, layers.*.qkv_proj, ..., lm_head.weight       │  │   │
│  │  │                       (address: 0x7f8a12340000)                        │  │   │
│  │  └────────────────────────────────────────────────────────────────────────┘  │   │
│  │           ▲                                        ▲                         │   │
│  │           │ STEP 3: READ                           │ STEP 6: WRITE           │   │
│  │           │ (generate tokens)                      │ (optimizer.step)        │   │
│  │  ┌────────┴────────┐                     ┌─────────┴─────────┐               │   │
│  │  │   vLLM Server   │                     │      Trainer      │               │   │
│  │  │                 │                     │     (grpo.py)     │               │   │
│  │  │   Generates:    │                     │                   │               │   │
│  │  │ "15 × 7 = 105"  │                     │  STEP 5: Compute  │               │   │
│  │  │                 │                     │   GRPO loss &     │               │   │
│  │  └────────┬────────┘                     │    gradients      │               │   │
│  │           │                              └─────────▲─────────┘               │   │
│  └───────────┼──────────────────────────────────────────────┼───────────────────┘   │
│              │                                       │                              │
│  STEP 4: Return completion                           │                              │
│              ▼                                       │                              │
│  ┌──────────────────┐                                │                              │
│  │   GSM8k Server   │────────────────────────────────┘                              │
│  │    (Scoring)     │                                                               │
│  │                  │  Scores: "15 × 7 = 105" ✓  reward=1.0                         │
│  │                  │          "15 × 7 = 100" ✗  reward=0.0                         │
│  └──────────────────┘                                                               │
│                                                                                     │
│  STEP 7: IMMEDIATE UPDATE                                                           │
│  ┌─────────────────────────────────────────────────────────────────────────────┐    │
│  │     After optimizer.step(), vLLM's NEXT inference uses the NEW weights!     │    │
│  │                   NO SYNC NEEDED - it's the same memory!                    │    │
│  └─────────────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────────────┘
```

**Key Points:**
- ✅ ONE copy of weights in GPU memory
- ✅ 0 ms sync latency (same memory!)
- ✅ Memory efficient (~1x model size)
- ⚠️ Requires same GPU for trainer and vLLM
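
The scoring step in the diagram (reward 1.0 for a correct answer, 0.0 otherwise) can be sketched as a verifier that compares the last number in the completion against the gold answer. This is an illustrative stand-in, not the actual GSM8k environment's scorer (which uses `math_verify` for robust answer checking):

```python
import re

def reward(completion: str, gold_answer: str) -> float:
    """Return 1.0 if the last number in the completion matches the gold answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == gold_answer else 0.0

print(reward("15 × 7 = 105", "105"))  # 1.0
print(reward("15 × 7 = 100", "105"))  # 0.0
```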
---

### LoRA Mode (`--weight-bridge-mode lora_only`)

**The Idea**: Freeze the base model and train only small adapter layers. Hot-swap adapters into vLLM.

```
┌─────────────────────────────────────────────────────────────────────────────────────┐
│                           LORA MODE - COMPLETE DATA FLOW                            │
│                                                                                     │
│  STEP 1: GSM8k sends problem                                                        │
│  ┌──────────────────┐                                                               │
│  │   GSM8k Server   │──── "What is 15 × 7?" ────▶┌──────────────────┐               │
│  │  (Environment)   │                            │   Atropos API    │               │
│  └──────────────────┘                            └────────┬─────────┘               │
│                                                           │                         │
│  STEP 2: Forward to vLLM                                  ▼                         │
│  ┌──────────────────────────────────────────────────────────────────────────────┐   │
│  │                               vLLM GPU MEMORY                                │   │
│  │  ┌────────────────────────────────────────────────────────────────────────┐  │   │
│  │  │                       BASE MODEL (frozen, ~6GB)                        │  │   │
│  │  │                   + LORA ADAPTER A (current, ~50MB)                    │  │   │
│  │  └────────────────────────────────────────────────────────────────────────┘  │   │
│  │           │                                                                  │   │
│  │           │ STEP 3: Inference with base + adapter A                          │   │
│  │           ▼                                                                  │   │
│  │ ┌────────────────────┐                                                       │   │
│  │ │    vLLM Server     │ ──── "15 × 7 = 105" ────▶                             │   │
│  │ └────────────────────┘                                                       │   │
│  └──────────────────────────────────────────────────────────────────────────────┘   │
│                                                                                     │
│  ┌──────────────────────────────────────────────────────────────────────────────┐   │
│  │                        TRAINER GPU MEMORY (separate!)                        │   │
│  │  ┌────────────────────────────────────────────────────────────────────────┐  │   │
│  │  │                       BASE MODEL (frozen, ~6GB)                        │  │   │
│  │  │    + LORA ADAPTER B (training, ~50MB) ◀── gradients flow here only!    │  │   │
│  │  └────────────────────────────────────────────────────────────────────────┘  │   │
│  │           │                                                                  │   │
│  │           │ STEP 4-5: Receive rollout, compute loss, update adapter B        │   │
│  │           ▼                                                                  │   │
│  │ ┌────────────────────┐                                                       │   │
│  │ │      Trainer       │                                                       │   │
│  │ │     (grpo.py)      │                                                       │   │
│  │ └─────────┬──────────┘                                                       │   │
│  └───────────┼──────────────────────────────────────────────────────────────────┘   │
│              │                                                                      │
│              │ STEP 6: Every N steps, save adapter B to disk                        │
│              ▼                                                                      │
│  ┌──────────────────┐         STEP 7: POST /lora/load         ┌──────────────────┐  │
│  │ adapter_step_N/  │ ─────────────────────────────────────▶  │   vLLM Server    │  │
│  │ (50MB on disk)   │                                         │   Swaps A → B    │  │
│  └──────────────────┘                                         └──────────────────┘  │
│                                                                                     │
│  STEP 8: Next inference uses NEW adapter B                                          │
│  ┌─────────────────────────────────────────────────────────────────────────────┐    │
│  │            Sync latency: 1-5 seconds (save to disk + HTTP load)             │    │
│  │                      Memory: 2x base model + adapters                       │    │
│  └─────────────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────────────┘
```

**Key Points:**
- ✅ Small adapter files (~50MB vs ~28GB)
- ✅ Works on separate GPUs
- ✅ Easy to switch between adapters
- ⚠️ 1-5 second sync latency
- ⚠️ 2x base model memory (trainer + vLLM)
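
Step 7 in the diagram is a plain HTTP call. Here is a sketch of building that request with the standard library; the `/lora/load` endpoint name comes from the diagram above, but the JSON field names are assumptions — check `vllm_api_server.py` for the actual schema before using this.

```python
import json
import urllib.request

def build_lora_load_request(base_url: str, adapter_path: str) -> urllib.request.Request:
    """Build (but do not send) the POST /lora/load request for hot-swapping an adapter."""
    # NOTE: "lora_path" is a hypothetical field name for illustration.
    payload = json.dumps({"lora_path": adapter_path}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/lora/load",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_lora_load_request(
    "http://localhost:9001", "trained_model_checkpoints/adapter_step_10"
)
print(req.get_method(), req.full_url)  # POST http://localhost:9001/lora/load
# Sending it would be: urllib.request.urlopen(req)
```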
---

### Legacy Mode (`--weight-bridge-mode none`)

**The Simple Approach**: Save full checkpoints, restart vLLM to load new weights.

```
┌─────────────────────────────────────────────────────────────────────────────────────┐
│                          LEGACY MODE - COMPLETE DATA FLOW                           │
│                                                                                     │
│  STEP 1: GSM8k sends problem                                                        │
│  ┌──────────────────┐                                                               │
│  │   GSM8k Server   │──── "What is 15 × 7?" ────▶┌──────────────────┐               │
│  │  (Environment)   │                            │   Atropos API    │               │
│  └──────────────────┘                            └────────┬─────────┘               │
│                                                           │                         │
│  STEP 2: Forward to vLLM                                  ▼                         │
│  ┌──────────────────────────────────────────────────────────────────────────────┐   │
│  │                               vLLM GPU MEMORY                                │   │
│  │  ┌────────────────────────────────────────────────────────────────────────┐  │   │
│  │  │                     FULL MODEL - Version 1 (~28GB)                     │  │   │
│  │  └────────────────────────────────────────────────────────────────────────┘  │   │
│  │           │                                                                  │   │
│  │           │ STEP 3: Inference                                                │   │
│  │           ▼                                                                  │   │
│  │ ┌────────────────────┐                                                       │   │
│  │ │    vLLM Server     │ ──── "15 × 7 = 105" ────▶                             │   │
│  │ └────────────────────┘                                                       │   │
│  └──────────────────────────────────────────────────────────────────────────────┘   │
│                                                                                     │
│  ┌──────────────────────────────────────────────────────────────────────────────┐   │
│  │                        TRAINER GPU MEMORY (separate!)                        │   │
│  │  ┌────────────────────────────────────────────────────────────────────────┐  │   │
│  │  │         FULL MODEL - Version 2 (~28GB + gradients + optimizer)         │  │   │
│  │  └────────────────────────────────────────────────────────────────────────┘  │   │
│  │           │                                                                  │   │
│  │           │ STEP 4-5: Receive rollout, compute loss, update weights          │   │
│  │           ▼                                                                  │   │
│  │ ┌────────────────────┐                                                       │   │
│  │ │      Trainer       │                                                       │   │
│  │ │     (grpo.py)      │                                                       │   │
│  │ └─────────┬──────────┘                                                       │   │
│  └───────────┼──────────────────────────────────────────────────────────────────┘   │
│              │                                                                      │
│              │ STEP 6: Every N steps, save FULL checkpoint to disk (~28GB)          │
│              ▼                                                                      │
│  ┌──────────────────────┐                                                           │
│  │     checkpoint/      │                                                           │
│  │      step_N/         │  (28GB on disk!)                                          │
│  │  - model.safetensors │                                                           │
│  │  - config.json       │                                                           │
│  └───────────┬──────────┘                                                           │
│              │                                                                      │
│              │ STEP 7: RESTART vLLM with new checkpoint                             │
│              │                                                                      │
│              │    ┌─────────────────────────────────────────────────────────────┐   │
│              │    │  1. Kill vLLM process                                       │   │
│              │    │  2. Start new vLLM with --model checkpoint/step_N/          │   │
│              │    │  3. Wait for model to load (~30-60 seconds)                 │   │
│              │    │  4. Resume training                                         │   │
│              │    └─────────────────────────────────────────────────────────────┘   │
│              ▼                                                                      │
│  ┌──────────────────────────────────────────────────────────────────────────────┐   │
│  │                         vLLM GPU MEMORY (restarted)                          │   │
│  │  ┌────────────────────────────────────────────────────────────────────────┐  │   │
│  │  │            FULL MODEL - Version 2 (loaded from checkpoint)             │  │   │
│  │  └────────────────────────────────────────────────────────────────────────┘  │   │
│  └──────────────────────────────────────────────────────────────────────────────┘   │
│                                                                                     │
│  STEP 8: Next inference uses updated model                                          │
│  ┌─────────────────────────────────────────────────────────────────────────────┐    │
│  │            Sync latency: 30-60 seconds (save + restart + reload)            │    │
│  │                            Memory: 2x full model                            │    │
│  │                          Disk: 28GB per checkpoint                          │    │
│  └─────────────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────────────┘
```

**Key Points:**
- ✅ Simple to understand
- ✅ Works on any setup
- ✅ Good for debugging
- ⚠️ 30-60 second sync latency
- ⚠️ 2x GPU memory (trainer + vLLM)
- ⚠️ Large checkpoint files (~28GB each)

---

## Mode Comparison Summary

```
┌──────────────────────────────────────────────────────────────────────────────────┐
│                           MODE COMPARISON AT A GLANCE                            │
├────────────────┬───────────────┬────────────────┬────────────────────────────────┤
│                │ SINGLE-COPY   │ LORA           │ LEGACY                         │
├────────────────┼───────────────┼────────────────┼────────────────────────────────┤
│ Sync Latency   │ 0 ms ⚡       │ 1-5 sec        │ 30-60 sec                      │
│ GPU Memory     │ 1x model      │ 2x model       │ 2x model                       │
│ Disk Space     │ 28GB/ckpt     │ 50MB/adapter   │ 28GB/ckpt                      │
│ Complexity     │ Medium        │ Medium         │ Simple                         │
│ Same GPU?      │ Required ⚠️   │ Optional       │ Optional                       │
│ Best For       │ Production    │ Experiments    │ Debugging                      │
└────────────────┴───────────────┴────────────────┴────────────────────────────────┘
```

---

## Alternative Mode Commands

### Legacy Mode (Checkpoint + Restart)

For simple setups or debugging. Saves checkpoints and restarts vLLM to load new weights.

```bash
python example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-3B-Instruct \
    --weight-bridge-mode none \
    --training-steps 100 \
    --vllm-restart-interval 10 \
    --batch-size 2 \
    --lr 1e-5
```

### LoRA Mode (Adapter Training)

Trains only adapter weights. Small checkpoints, lower memory.

```bash
python example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-3B-Instruct \
    --weight-bridge-mode lora_only \
    --lora-r 16 \
    --lora-alpha 32 \
    --training-steps 100 \
    --batch-size 2 \
    --lr 1e-4
```

---

## Configuration Reference

### Environment Variables

| Variable | Required | Description | Example |
|----------|----------|-------------|---------|
| `VLLM_ENABLE_SHARED_WEIGHTS` | Yes (single-copy) | Enable vLLM patching for IPC | `1` |
| `VLLM_SKIP_WEIGHT_DAEMON` | Yes (single-copy) | Skip NCCL daemon (not needed) | `1` |
| `NUM_INFERENCE_NODES` | Yes | Number of vLLM nodes (0 = local) | `0` |
| `LOGDIR` | Recommended | Directory for `vllm_bridge_config.json` | `.` |
| `CUDA_VISIBLE_DEVICES` | Recommended | GPU allocation | `0` |

### Trainer CLI Options

| Option | Default | Description |
|--------|---------|-------------|
| `--model-name` | (required) | HuggingFace model ID |
| `--weight-bridge-mode` | `none` | `none`, `shared_vllm`, or `lora_only` |
| `--single-copy` | `false` | Enable TRUE single-copy mode via CUDA IPC |
| `--vllm-config-path` | (auto-detect) | Explicit path to `vllm_bridge_config.json` |
| `--vllm-port` | `9001` | vLLM server port |
| `--training-steps` | `10` | Total optimization steps |
| `--batch-size` | `2` | Micro-batch size |
| `--lr` | `1e-5` | Learning rate |
| `--save-path` | `trained_model_checkpoints` | Checkpoint directory |

### vLLM Server Options

| Option | Description |
|--------|-------------|
| `--model` | HuggingFace model ID |
| `--tensor-parallel-size` | Number of GPUs (use 1 for single-copy) |
| `--port` | Server port (default: 9001) |
| `--dtype` | Model dtype (`bfloat16`, `float16`, `auto`) |
| `--gpu-memory-utilization` | Fraction of GPU memory for KV cache (default: 0.9) |

---

## The vLLM Bridge Config (vllm_bridge_config.json)

The `vllm_bridge_config.json` file is the critical communication mechanism between the vLLM inference server and the GRPO trainer in single-copy mode. Understanding this file is essential for debugging and advanced configurations.

### What It Is

When you start vLLM with `VLLM_ENABLE_SHARED_WEIGHTS=1`, the patched `GPUModelRunner` exports CUDA IPC (inter-process communication) handles for all model tensors. These handles allow another process (the trainer) to access the exact same GPU memory, with no copying required.

### Why It's Important

1. **True Single-Copy Architecture**: Instead of loading the model twice (once for training, once for inference), both processes share the same tensors in GPU memory.

2. **Zero-Latency Weight Updates**: When `optimizer.step()` modifies the weights, vLLM immediately sees the changes: no serialization, no network transfer, no disk I/O.

3. **Memory Efficiency**: For a 7B model (~14GB in bf16), you save ~14GB of GPU memory compared to having two separate copies.
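
The arithmetic behind those estimates is simple: bf16 uses 2 bytes per parameter, so weight memory is parameter count times 2 bytes. A quick check:

```python
def weight_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in (decimal) GB for a given parameter count."""
    return num_params * bytes_per_param / 1e9

print(weight_gb(7e9))   # 14.0 -> ~14GB for a 7B model in bf16
print(weight_gb(14e9))  # 28.0 -> matches the ~28GB figures quoted for the 14B model
```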
### File Location

The trainer searches for `vllm_bridge_config.json` in this order:

1. **Explicit path** (if `--vllm-config-path` is provided)
2. **`$LOGDIR/vllm_bridge_config.json`** (if the `LOGDIR` env var is set)
3. **`./vllm_bridge_config.json`** (current directory)
4. **`/tmp/atropos_bridge/vllm_bridge_config.json`** (default fallback)

**Tip**: To avoid "Config not found" errors, always set `LOGDIR`:

```bash
export LOGDIR=.
```
### File Contents

The JSON file contains everything needed to reconstruct tensor references in another process:

```json
{
  "model": "Qwen/Qwen2.5-3B-Instruct",
  "tp_degree": 1,
  "dp_shard_degree": 1,

  "param_names": [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.qkv_proj.weight",
    ...
  ],

  "param_mappings": {
    "model.embed_tokens.weight": {
      "vllm_name": "model.embed_tokens.weight",
      "shape": [152064, 2048],
      "dtype": "torch.bfloat16",
      "device": "cuda:0"
    },
    ...
  },

  "ipc_handles": {
    "model.embed_tokens.weight": {
      "device_index": 0,
      "ipc_handle_b64": "AmPA0pN...",
      "storage_size": 623902720,
      "storage_offset": 0,
      "ref_counter_handle_b64": "Y2JY...",
      "ref_counter_offset": 0,
      "event_handle_b64": "wRIs...",
      "event_sync_required": true,
      "shape": [152064, 2048],
      "dtype": "torch.bfloat16"
    },
    ...
  },

  "shared_weights_enabled": true,
  "single_copy_enabled": true,
  "num_params": 255
}
```

#### Field Descriptions

| Field | Description |
|-------|-------------|
| `model` | HuggingFace model identifier |
| `tp_degree` | Tensor parallel degree (must be 1 for single-copy) |
| `param_names` | List of all parameter names in the model |
| `param_mappings` | Shape, dtype, and device info for each parameter |
| `ipc_handles` | CUDA IPC handles for reconstructing shared tensors |
| `ipc_handle_b64` | The actual CUDA IPC handle (base64-encoded bytes) |
| `ref_counter_handle_b64` | Reference counter for CUDA memory (base64) |
| `event_handle_b64` | CUDA event handle for synchronization (base64) |
| `storage_size` | Size of the underlying storage in bytes |
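
A quick way to sanity-check an exported config is to load it and summarize a few fields. This helper only reads keys documented above:

```python
import json

def summarize_bridge_config(config: dict) -> dict:
    """Pull out the fields most useful for debugging a bridge config."""
    return {
        "model": config.get("model"),
        "tp_degree": config.get("tp_degree"),
        "num_params": config.get("num_params"),
        "num_ipc_handles": len(config.get("ipc_handles", {})),
        "single_copy_enabled": config.get("single_copy_enabled", False),
    }

# Usage:
#   with open("vllm_bridge_config.json") as f:
#       print(summarize_bridge_config(json.load(f)))
```

A healthy single-copy config should show `tp_degree` of 1, `single_copy_enabled` true, and a non-zero `num_ipc_handles`.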
### How the Trainer Uses It

1. **Load Config**: The trainer reads `vllm_bridge_config.json`
2. **Create Shell Model**: Uses `AutoModelForCausalLM.from_config()` with meta tensors (no memory allocation)
3. **Attach IPC Handles**: For each parameter, reconstructs the tensor using `torch.UntypedStorage._new_shared_cuda()` with the IPC handles
4. **Verify Shapes**: Ensures the trainer's model architecture matches vLLM's sharding

```python
# Simplified version of what happens internally:
for name, ipc_info in config["ipc_handles"].items():
    # Decode the IPC handle from base64
    ipc_handle = base64.b64decode(ipc_info["ipc_handle_b64"])

    # Reconstruct the shared CUDA storage from the IPC handle
    storage = torch.UntypedStorage._new_shared_cuda(
        device_index, ipc_handle, storage_size, ...
    )

    # Wrap the shared storage in a tensor with the right dtype and shape
    # (a copy such as .to(dtype) would break sharing - the storage must be
    # used directly)
    tensor = torch.empty(0, dtype=dtype, device=device).set_(
        storage, storage_offset, shape
    )

    # Replace the model parameter with the shared tensor
    model.get_parameter(name).data = tensor
```

### Specifying the Config Path Explicitly

If auto-detection isn't working (e.g., in complex cluster setups), you can specify the path explicitly:

```bash
# If vLLM writes the config to a non-standard location:
python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-3B-Instruct \
    --weight-bridge-mode shared_vllm \
    --single-copy \
    --vllm-config-path /shared/nfs/vllm_bridge_config.json \
    --training-steps 50
```

### Common Issues

| Symptom | Cause | Fix |
|---------|-------|-----|
| "Could not find vllm_bridge_config.json" | vLLM didn't export the config | Check that `VLLM_ENABLE_SHARED_WEIGHTS=1` was set BEFORE starting vLLM |
| Config exists but has empty `ipc_handles` | Patch didn't run | Ensure vLLM is using our custom `vllm_api_server.py` |
| "tuple of 8 items expected" | IPC handle format mismatch | Update to the latest code (it handles all 8 CUDA IPC tuple components) |
| "size mismatch" errors | Tensor parallel mismatch | Use `--tensor-parallel-size 1` for single-copy mode |

---

## FAQ & Troubleshooting

### Q: I get "Could not find vllm_bridge_config.json"

**A:** vLLM didn't export the IPC handles. Check:

1. `VLLM_ENABLE_SHARED_WEIGHTS=1` was set **before** starting vLLM
2. `LOGDIR` is set to a valid, writable directory
3. Look for export messages in `vllm.log`:
```bash
grep "Exported" vllm.log
```

If the file exists but in a different location, specify it explicitly:
```bash
python grpo.py ... --vllm-config-path /path/to/vllm_bridge_config.json
```

---

### Q: I get "CUDA out of memory" when starting the trainer

**A:** For single-copy mode, the trainer and vLLM MUST be on the same GPU(s). Check:

```bash
# Both should use the same CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0 python ... vllm_api_server.py ...
CUDA_VISIBLE_DEVICES=0 python ... grpo.py ...
```

---

### Q: Trainer crashes with "Cannot copy out of meta tensor"

**A:** Some model buffers (such as rotary embeddings) weren't initialized. This is a known issue being fixed; update to the latest code.

---

### Q: Single-copy mode doesn't work with tensor-parallel > 1

**A:** Currently, single-copy mode only works with `--tensor-parallel-size 1`. For larger models that need tensor parallelism, use a single GPU with a smaller model, or wait for multi-GPU single-copy support.

---

### Q: How do I check GPU memory usage?

**A:**
```bash
nvidia-smi
```

For single-copy mode with Qwen2.5-14B:
- GPU 0: ~28GB (shared between vLLM and trainer)

---

### Q: How do I stop all processes?

**A:**
```bash
pkill -9 -u $USER -f "vllm|grpo|python|run-api"
```

---

## Files in This Directory

| File | Description |
|------|-------------|
| `grpo.py` | Main trainer script with all modes |
| `vllm_api_server.py` | Custom vLLM server with shared-memory patches |
| `vllm_patching/` | vLLM patches for CUDA IPC support |
| `requirements.txt` | Python dependencies |
| `README.md` | This documentation |

### vllm_patching/ Directory

| File | Description |
|------|-------------|
| `__init__.py` | Module exports and patch application |
| `patched_gpu_runner.py` | Patches `GPUModelRunner` to export CUDA IPC handles |

---

## Performance Comparison

| Mode | Sync Latency | Memory (14B model) | Best For |
|------|--------------|--------------------|----------|
| **Legacy** | 30-60 s | 2x model | Debugging |
| **Single-Copy** | 0 ms | 1x model (shared!) | Production |
| **LoRA** | ~1-5 s | 2x base model + adapters | Memory-constrained |

---

## Checkpoint Locations

| Mode | Location | Size |
|------|----------|------|
| Legacy | `trained_model_checkpoints/step_N/` | ~28GB (14B model) |
| Single-Copy | `trained_model_checkpoints/step_N/` | ~28GB |
| LoRA | `trained_model_checkpoints/adapter_step_N/` | ~50MB |