# GRPO Example Trainer
This directory contains an example script (`grpo.py`) demonstrating how to integrate a custom training loop with the Atropos API for reinforcement learning using the GRPO (Group Relative Policy Optimization) algorithm.
## Training Modes
The trainer supports three weight synchronization modes:
| Mode | Description | Sync Latency | Best For |
|------|-------------|--------------|----------|
| **Legacy** (`none`) | Save checkpoints, restart vLLM | ~30-60 seconds | Simple setups, debugging |
| **Single-Copy** (`shared_vllm`) | Direct CUDA IPC - ONE model copy! | 0 ms | Production, memory efficiency |
| **LoRA** (`lora_only`) | Train adapters, hot-swap | ~1-5 seconds | Memory-constrained, fast iteration |
---
## Quick Start with GSM8k (Single-Copy Mode)
This is the **recommended** production setup for maximum training throughput and memory efficiency.
### Prerequisites
```bash
# Install dependencies
pip install -r example_trainer/requirements.txt
# Install GSM8k environment dependencies
pip install datasets latex2sympy2_extended math_verify
```
### Architecture Overview
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ SINGLE-COPY TRAINING ARCHITECTURE │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────────────┐ │
│ │ GSM8k Env │───▶│ Atropos API │◀───│ GRPO Trainer │ │
│ │ (problems) │ │ (batching) │ │ - Attached to vLLM's tensors │ │
│ └─────────────┘ └─────────────┘ │ - optimizer.step() updates both │ │
│ │ └─────────────────────────────────┘ │
│ │ │ │
│ │ │ CUDA IPC │
│ │ │ (same memory!) │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ vLLM Inference Server (GPU 0) │ │
│ │ - Model weights in GPU memory │ │
│ │ - Trainer sees same tensors via IPC │ │
│ │ - Generates rollouts for scoring │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
```
### How Single-Copy Mode Works
```
┌────────────────────────────────────────────────────────────┐
│ SAME GPU(s) │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ SHARED MODEL TENSORS │ │
│ │ (only ONE copy in GPU memory!) │ │
│ └──────────────────────────────────────────────────┘ │
│ ▲ ▲ │
│ │ Reads/Writes │ Reads │
│ ┌────────┴───────┐ ┌────────┴───────┐ │
│ │ Trainer │ │ vLLM │ │
│ │ (gradients) │ │ (inference) │ │
│ └────────────────┘ └────────────────┘ │
│ │ │
│ │ optimizer.step() │
│ │ (updates shared tensors in-place) │
│ ▼ │
│ vLLM immediately sees new weights! │
└────────────────────────────────────────────────────────────┘
```
- **Memory**: 1x model size (truly shared via CUDA IPC!)
- **Sync Latency**: 0ms (same memory, no copy needed)
- **Requirement**: Trainer and vLLM on SAME GPU(s)
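The effect is easy to picture in plain PyTorch. A minimal sketch, assuming two Python handles to the same CUDA tensor (which is what the IPC bridge provides across processes); the variable names are illustrative, not the real bridge API:
```python
import torch

# Stand-ins for the trainer's parameter and the tensor vLLM reads.
# In single-copy mode the real sharing happens across processes via CUDA IPC,
# but the semantics are the same: both names point at ONE storage.
shared_weight = torch.zeros(4, device="cuda")
trainer_param = shared_weight   # what optimizer.step() writes
vllm_view = shared_weight       # what inference reads

# An in-place update (what the optimizer does under the hood)...
trainer_param.add_(1.0)

# ...is immediately visible through the other handle: no copy, no sync step.
print(vllm_view)  # tensor([1., 1., 1., 1.], device='cuda:0')
```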
---
### Step-by-Step Guide
**IMPORTANT: GPU Allocation**
- vLLM and Trainer run on the SAME GPU(s)
- Use `tensor-parallel-size 1` for single-copy mode (TP>1 not yet supported)
---
#### Step 1: Kill Any Existing Processes
```bash
pkill -9 -u $USER -f "vllm|grpo|python|run-api" 2>/dev/null; sleep 3
```
#### Step 2: Setup Directory
```bash
cd ~/atropos_stuff/atropos
rm -f vllm_bridge_config.json vllm.log trainer.log api.log gsm8k.log
```
#### Step 3: Set Environment Variables
```bash
export VLLM_ENABLE_SHARED_WEIGHTS=1
export VLLM_SKIP_WEIGHT_DAEMON=1
export NUM_INFERENCE_NODES=0
export LOGDIR=.
```
#### Step 4: Start vLLM Server
```bash
CUDA_VISIBLE_DEVICES=0 python -u example_trainer/vllm_api_server.py \
--model Qwen/Qwen2.5-14B-Instruct \
--tensor-parallel-size 1 \
--port 9001 \
> vllm.log 2>&1 &
echo "vLLM starting on GPU 0..."
```
#### Step 5: Wait for vLLM to Load
```bash
tail -f vllm.log
```
Wait until you see: `Uvicorn running on http://0.0.0.0:9001`
Then press **Ctrl+C** to stop tailing.
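If you prefer not to watch the log by hand, you can poll the server until it responds. A small sketch, assuming the server exposes the usual OpenAI-compatible `/v1/models` endpoint on port 9001:
```bash
# Poll until vLLM answers (assumes the OpenAI-compatible /v1/models endpoint)
until curl -sf http://localhost:9001/v1/models > /dev/null; do
  echo "waiting for vLLM..."
  sleep 5
done
echo "vLLM is up"
```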
#### Step 6: Verify IPC Handles Exported
```bash
grep -E "IPC|Exported|single_copy" vllm.log
```
You should see:
```
[vLLM Patch] Exported X IPC handles for single-copy mode
[vLLM Patch] ✓ Exported 339 params to vllm_bridge_config.json
```
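You can also sanity-check the exported config directly. A quick check, assuming the file was written to the current directory (`LOGDIR=.`) and contains the fields described in the config reference below:
```bash
python -c "
import json
cfg = json.load(open('vllm_bridge_config.json'))
print('num_params:', cfg.get('num_params'))
print('ipc_handles:', len(cfg.get('ipc_handles', {})))
print('single_copy_enabled:', cfg.get('single_copy_enabled'))
"
```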
#### Step 7: Start GSM8K Environment
```bash
python environments/gsm8k_server.py serve \
--slurm False \
--openai.model_name Qwen/Qwen2.5-14B-Instruct \
--openai.base_url http://localhost:9001/v1 \
--openai.server_type vllm \
--openai.api_key x \
--env.tokenizer_name Qwen/Qwen2.5-14B-Instruct \
--env.use_wandb False \
> gsm8k.log 2>&1 &
echo "GSM8K environment started"
sleep 10
```
#### Step 8: Start Trainer (Same GPU as vLLM!)
```bash
CUDA_VISIBLE_DEVICES=0 LOGDIR=. python -u example_trainer/grpo.py \
--model-name Qwen/Qwen2.5-14B-Instruct \
--weight-bridge-mode shared_vllm \
--training-steps 100 \
2>&1 | tee trainer.log
```
#### Step 9: Monitor Training
```bash
tail -f trainer.log
```
You should see:
```
[Setup] ✓ Attached 195 tensors to vLLM's shared memory
[Setup] ✓ Single-copy mode active - using vLLM's tensors directly!
[2/2] Starting training for 100 steps
Step 1/100
[SINGLE-COPY] Weights updated in-place - step 1
```
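To track progress without scrolling, you can grep the log for the lines shown above:
```bash
# Count in-place weight updates completed so far
grep -c "SINGLE-COPY" trainer.log

# Follow only the per-step progress lines
tail -f trainer.log | grep --line-buffered "Step"
```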
---
### Quick Copy-Paste (All-in-One)
```bash
# Kill everything and setup
pkill -9 -u $USER -f "vllm|grpo|python" 2>/dev/null; sleep 3
cd ~/atropos_stuff/atropos
rm -f vllm_bridge_config.json *.log
# Environment variables
export VLLM_ENABLE_SHARED_WEIGHTS=1 VLLM_SKIP_WEIGHT_DAEMON=1 NUM_INFERENCE_NODES=0 LOGDIR=.
# Start vLLM
CUDA_VISIBLE_DEVICES=0 python -u example_trainer/vllm_api_server.py \
--model Qwen/Qwen2.5-14B-Instruct --tensor-parallel-size 1 --port 9001 > vllm.log 2>&1 &
echo "Waiting 90s for vLLM..."; sleep 90
# Start GSM8k environment
python environments/gsm8k_server.py serve --slurm False \
--openai.model_name Qwen/Qwen2.5-14B-Instruct \
--openai.base_url http://localhost:9001/v1 \
--openai.server_type vllm --openai.api_key x \
--env.tokenizer_name Qwen/Qwen2.5-14B-Instruct \
--env.use_wandb False > gsm8k.log 2>&1 &
sleep 10
# Start trainer (same GPU!)
CUDA_VISIBLE_DEVICES=0 LOGDIR=. python -u example_trainer/grpo.py \
--model-name Qwen/Qwen2.5-14B-Instruct \
--weight-bridge-mode shared_vllm \
--training-steps 100 \
2>&1 | tee trainer.log
```
---
## How Each Mode Works (Data Flow Diagrams)
### Single-Copy Mode (`--weight-bridge-mode shared_vllm`) ⭐ RECOMMENDED
**The Magic**: Trainer and vLLM share the EXACT SAME GPU memory via CUDA IPC.
```
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ SINGLE-COPY MODE - COMPLETE DATA FLOW │
│ │
│ STEP 1: GSM8k sends problem │
│ ┌──────────────────┐ │
│ │ GSM8k Server │──── "What is 15 × 7?" ────▶┌──────────────────┐ │
│ │ (Environment) │ │ Atropos API │ │
│ └──────────────────┘ │ (Batching) │ │
│ └────────┬─────────┘ │
│ │ │
│ STEP 2: Atropos forwards to vLLM │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────────┐ │
│ │ GPU MEMORY │ │
│ │ │ │
│ │ ┌────────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ MODEL WEIGHTS (ONE COPY - SHARED!) │ │ │
│ │ │ │ │ │
│ │ │ embed_tokens.weight, layers.*.qkv_proj, ..., lm_head.weight │ │ │
│ │ │ (address: 0x7f8a12340000) │ │ │
│ │ └────────────────────────────────────────────────────────────────────────┘ │ │
│ │ ▲ ▲ │ │
│ │ │ STEP 3: READ │ STEP 6: WRITE │ │
│ │ │ (generate tokens) │ (optimizer.step) │ │
│ │ ┌────────┴────────┐ ┌─────────┴─────────┐ │ │
│ │ │ vLLM Server │ │ Trainer │ │ │
│ │ │ │ │ (grpo.py) │ │ │
│ │ │ Generates: │ │ │ │ │
│ │ │ "15 × 7 = 105" │ │ STEP 5: Compute │ │ │
│ │ │ │ │ GRPO loss & │ │ │
│ │ └────────┬────────┘ │ gradients │ │ │
│ │ │ └─────────▲─────────┘ │ │
│ └───────────┼──────────────────────────────────────────────┼────────────────────┘ │
│ │ │ │
│ │ STEP 4: Return completion │ │
│ ▼ │ │
│ ┌──────────────────┐ │ │
│ │ GSM8k Server │───────────────────────────────────────┘ │
│ │ (Scoring) │ │
│ │ │ Scores: "15 × 7 = 105" ✓ reward=1.0 │
│ │ │ "15 × 7 = 100" ✗ reward=0.0 │
│ └──────────────────┘ │
│ │
│ STEP 7: IMMEDIATE UPDATE │
│ ┌─────────────────────────────────────────────────────────────────────────────┐ │
│ │ After optimizer.step(), vLLM's NEXT inference uses the NEW weights! │ │
│ │ NO SYNC NEEDED - it's the same memory! │ │
│ └─────────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────────┘
```
**Key Points:**
- ✅ ONE copy of weights in GPU memory
- ✅ 0ms sync latency (same memory!)
- ✅ Memory efficient (~1x model size)
- ⚠️ Requires same GPU for trainer and vLLM
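For intuition on steps 5-6, here is a minimal sketch of a group-relative advantage and policy-gradient loss, assuming one group of scored completions for the same prompt. It is illustrative only; the exact loss in `grpo.py` may differ (e.g. clipping or KL terms):
```python
import torch

def grpo_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative policy-gradient loss for one prompt group.

    logprobs: (G,) summed log-probs of each sampled completion under the policy
    rewards:  (G,) scalar rewards from the environment (e.g. 1.0 correct, 0.0 wrong)
    """
    # Advantage = reward normalized within the group (the "group-relative" part)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    # REINFORCE-style objective: raise log-probs of above-average completions
    return -(advantages.detach() * logprobs).mean()

# Example: four completions of "What is 15 × 7?", only one of them correct
logprobs = torch.tensor([-12.3, -10.1, -11.7, -13.0], requires_grad=True)
rewards = torch.tensor([0.0, 1.0, 0.0, 0.0])
grpo_loss(logprobs, rewards).backward()   # gradients flow into logprobs
```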
---
### LoRA Mode (`--weight-bridge-mode lora_only`)
**The Idea**: Freeze base model, only train small adapter layers. Hot-swap adapters into vLLM.
```
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ LORA MODE - COMPLETE DATA FLOW │
│ │
│ STEP 1: GSM8k sends problem │
│ ┌──────────────────┐ │
│ │ GSM8k Server │──── "What is 15 × 7?" ────▶┌──────────────────┐ │
│ │ (Environment) │ │ Atropos API │ │
│ └──────────────────┘ └────────┬─────────┘ │
│ │ │
│ STEP 2: Forward to vLLM ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────────┐ │
│ │ vLLM GPU MEMORY │ │
│ │ ┌────────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ BASE MODEL (frozen, ~6GB) │ │ │
│ │ │ + LORA ADAPTER A (current, ~50MB) │ │ │
│ │ └────────────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ │ STEP 3: Inference with base + adapter A │ │
│ │ ▼ │ │
│ │ ┌────────────────────┐ │ │
│ │ │ vLLM Server │ ──── "15 × 7 = 105" ────▶ │ │
│ │ └────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────────────┐ │
│ │ TRAINER GPU MEMORY (separate!) │ │
│ │ ┌────────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ BASE MODEL (frozen, ~6GB) │ │ │
│ │ │ + LORA ADAPTER B (training, ~50MB) ◀── gradients flow here only! │ │ │
│ │ └────────────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ │ STEP 4-5: Receive rollout, compute loss, update adapter B │ │
│ │ ▼ │ │
│ │ ┌────────────────────┐ │ │
│ │ │ Trainer │ │ │
│ │ │ (grpo.py) │ │ │
│ │ └────────┬───────────┘ │ │
│ └───────────┼──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ STEP 6: Every N steps, save adapter B to disk │
│ ▼ │
│ ┌──────────────────┐ STEP 7: POST /lora/load ┌──────────────────┐ │
│ │ adapter_step_N/ │ ─────────────────────────────────▶│ vLLM Server │ │
│ │ (50MB on disk) │ │ Swaps A → B │ │
│ └──────────────────┘ └──────────────────┘ │
│ │
│ STEP 8: Next inference uses NEW adapter B │
│ ┌─────────────────────────────────────────────────────────────────────────────┐ │
│ │ Sync latency: 1-5 seconds (save to disk + HTTP load) │ │
│ │ Memory: 2x base model + adapters │ │
│ └─────────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────────┘
```
**Key Points:**
- ✅ Small adapter files (~50MB vs ~28GB)
- ✅ Works on separate GPUs
- ✅ Easy to switch between adapters
- ⚠️ 1-5 second sync latency
- ⚠️ 2x base model memory (trainer + vLLM)
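A minimal sketch of the adapter setup on the trainer side, assuming the `peft` library and the `--lora-r` / `--lora-alpha` values used in the LoRA command further below; the actual target modules chosen by `grpo.py` may differ:
```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct", torch_dtype=torch.bfloat16
)

# Base weights stay frozen; only the low-rank adapter matrices get gradients
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # a fraction of a percent of the full model

# Every N steps: save only the adapter (tens of MB), then hot-swap it into vLLM
model.save_pretrained("trained_model_checkpoints/adapter_step_N")
```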
---
### Legacy Mode (`--weight-bridge-mode none`)
**The Simple Approach**: Save full checkpoints, restart vLLM to load new weights.
```
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ LEGACY MODE - COMPLETE DATA FLOW │
│ │
│ STEP 1: GSM8k sends problem │
│ ┌──────────────────┐ │
│ │ GSM8k Server │──── "What is 15 × 7?" ────▶┌──────────────────┐ │
│ │ (Environment) │ │ Atropos API │ │
│ └──────────────────┘ └────────┬─────────┘ │
│ │ │
│ STEP 2: Forward to vLLM ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────────┐ │
│ │ vLLM GPU MEMORY │ │
│ │ ┌────────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ FULL MODEL - Version 1 (~28GB) │ │ │
│ │ └────────────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ │ STEP 3: Inference │ │
│ │ ▼ │ │
│ │ ┌────────────────────┐ │ │
│ │ │ vLLM Server │ ──── "15 × 7 = 105" ────▶ │ │
│ │ └────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────────────┐ │
│ │ TRAINER GPU MEMORY (separate!) │ │
│ │ ┌────────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ FULL MODEL - Version 2 (~28GB + gradients + optimizer) │ │ │
│ │ └────────────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ │ STEP 4-5: Receive rollout, compute loss, update weights │ │
│ │ ▼ │ │
│ │ ┌────────────────────┐ │ │
│ │ │ Trainer │ │ │
│ │ │ (grpo.py) │ │ │
│ │ └────────┬───────────┘ │ │
│ └───────────┼──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ STEP 6: Every N steps, save FULL checkpoint to disk (~28GB) │
│ ▼ │
│ ┌──────────────────┐ │
│ │ checkpoint/ │ │
│ │ step_N/ │ (28GB on disk!) │
│ │ - model.safetensors │
│ │ - config.json │
│ └────────┬─────────┘ │
│ │ │
│ │ STEP 7: RESTART vLLM with new checkpoint │
│ │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ │ 1. Kill vLLM process │ │
│ │ │ 2. Start new vLLM with --model checkpoint/step_N/ │ │
│ │ │ 3. Wait for model to load (~30-60 seconds) │ │
│ │ │ 4. Resume training │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────────┐ │
│ │ vLLM GPU MEMORY (restarted) │ │
│ │ ┌────────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ FULL MODEL - Version 2 (loaded from checkpoint) │ │ │
│ │ └────────────────────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────────────┘ │
│ │
│ STEP 8: Next inference uses updated model │
│ ┌─────────────────────────────────────────────────────────────────────────────┐ │
│ │ Sync latency: 30-60 seconds (save + restart + reload) │ │
│ │ Memory: 2x full model │ │
│ │ Disk: 28GB per checkpoint │ │
│ └─────────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────────┘
```
**Key Points:**
- ✅ Simple to understand
- ✅ Works on any setup
- ✅ Good for debugging
- ⚠️ 30-60 second sync latency
- ⚠️ 2x GPU memory (trainer + vLLM)
- ⚠️ Large checkpoint files (~28GB each)
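A sketch of one legacy sync cycle, assuming the trainer has just written a `step_N` checkpoint (flag names mirror the commands elsewhere in this README; the trainer automates this when `--vllm-restart-interval` is set):
```bash
CKPT=trained_model_checkpoints/step_N   # checkpoint written by the trainer

# 1. Stop the running vLLM server
pkill -f vllm_api_server.py; sleep 5

# 2. Restart vLLM pointing at the new checkpoint
CUDA_VISIBLE_DEVICES=0 python -u example_trainer/vllm_api_server.py \
  --model "$CKPT" --tensor-parallel-size 1 --port 9001 > vllm.log 2>&1 &

# 3. Wait for the model to load (~30-60 seconds), then rollouts resume
```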
---
## Mode Comparison Summary
| | Single-Copy | LoRA | Legacy |
|---|---|---|---|
| **Sync Latency** | 0 ms ⚡ | 1-5 sec | 30-60 sec |
| **GPU Memory** | 1x model | 2x model | 2x model |
| **Disk Space** | 28GB/ckpt | 50MB/adapter | 28GB/ckpt |
| **Complexity** | Medium | Medium | Simple |
| **Same GPU?** | Required ⚠️ | Optional | Optional |
| **Best For** | Production | Experiments | Debugging |
---
## Alternative Mode Commands
### Legacy Mode (Checkpoint + Restart)
For simple setups or debugging. Saves checkpoints and restarts vLLM to load new weights.
```bash
python example_trainer/grpo.py \
--model-name Qwen/Qwen2.5-3B-Instruct \
--weight-bridge-mode none \
--training-steps 100 \
--vllm-restart-interval 10 \
--batch-size 2 \
--lr 1e-5
```
### LoRA Mode (Adapter Training)
Trains only adapter weights. Small checkpoints, lower memory.
```bash
python example_trainer/grpo.py \
--model-name Qwen/Qwen2.5-3B-Instruct \
--weight-bridge-mode lora_only \
--lora-r 16 \
--lora-alpha 32 \
--training-steps 100 \
--batch-size 2 \
--lr 1e-4
```
---
## Configuration Reference
### Environment Variables
| Variable | Required | Description | Example |
|----------|----------|-------------|---------|
| `VLLM_ENABLE_SHARED_WEIGHTS` | Yes (single-copy) | Enable vLLM patching for IPC | `1` |
| `VLLM_SKIP_WEIGHT_DAEMON` | Yes (single-copy) | Skip NCCL daemon (not needed) | `1` |
| `NUM_INFERENCE_NODES` | Yes | Number of vLLM nodes (0 = local) | `0` |
| `LOGDIR` | Recommended | Directory for vllm_bridge_config.json | `.` |
| `CUDA_VISIBLE_DEVICES` | Recommended | GPU allocation | `0` |
### Trainer CLI Options
| Option | Default | Description |
|--------|---------|-------------|
| `--model-name` | (required) | HuggingFace model ID |
| `--weight-bridge-mode` | `none` | `none`, `shared_vllm`, or `lora_only` |
| `--single-copy` | `false` | Enable TRUE single-copy mode via CUDA IPC |
| `--vllm-config-path` | (auto-detect) | Explicit path to `vllm_bridge_config.json` |
| `--vllm-port` | `9001` | vLLM server port |
| `--training-steps` | `10` | Total optimization steps |
| `--batch-size` | `2` | Micro-batch size |
| `--lr` | `1e-5` | Learning rate |
| `--save-path` | `trained_model_checkpoints` | Checkpoint directory |
### vLLM Server Options
| Option | Description |
|--------|-------------|
| `--model` | HuggingFace model ID |
| `--tensor-parallel-size` | Number of GPUs (use 1 for single-copy) |
| `--port` | Server port (default: 9001) |
| `--dtype` | Model dtype (`bfloat16`, `float16`, `auto`) |
| `--gpu-memory-utilization` | Fraction of GPU memory vLLM may use for weights and KV cache (default: 0.9) |
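For example, a start command combining the options above (the dtype and memory values here are illustrative):
```bash
CUDA_VISIBLE_DEVICES=0 python -u example_trainer/vllm_api_server.py \
  --model Qwen/Qwen2.5-14B-Instruct \
  --tensor-parallel-size 1 \
  --port 9001 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.85 \
  > vllm.log 2>&1 &
```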
---
## The vLLM Bridge Config (vllm_bridge_config.json)
The `vllm_bridge_config.json` file is the critical communication mechanism between the vLLM inference server and the GRPO trainer in single-copy mode. Understanding this file is essential for debugging and advanced configurations.
### What It Is
When you start vLLM with `VLLM_ENABLE_SHARED_WEIGHTS=1`, the patched `GPUModelRunner` exports CUDA IPC (Inter-Process Communication) handles for all model tensors. These handles allow another process (the trainer) to access the exact same GPU memory—no copying required.
### Why It's Important
1. **True Single-Copy Architecture**: Instead of loading the model twice (once for training, once for inference), both processes share the same tensors in GPU memory.
2. **Zero-Latency Weight Updates**: When `optimizer.step()` modifies the weights, vLLM immediately sees the changes—no serialization, no network transfer, no disk I/O.
3. **Memory Efficiency**: For a 7B model (~14GB in bf16), you save ~14GB of GPU memory compared to having two separate copies.
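As a quick sanity check on these numbers (a rough estimate that ignores the KV cache, activations, and optimizer state):
```python
params = 7e9           # 7B parameters
bytes_per_param = 2    # bf16 / fp16

print(f"~{params * bytes_per_param / 1e9:.0f} GB per copy")  # ~14 GB
# Two separate copies (trainer + vLLM) need ~2x this; single-copy needs ~1x.
```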
### File Location
The trainer searches for `vllm_bridge_config.json` in this order:
1. **Explicit path** (if `--vllm-config-path` is provided)
2. **`$LOGDIR/vllm_bridge_config.json`** (if `LOGDIR` env var is set)
3. **`./vllm_bridge_config.json`** (current directory)
4. **`/tmp/atropos_bridge/vllm_bridge_config.json`** (default fallback)
**Tip**: To avoid "Config not found" errors, always set `LOGDIR`:
```bash
export LOGDIR=.
```
### File Contents
The JSON file contains everything needed to reconstruct tensor references in another process:
```json
{
"model": "Qwen/Qwen2.5-3B-Instruct",
"tp_degree": 1,
"dp_shard_degree": 1,
"param_names": [
"model.embed_tokens.weight",
"model.layers.0.self_attn.qkv_proj.weight",
...
],
"param_mappings": {
"model.embed_tokens.weight": {
"vllm_name": "model.embed_tokens.weight",
"shape": [152064, 2048],
"dtype": "torch.bfloat16",
"device": "cuda:0"
},
...
},
"ipc_handles": {
"model.embed_tokens.weight": {
"device_index": 0,
"ipc_handle_b64": "AmPA0pN...",
"storage_size": 623902720,
"storage_offset": 0,
"ref_counter_handle_b64": "Y2JY...",
"ref_counter_offset": 0,
"event_handle_b64": "wRIs...",
"event_sync_required": true,
"shape": [152064, 2048],
"dtype": "torch.bfloat16"
},
...
},
"shared_weights_enabled": true,
"single_copy_enabled": true,
"num_params": 255
}
```
#### Field Descriptions
| Field | Description |
|-------|-------------|
| `model` | HuggingFace model identifier |
| `tp_degree` | Tensor parallel degree (must be 1 for single-copy) |
| `param_names` | List of all parameter names in the model |
| `param_mappings` | Shape, dtype, and device info for each parameter |
| `ipc_handles` | CUDA IPC handles for reconstructing shared tensors |
| `ipc_handle_b64` | The actual CUDA IPC handle (base64-encoded bytes) |
| `ref_counter_handle_b64` | Reference counter for CUDA memory (base64) |
| `event_handle_b64` | CUDA event handle for synchronization (base64) |
| `storage_size` | Size of the underlying storage in bytes |
### How the Trainer Uses It
1. **Load Config**: Trainer reads `vllm_bridge_config.json`
2. **Create Shell Model**: Uses `AutoModelForCausalLM.from_config()` with meta tensors (no memory allocation)
3. **Attach IPC Handles**: For each parameter, reconstructs the tensor using `torch.UntypedStorage._new_shared_cuda()` with the IPC handles
4. **Verify Shapes**: Ensures trainer's model architecture matches vLLM's sharding
```python
# Simplified version of what happens internally:
for name, ipc_info in config["ipc_handles"].items():
    # Decode IPC handle from base64
    ipc_handle = base64.b64decode(ipc_info["ipc_handle_b64"])
    # Reconstruct the shared CUDA storage from the IPC handle
    storage = torch.UntypedStorage._new_shared_cuda(
        device_index, ipc_handle, storage_size, ...
    )
    # Wrap the shared storage in a tensor with the right dtype and shape
    tensor = torch.empty(0, dtype=dtype, device=f"cuda:{device_index}")
    tensor.set_(storage, 0, shape)
    # Replace the model parameter with the shared tensor
    model.get_parameter(name).data = tensor
```
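The shell-model step (2 above) can be pictured like this: build the module tree without allocating real weights, then let the IPC-attached tensors take the place of the parameters. A sketch assuming the `transformers` meta-device path; the exact helper used by `grpo.py` may differ:
```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# Build the architecture on the meta device: correct shapes/dtypes, no GPU memory
with torch.device("meta"):
    shell = AutoModelForCausalLM.from_config(cfg)

# Each parameter is then swapped for a tensor reconstructed from the IPC handles
# (the loop sketched above), after which the shell shares vLLM's GPU memory.
```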
### Specifying the Config Path Explicitly
If auto-detection isn't working (e.g., in complex cluster setups), you can specify the path explicitly:
```bash
# If vLLM writes config to a non-standard location:
python -u example_trainer/grpo.py \
--model-name Qwen/Qwen2.5-3B-Instruct \
--weight-bridge-mode shared_vllm \
--single-copy \
--vllm-config-path /shared/nfs/vllm_bridge_config.json \
--training-steps 50
```
### Common Issues
| Symptom | Cause | Fix |
|---------|-------|-----|
| "Could not find vllm_bridge_config.json" | vLLM didn't export config | Check `VLLM_ENABLE_SHARED_WEIGHTS=1` was set BEFORE starting vLLM |
| Config exists but has empty `ipc_handles` | Patch didn't run | Ensure vLLM is using our custom `vllm_api_server.py` |
| "tuple of 8 items expected" | IPC handle format mismatch | Update to latest code (handles all 8 CUDA IPC tuple components) |
| "size mismatch" errors | Tensor parallel mismatch | Use `tensor-parallel-size 1` for single-copy mode |
---
## FAQ & Troubleshooting
### Q: I get "Could not find vllm_bridge_config.json"
**A:** vLLM didn't export the IPC handles. Check:
1. `VLLM_ENABLE_SHARED_WEIGHTS=1` was set **before** starting vLLM
2. `LOGDIR` is set to a valid, writable directory
3. Look for export messages in vllm.log:
```bash
grep "Exported" vllm.log
```
If the file exists but in a different location, specify it explicitly:
```bash
python grpo.py ... --vllm-config-path /path/to/vllm_bridge_config.json
```
---
### Q: I get "CUDA out of memory" when starting the trainer
**A:** For single-copy mode, trainer and vLLM MUST be on the same GPU(s). Check:
```bash
# Both should use the same CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0 python ... vllm_api_server.py ...
CUDA_VISIBLE_DEVICES=0 python ... grpo.py ...
```
---
### Q: Trainer crashes with "Cannot copy out of meta tensor"
**A:** Some model buffers (like rotary embeddings) weren't initialized. This is a known issue being fixed. Update to the latest code.
---
### Q: Single-copy mode doesn't work with tensor-parallel > 1
**A:** Currently, single-copy mode only works with `tensor-parallel-size 1`. For larger models that need tensor parallelism, use a single GPU with a smaller model, or wait for multi-GPU single-copy support.
---
### Q: How do I check GPU memory usage?
**A:**
```bash
nvidia-smi
```
For single-copy mode with Qwen2.5-14B:
- GPU 0: ~28GB (shared between vLLM and trainer)
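For a compact view of just memory usage per GPU:
```bash
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
```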
---
### Q: How do I stop all processes?
**A:**
```bash
pkill -9 -u $USER -f "vllm|grpo|python|run-api"
```
---
## Files in This Directory
| File | Description |
|------|-------------|
| `grpo.py` | Main trainer script with all modes |
| `vllm_api_server.py` | Custom vLLM server with shared memory patches |
| `vllm_patching/` | vLLM patches for CUDA IPC support |
| `requirements.txt` | Python dependencies |
| `README.md` | This documentation |
### vllm_patching/ Directory
| File | Description |
|------|-------------|
| `__init__.py` | Module exports and patch application |
| `patched_gpu_runner.py` | Patches GPUModelRunner to export CUDA IPC handles |
---
## Performance Comparison
| Mode | Sync Latency | Memory (14B model) | Best For |
|------|--------------|-------------------|----------|
| **Legacy** | 30-60s | 2x model | Debugging |
| **Single-Copy** | 0ms | 1x model (shared!) | Production |
| **LoRA** | 1-5s | 2x model + adapters | Memory-constrained |
---
## Checkpoint Locations
| Mode | Location | Size |
|------|----------|------|
| Legacy | `trained_model_checkpoints/step_N/` | ~28GB (14B model) |
| Single-Copy | `trained_model_checkpoints/step_N/` | ~28GB |
| LoRA | `trained_model_checkpoints/adapter_step_N/` | ~50MB |