# GRPO Example Trainer

This directory contains an example script (`grpo.py`) demonstrating how to integrate a custom training loop with the Atropos API for reinforcement learning using the GRPO (Group Relative Policy Optimization) algorithm.
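
A quick refresher on the algorithm: GRPO scores a *group* of rollouts per prompt and uses each rollout's deviation from the group's statistics as its advantage, so no learned value function (critic) is needed. A minimal sketch of that core computation (illustrative only; the exact loss details live in `grpo.py`):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for one prompt's rollout group.

    rewards: shape (group_size,), one scalar reward per rollout.
    Each rollout is judged against its own group's mean and std,
    which is what makes GRPO critic-free.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 rollouts of one GSM8k problem, reward 1.0 if the answer is correct.
print(grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0])))
# -> positive advantages for correct rollouts, negative for incorrect ones
```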

## Training Modes

The trainer supports three weight synchronization modes:

| Mode | Description | Sync Latency | Best For |
|---|---|---|---|
| Legacy (`none`) | Save checkpoints, restart vLLM | ~30-60 seconds | Simple setups, debugging |
| Shared vLLM (`shared_vllm`) | Direct shared-memory updates via NCCL | ~0 ms | Production, maximum throughput |
| LoRA (`lora_only`) | Train adapters, hot-swap | ~1-5 seconds | Memory-constrained, fast iteration |

## Quick Start with GSM8k (Shared vLLM Mode)

This is the recommended production setup for maximum training throughput.

### Prerequisites

```bash
# Install dependencies
pip install -r example_trainer/requirements.txt

# Install GSM8k environment dependencies
pip install datasets latex2sympy2_extended math_verify
```

### Architecture Overview

```
┌────────────────────────────────────────────────────────────────────────────┐
│                     SHARED VLLM TRAINING ARCHITECTURE                      │
│                                                                            │
│ ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────────────┐  │
│ │  GSM8k Env  │───▶│ Atropos API │◀───│ GRPO Trainer (GPU 2)            │  │
│ │ (problems)  │    │ (batching)  │    │ - Loads model for training      │  │
│ └─────────────┘    └─────────────┘    │ - Broadcasts weights via NCCL   │  │
│                           │           └─────────────────────────────────┘  │
│                           │                            │                   │
│                           │                            │ NCCL Broadcast    │
│                           ▼                            ▼                   │
│ ┌─────────────────────────────────────────────────────────────────────┐    │
│ │                  vLLM Inference Server (GPUs 0-1)                   │    │
│ │ - Model weights in shared memory                                    │    │
│ │ - Weight updater threads receive NCCL updates                       │    │
│ │ - Generates rollouts for scoring                                    │    │
│ └─────────────────────────────────────────────────────────────────────┘    │
└────────────────────────────────────────────────────────────────────────────┘
```
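
The trainer side of this loop is just an HTTP client of the Atropos API: it registers itself, then repeatedly polls for batches of scored rollouts produced by the environment. A rough sketch of that polling loop (hypothetical client code; the endpoint name `/batch`, the port, and the payload shape are assumptions, not a verified `atroposlib` API):

```python
import time
import requests

API_URL = "http://localhost:8000"  # assumed Atropos API address

def get_batch(max_wait_s: float = 300.0) -> list:
    """Poll the Atropos API until a full batch of scored rollouts is ready."""
    deadline = time.time() + max_wait_s
    while time.time() < deadline:
        resp = requests.get(f"{API_URL}/batch", timeout=10)
        resp.raise_for_status()
        batch = resp.json().get("batch")
        if batch is not None:   # stays None until enough rollouts are scored
            return batch
        time.sleep(1.0)
    raise TimeoutError("no batch from Atropos API")

# Each batch carries token sequences plus per-rollout scores, which the
# trainer converts into group-relative advantages (see the GRPO sketch above).
```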

### Step-by-Step Guide (Tested & Working)

**Important: GPU allocation**

- vLLM runs on GPUs 0-1 (tensor-parallel)
- Trainer runs on GPU 2 (separate to avoid OOM)

#### Step 1: Kill Any Existing Processes

```bash
# Caution: the "python" pattern kills ALL of your Python processes, not just these services
pkill -9 -u $USER -f "vllm|grpo|python|run-api" 2>/dev/null; sleep 3
```

#### Step 2: Setup Directory

```bash
cd ~/atropos_stuff/atropos
rm -f vllm_bridge_config.json vllm.log trainer.log api.log gsm8k.log
```

#### Step 3: Set Environment Variables

```bash
export VLLM_ENABLE_SHARED_WEIGHTS=1
export NUM_INFERENCE_NODES=0
export MASTER_ADDR=localhost
export MASTER_PORT=29500
```

#### Step 4: Start Atropos API

```bash
python -m atroposlib.cli.run_api > api.log 2>&1 &
echo "Atropos API started"
sleep 3
```

#### Step 5: Start GSM8K Environment

```bash
python environments/gsm8k_server.py > gsm8k.log 2>&1 &
echo "GSM8K environment started"
sleep 3
```

#### Step 6: Start vLLM Server on GPUs 0-1

```bash
CUDA_VISIBLE_DEVICES=0,1 python -u example_trainer/vllm_api_server.py \
    --model Qwen/Qwen2.5-14B-Instruct \
    --tensor-parallel-size 2 \
    --port 9001 \
    --dtype bfloat16 \
    > vllm.log 2>&1 &
echo "vLLM starting on GPUs 0,1..."
```

#### Step 7: Wait for vLLM to Load

```bash
tail -f vllm.log
```

Wait until you see `Uvicorn running on http://0.0.0.0:9001`, then press Ctrl+C to stop tailing.

#### Step 8: Verify Shared Memory Setup

```bash
grep -E "thread|updater|Exported|Shared memory" vllm.log
```

You should see:

```
[vLLM Patch] ✓ Shared memory setup complete!
[vLLM Patch] ✓ Weight updater thread started (name: WeightUpdater_TP0)
[vLLM Patch] ✓ Weight updater thread started (name: WeightUpdater_TP1)
```

#### Step 9: Start Trainer on GPU 2

```bash
CUDA_VISIBLE_DEVICES=2 python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-14B-Instruct \
    --weight-bridge-mode shared_vllm \
    --vllm-port 9001 \
    --lr 1e-6 \
    --batch-size 4 \
    --training-steps 100 \
    --use-shared-memory \
    2>&1 | tee trainer.log
```

#### Step 10: Monitor Training

```bash
tail -f trainer.log
```

You should see:

```
[Bridge] ✓ Gloo group created
[Bridge] ✓ NCCL group created
[Bridge] ✓ All ranks synchronized and ready
[Bridge] Mapped 195/339 params from vLLM to trainer
Step 1/100
```

### Quick Copy-Paste (All-in-One)

```bash
# Kill everything and setup
pkill -9 -u $USER -f "vllm|grpo|python|run-api" 2>/dev/null; sleep 3
cd ~/atropos_stuff/atropos
rm -f vllm_bridge_config.json vllm.log trainer.log api.log gsm8k.log

# Environment variables
export VLLM_ENABLE_SHARED_WEIGHTS=1 NUM_INFERENCE_NODES=0 MASTER_ADDR=localhost MASTER_PORT=29500

# Start services
python -m atroposlib.cli.run_api > api.log 2>&1 &
sleep 3
python environments/gsm8k_server.py > gsm8k.log 2>&1 &
sleep 3
CUDA_VISIBLE_DEVICES=0,1 python -u example_trainer/vllm_api_server.py --model Qwen/Qwen2.5-14B-Instruct --tensor-parallel-size 2 --port 9001 --dtype bfloat16 > vllm.log 2>&1 &

echo "Waiting for vLLM to load... (check: tail -f vllm.log)"
echo "Once ready, run the trainer command below:"
echo ""
echo "CUDA_VISIBLE_DEVICES=2 python -u example_trainer/grpo.py --model-name Qwen/Qwen2.5-14B-Instruct --weight-bridge-mode shared_vllm --vllm-port 9001 --lr 1e-6 --batch-size 4 --training-steps 100 --use-shared-memory 2>&1 | tee trainer.log"
```

## How Shared vLLM Mode Works

### The Problem

Traditional RL training requires syncing model weights between the trainer and the inference server, and the naive approach is slow:

- Save checkpoint → load into vLLM → restart server = 30-60 seconds per sync
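
For concreteness, a hedged sketch of what the legacy path amounts to; `grpo.py`'s actual restart logic may differ, and the server command here just mirrors this README's examples:

```python
import subprocess
import time

def legacy_sync(model, tokenizer, step: int,
                vllm_proc: subprocess.Popen) -> subprocess.Popen:
    """Checkpoint-and-restart weight sync: the slow path."""
    ckpt = f"trained_model_checkpoints/step_{step}"
    model.save_pretrained(ckpt)       # write the full weights to disk
    tokenizer.save_pretrained(ckpt)

    vllm_proc.terminate()             # stop the old server...
    vllm_proc.wait()
    new_proc = subprocess.Popen(      # ...and reload everything from disk
        ["python", "-u", "example_trainer/vllm_api_server.py",
         "--model", ckpt, "--port", "9001"]
    )
    time.sleep(30)                    # crude stand-in for "wait until ready"
    return new_proc
```

Every sync pays for a full serialize-to-disk plus a full model load, which is where the 30-60 seconds go.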

### Two Solutions Available

#### Option 1: Broadcast Mode (`--use-shared-memory`)

Two copies of the model, but instant NCCL sync. Use this when the trainer is on different GPUs than vLLM; a minimal broadcast sketch follows the diagram below.

```
Trainer (GPU 2)                  NCCL                  vLLM Workers (GPUs 0-1)
   │                              │                              │
   │ optimizer.step()             │                              │
   │ ─────────────────────────────────────────────────────────► │
   │ broadcast_weights()          │                              │  Thread receives
   │                              │                              │  weights via NCCL
   │                              │                              │  Copies to shared
   │                              │                              │  memory tensors
   │                              │                              │
   │ Next training step           │                              │  Ready for inference
```
- Memory: 2x model size (trainer copy + vLLM copy)
- Sync Latency: ~0ms (NCCL broadcast)
- GPU Layout: Trainer on different GPUs than vLLM
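
A minimal sketch of the sender side, assuming a `torch.distributed` NCCL group where the trainer is rank 0 and each vLLM updater thread is a receiving rank (illustrative; not `vllm_weight_bridge.py`'s exact API):

```python
import torch
import torch.distributed as dist

def broadcast_weights(model: torch.nn.Module, nccl_group) -> None:
    """Trainer (rank 0): push updated weights to all vLLM workers.

    Every rank in the group must call dist.broadcast on the same tensors
    in the same order; receivers overwrite their copies with rank 0's data.
    """
    with torch.no_grad():
        for _, param in sorted(model.named_parameters(), key=lambda kv: kv[0]):
            dist.broadcast(param.data, src=0, group=nccl_group)

# The vLLM updater threads run the mirror-image loop: they broadcast into
# tensors that alias the model's shared memory, so inference immediately
# sees the new weights.
```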

#### Option 2: Single-Copy Mode (`--single-copy`) ⭐ RECOMMENDED

True shared memory: only ONE copy of the model. Use this when the trainer is on the same GPUs as vLLM; a sketch of the in-place rebinding follows the diagram below.

```
┌────────────────────────────────────────────────────────────┐
│                        SAME GPU(s)                         │
│                                                            │
│      ┌──────────────────────────────────────────────┐      │
│      │             SHARED MODEL TENSORS             │      │
│      │        (only ONE copy in GPU memory!)        │      │
│      └──────────────────────────────────────────────┘      │
│               ▲                         ▲                  │
│               │ Reads/Writes            │ Reads            │
│      ┌────────┴───────┐        ┌────────┴───────┐          │
│      │    Trainer     │        │      vLLM      │          │
│      │  (gradients)   │        │  (inference)   │          │
│      └────────────────┘        └────────────────┘          │
│               │                                            │
│               │ optimizer.step()                           │
│               │ (updates shared tensors in-place)          │
│               ▼                                            │
│          vLLM immediately sees new weights!                │
└────────────────────────────────────────────────────────────┘
```
- Memory: 1x model size (truly shared via CUDA IPC!)
- Sync Latency: 0ms (same memory, no copy needed)
- GPU Layout: Trainer on SAME GPUs as vLLM (required!)
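
A hedged sketch of the core idea: the trainer's parameters are rebound to tensors that alias vLLM's weight storage (obtained via CUDA IPC in the real bridge), so `optimizer.step()` mutates the very memory vLLM reads from. The tensor handoff below is illustrative only:

```python
import torch

def rebind_to_shared(model: torch.nn.Module,
                     shared: dict[str, torch.Tensor]) -> None:
    """Point the trainer's parameters at tensors vLLM already owns.

    `shared` maps parameter names to CUDA tensors aliasing vLLM's weights
    (in the real bridge these would arrive via CUDA IPC handles). After
    rebinding, optimizer.step() writes into vLLM's memory in place:
    no second copy, no broadcast.
    """
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in shared:
                assert shared[name].shape == param.shape
                param.data = shared[name]   # alias, not a copy
```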

### When to Use Which

| Mode | Memory | Sync | Use When |
|---|---|---|---|
| Broadcast (`--use-shared-memory`) | 2x model | ~0ms (NCCL) | Trainer on different GPUs |
| Single-Copy (`--single-copy`) | 1x model | 0ms | Trainer on same GPUs, memory constrained |

### Single-Copy Mode Usage

```bash
# vLLM and Trainer on SAME GPUs (0,1)
CUDA_VISIBLE_DEVICES=0,1 python -u example_trainer/vllm_api_server.py \
    --model Qwen/Qwen2.5-14B-Instruct \
    --tensor-parallel-size 2 \
    --port 9001 \
    > vllm.log 2>&1 &

# Wait for vLLM to load...

# Trainer also on GPUs 0,1 - shares vLLM's tensors!
CUDA_VISIBLE_DEVICES=0,1 python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-14B-Instruct \
    --weight-bridge-mode shared_vllm \
    --single-copy \
    --training-steps 100 \
    2>&1 | tee trainer.log
```

## Alternative Modes

### Mode 1: Legacy (Checkpoint + Restart)

For simple setups or debugging. Saves checkpoints and can restart vLLM.

```bash
python example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-3B-Instruct \
    --weight-bridge-mode none \
    --training-steps 100 \
    --vllm-restart-interval 10 \
    --batch-size 2 \
    --lr 1e-5
```

### Mode 2: LoRA Adapters

Trains only adapter weights. Small checkpoints, lower memory.

```bash
python example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-3B-Instruct \
    --weight-bridge-mode lora_only \
    --lora-r 16 \
    --lora-alpha 32 \
    --training-steps 100 \
    --batch-size 2 \
    --lr 1e-4
```
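
Under the hood, `--lora-r` and `--lora-alpha` correspond to the standard PEFT adapter configuration. A minimal sketch of what those flags plausibly map to, assuming `grpo.py` uses the `peft` library (the target modules below are typical for Qwen-style models and are an assumption, not read from the script):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

config = LoraConfig(
    r=16,            # --lora-r: adapter rank
    lora_alpha=32,   # --lora-alpha: scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the small adapter matrices train
```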

## Configuration Reference

### Environment Variables

| Variable | Required | Description | Example |
|---|---|---|---|
| `VLLM_ENABLE_SHARED_WEIGHTS` | Yes (shared mode) | Enable vLLM patching | `1` |
| `NUM_INFERENCE_NODES` | Yes | Number of vLLM nodes (0 = local) | `0` |
| `MASTER_ADDR` | Yes | Rendezvous address | `localhost` |
| `MASTER_PORT` | Yes | Rendezvous port | `29500` |
| `CUDA_VISIBLE_DEVICES` | Recommended | GPU allocation | `0,1` or `2` |

### Trainer CLI Options

| Option | Default | Description |
|---|---|---|
| `--model-name` | (required) | HuggingFace model ID |
| `--weight-bridge-mode` | `none` | `none`, `shared_vllm`, or `lora_only` |
| `--use-shared-memory` | `False` | Enable NCCL weight broadcasting |
| `--vllm-port` | `9001` | vLLM server port |
| `--training-steps` | `10` | Total optimization steps |
| `--batch-size` | `2` | Micro-batch size |
| `--lr` | `1e-5` | Learning rate |
| `--save-path` | `trained_model_checkpoints` | Checkpoint directory |

### vLLM Server Options

| Option | Description |
|---|---|
| `--model` | HuggingFace model ID |
| `--tensor-parallel-size` | Number of GPUs for tensor parallelism |
| `--port` | Server port (default: 9001) |
| `--dtype` | Model dtype (`bfloat16`, `float16`, `auto`) |

## FAQ & Troubleshooting

**Q: The trainer is stuck at "Creating Gloo process group..."**

A: The trainer is waiting for the vLLM weight updater threads to connect. Check whether the threads started:

```bash
grep -E "thread|updater|ERROR" vllm.log
```

You should see:

```
[vLLM Patch] ✓ Weight updater thread started (name: WeightUpdater_TP0)
[vLLM Patch] ✓ Weight updater thread started (name: WeightUpdater_TP1)
```

If not, ensure `VLLM_ENABLE_SHARED_WEIGHTS=1` was set before starting vLLM.

**Q: I get "CUDA out of memory" when starting the trainer**

A: The trainer is trying to load the model on the same GPUs as vLLM. Use separate GPUs:

```bash
# vLLM on GPUs 0-1
CUDA_VISIBLE_DEVICES=0,1 python -u example_trainer/vllm_api_server.py ...

# Trainer on GPU 2
CUDA_VISIBLE_DEVICES=2 python -u example_trainer/grpo.py ...
```

**Q: I see "daemonic processes are not allowed to have children"**

A: This was a bug in older versions. The fix uses threads instead of processes for the weight updater. Make sure you have the latest `patched_gpu_runner.py`.
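
For context, the thread-based pattern looks roughly like this (illustrative; `weight_updater.py`'s actual loop and arguments will differ). A daemon *thread* works where a child *process* would not, because vLLM's workers may themselves be daemonic processes, which Python forbids from forking children:

```python
import threading
from typing import List

import torch
import torch.distributed as dist

def start_updater(params: List[torch.Tensor], group, tp_rank: int) -> threading.Thread:
    """Spawn a daemon thread that receives weight broadcasts forever,
    writing each one into the shared tensors in place."""

    def loop() -> None:
        while True:
            for p in params:  # must match the sender's tensor order exactly
                dist.broadcast(p, src=0, group=group)

    t = threading.Thread(target=loop, name=f"WeightUpdater_TP{tp_rank}", daemon=True)
    t.start()
    return t
```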

**Q: vllm_bridge_config.json shows `param_mappings: {}`**

A: The vLLM patches didn't run. Check that:

- `VLLM_ENABLE_SHARED_WEIGHTS=1` was set before starting vLLM
- vllm.log contains `[vLLM Patch] ✓ Exported X params`:

```bash
grep "Exported" vllm.log
```

**Q: How do I verify the NCCL connection is working?**

A: Check the trainer log for these messages:

```
[Bridge] ✓ Gloo group created
[Bridge] ✓ NCCL group created
[Bridge] ✓ All ranks synchronized and ready
```

**Q: What's the difference between Gloo and NCCL?**

A:

- Gloo: CPU-based backend, used for coordination and synchronization barriers.
- NCCL: GPU-based high-speed backend, used for broadcasting weight tensors.

Both are needed: Gloo for coordination, NCCL for fast tensor transfers. A minimal setup sketch follows.
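
A minimal sketch of setting up both backends with `torch.distributed` (the bridge's actual group construction lives in `distributed_utils.py` and may differ):

```python
import os
import torch
import torch.distributed as dist

# Rendezvous settings, matching the env vars this README exports.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
torch.cuda.set_device(rank % torch.cuda.device_count())

# Default group on Gloo: cheap CPU-side coordination (barriers, handshakes).
dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

# Separate NCCL group for the heavy lifting: GPU-to-GPU tensor transfers.
nccl_group = dist.new_group(backend="nccl")

dist.barrier()                    # Gloo: "all ranks synchronized and ready"
weights = torch.zeros(1024, device="cuda")
dist.broadcast(weights, src=0, group=nccl_group)  # NCCL: fast weight transfer
```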

**Q: How do I check GPU memory usage?**

A:

```bash
nvidia-smi
```

Expected for Qwen2.5-14B with shared mode:

- GPUs 0-1: ~168GB each (vLLM workers)
- GPU 2: ~29GB (trainer)

**Q: How do I stop all processes?**

A:

```bash
pkill -9 -u $USER -f "vllm|grpo|python|run-api"
```

**Q: Training is slow / not progressing**

A: Check that all services are running:

```bash
ps aux | grep -E "(run_api|vllm|grpo|gsm8k)" | grep $USER
```

Then check the logs for errors:

```bash
tail -20 api.log
tail -20 gsm8k.log
tail -20 vllm.log
tail -20 trainer.log
```

**Q: How do I use a smaller model for testing?**

A: Use Qwen2.5-3B-Instruct on a single GPU per service:

```bash
# vLLM on GPU 0
CUDA_VISIBLE_DEVICES=0 python -u example_trainer/vllm_api_server.py \
    --model Qwen/Qwen2.5-3B-Instruct \
    --port 9001 \
    > vllm.log 2>&1 &

# Trainer on GPU 1
CUDA_VISIBLE_DEVICES=1 python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-3B-Instruct \
    --weight-bridge-mode shared_vllm \
    --use-shared-memory \
    --training-steps 10 \
    2>&1 | tee trainer.log
```

## Files in This Directory

| File | Description |
|---|---|
| `grpo.py` | Main trainer script with all modes |
| `vllm_api_server.py` | Custom vLLM server with shared memory patches |
| `vllm_weight_bridge.py` | NCCL bridge for weight synchronization |
| `vllm_patching/` | vLLM patches for shared memory support |
| `requirements.txt` | Python dependencies |
| `README.md` | This documentation |

### vllm_patching/ Directory

| File | Description |
|---|---|
| `__init__.py` | Module exports |
| `patched_gpu_runner.py` | Patches GPUModelRunner for shared memory |
| `weight_updater.py` | Thread that receives NCCL weight broadcasts |
| `distributed_utils.py` | Process group initialization helpers |

## Performance Comparison

| Mode | Sync Latency | Memory (14B model) | Best For |
|---|---|---|---|
| Legacy | 30-60s | 2x model | Debugging |
| Shared vLLM | ~0ms | 1x model (shared) + trainer | Production |
| LoRA | 5-10s | 1x model + adapters | Memory-constrained |

## Checkpoint Locations

| Mode | Location | Size |
|---|---|---|
| Legacy | `trained_model_checkpoints/step_N/` | ~28GB (14B model) |
| Shared vLLM | `trained_model_checkpoints/step_N/` | ~28GB |
| LoRA | `trained_model_checkpoints/adapter_step_N/` | ~50MB |

## Example Training Runs

### Quick Test (3B model, LoRA)

```bash
python example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-3B-Instruct \
    --weight-bridge-mode lora_only \
    --training-steps 5 \
    --batch-size 1
```

### Production (14B model, Shared vLLM)

```bash
# See Step-by-Step Guide above for the full multi-service setup
CUDA_VISIBLE_DEVICES=2 python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-14B-Instruct \
    --weight-bridge-mode shared_vllm \
    --use-shared-memory \
    --training-steps 1000 \
    --batch-size 4 \
    --lr 1e-6
```

### Multi-GPU Training (72B model)

```bash
# vLLM on GPUs 0-3 (tensor parallel 4)
CUDA_VISIBLE_DEVICES=0,1,2,3 python -u example_trainer/vllm_api_server.py \
    --model Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 \
    --port 9001 \
    > vllm.log 2>&1 &

# Trainer on GPUs 4-5
CUDA_VISIBLE_DEVICES=4,5 python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-72B-Instruct \
    --weight-bridge-mode shared_vllm \
    --use-shared-memory \
    --training-steps 100 \
    2>&1 | tee trainer.log
```