# GRPO Example Trainer

This directory contains an example script (`grpo.py`) demonstrating how to integrate a custom training loop with the Atropos API for reinforcement learning using the GRPO (Group Relative Policy Optimization) algorithm.

## Training Modes

The trainer supports three weight synchronization modes:

| Mode | Description | Sync Latency | Best For |
|------|-------------|--------------|----------|
| **Legacy** (`none`) | Save checkpoints, restart vLLM | ~30-60 seconds | Simple setups, debugging |
| **Single-Copy** (`shared_vllm`) | Direct CUDA IPC - ONE model copy! | 0 ms | Production, memory efficiency |
| **LoRA** (`lora_only`) | Train adapters, hot-swap | ~1-5 seconds | Memory-constrained, fast iteration |

---

## Quick Start with GSM8k (Single-Copy Mode)

This is the **recommended** production setup for maximum training throughput and memory efficiency.

### Prerequisites

```bash
# Install dependencies
pip install -r example_trainer/requirements.txt

# Install GSM8k environment dependencies
pip install datasets latex2sympy2_extended math_verify
```

### Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                      SINGLE-COPY TRAINING ARCHITECTURE                      │
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────────────┐  │
│  │ GSM8k Env   │───▶│ Atropos API │◀───│ GRPO Trainer                    │  │
│  │ (problems)  │    │ (batching)  │    │ - Attached to vLLM's tensors    │  │
│  └─────────────┘    └─────────────┘    │ - optimizer.step() updates both │  │
│         │                              └─────────────────────────────────┘  │
│         │                                               │                   │
│         │                                               │ CUDA IPC          │
│         │                                               │ (same memory!)    │
│         ▼                                               ▼                   │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                     vLLM Inference Server (GPU 0)                     │  │
│  │ - Model weights in GPU memory                                         │  │
│  │ - Trainer sees same tensors via IPC                                   │  │
│  │ - Generates rollouts for scoring                                      │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
```

### How Single-Copy Mode Works

```
┌────────────────────────────────────────────────────────────┐
│                        SAME GPU(s)                         │
│                                                            │
│    ┌──────────────────────────────────────────────────┐    │
│    │               SHARED MODEL TENSORS               │    │
│    │          (only ONE copy in GPU memory!)          │    │
│    └──────────────────────────────────────────────────┘    │
│               ▲                             ▲              │
│               │ Reads/Writes                │ Reads        │
│      ┌────────┴───────┐            ┌────────┴───────┐      │
│      │    Trainer     │            │      vLLM      │      │
│      │  (gradients)   │            │  (inference)   │      │
│      └────────────────┘            └────────────────┘      │
│               │                                            │
│               │ optimizer.step()                           │
│               │ (updates shared tensors in-place)          │
│               ▼                                            │
│             vLLM immediately sees new weights!             │
└────────────────────────────────────────────────────────────┘
```

- **Memory**: 1x model size (truly shared via CUDA IPC!)
- **Sync Latency**: 0 ms (same memory, no copy needed)
- **Requirement**: Trainer and vLLM on SAME GPU(s)

---

### Step-by-Step Guide

**IMPORTANT: GPU Allocation**
- vLLM and Trainer run on the SAME GPU(s)
- Use `--tensor-parallel-size 1` for single-copy mode (TP>1 not yet supported)

---

#### Step 1: Kill Any Existing Processes

```bash
pkill -9 -u $USER -f "vllm|grpo|python|run-api" 2>/dev/null; sleep 3
```

**Warning:** with `-f`, the pattern is matched against the full command line, so `python` matches every Python process you own, not only the trainer. Narrow the pattern if you have other Python jobs running.

#### Step 2: Setup Directory

Change to your Atropos checkout (adjust the path to your own location) and remove stale state from earlier runs:

```bash
cd ~/atropos_stuff/atropos
rm -f vllm_bridge_config.json vllm.log trainer.log api.log gsm8k.log
```

#### Step 3: Set Environment Variables

```bash
export VLLM_ENABLE_SHARED_WEIGHTS=1
export VLLM_SKIP_WEIGHT_DAEMON=1
export NUM_INFERENCE_NODES=0
export LOGDIR=.
```
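
Because vLLM only exports IPC handles when these variables are visible to its process, a quick preflight check before Step 4 can save a restart. The following is a minimal sketch: the function name `check_single_copy_env` is hypothetical, and it only checks the three variables the configuration reference marks as required.

```bash
# Report any required single-copy variable that is not exported.
# check_single_copy_env is an illustrative helper, not part of the repository.
check_single_copy_env() {
  local missing=0 name
  for name in VLLM_ENABLE_SHARED_WEIGHTS VLLM_SKIP_WEIGHT_DAEMON NUM_INFERENCE_NODES; do
    # printenv only sees exported variables, which is what a child process inherits.
    if [ -z "$(printenv "$name")" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_single_copy_env && echo "environment OK"` in the same shell prints the confirmation only when all three variables are exported and non-empty.
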

#### Step 4: Start vLLM Server

```bash
CUDA_VISIBLE_DEVICES=0 python -u example_trainer/vllm_api_server.py \
    --model Qwen/Qwen2.5-14B-Instruct \
    --tensor-parallel-size 1 \
    --port 9001 \
    > vllm.log 2>&1 &
echo "vLLM starting on GPU 0..."
```

#### Step 5: Wait for vLLM to Load

```bash
tail -f vllm.log
```

Wait until you see: `Uvicorn running on http://0.0.0.0:9001`

Then press **Ctrl+C** to stop tailing.
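
For unattended runs, the same wait can be scripted instead of watched by hand. This is a sketch under stated assumptions: `wait_for_log_line` is a hypothetical helper, and the 300-second default timeout is an arbitrary choice, not a measured load time.

```bash
# Block until a fixed string appears in a log file, or give up after a timeout.
# Usage: wait_for_log_line FILE STRING [TIMEOUT_SECONDS]
wait_for_log_line() {
  local file="$1" pattern="$2" timeout="${3:-300}" waited=0
  # -q: quiet, -F: treat the pattern as a literal string rather than a regex
  until grep -qF -- "$pattern" "$file" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for: $pattern" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}
```

For example, `wait_for_log_line vllm.log "Uvicorn running on" 300` returns as soon as the server reports it is listening, and fails with a non-zero status if it never does.
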

#### Step 6: Verify IPC Handles Exported

```bash
grep -E "IPC|Exported|single_copy" vllm.log
```

You should see:
```
[vLLM Patch] Exported X IPC handles for single-copy mode
[vLLM Patch] ✓ Exported 339 params to vllm_bridge_config.json
```
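
In addition to grepping the log, you can confirm that the bridge file itself exists where the trainer will look for it. The sketch below assumes only what the configuration reference states, namely that `LOGDIR` is the directory for `vllm_bridge_config.json`; the helper name `check_bridge_config` is hypothetical, and the file's contents are not inspected.

```bash
# Verify that the bridge config exists and is non-empty in $LOGDIR
# (falls back to the current directory when LOGDIR is unset).
check_bridge_config() {
  local config="${LOGDIR:-.}/vllm_bridge_config.json"
  if [ -s "$config" ]; then
    echo "found: $config"
  else
    echo "missing or empty: $config" >&2
    return 1
  fi
}
```

If the check fails, the usual cause is that `VLLM_ENABLE_SHARED_WEIGHTS=1` was not set before vLLM was started.
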

#### Step 7: Start GSM8K Environment

```bash
python environments/gsm8k_server.py serve \
    --slurm False \
    --openai.model_name Qwen/Qwen2.5-14B-Instruct \
    --openai.base_url http://localhost:9001/v1 \
    --openai.server_type vllm \
    --openai.api_key x \
    --env.tokenizer_name Qwen/Qwen2.5-14B-Instruct \
    --env.use_wandb False \
    > gsm8k.log 2>&1 &
echo "GSM8K environment started"
sleep 10
```

#### Step 8: Start Trainer (Same GPU as vLLM!)

```bash
CUDA_VISIBLE_DEVICES=0 LOGDIR=. python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-14B-Instruct \
    --weight-bridge-mode shared_vllm \
    --training-steps 100 \
    2>&1 | tee trainer.log
```

#### Step 9: Monitor Training

```bash
tail -f trainer.log
```

You should see:
```
[Setup] ✓ Attached 195 tensors to vLLM's shared memory
[Setup] ✓ Single-copy mode active - using vLLM's tensors directly!
[2/2] Starting training for 100 steps
Step 1/100
[SINGLE-COPY] Weights updated in-place - step 1
```
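
The two `[Setup]` lines are the ones that confirm single-copy mode actually engaged. A small sketch of checking for them from a script follows; `single_copy_status` is a hypothetical helper, and it assumes the log wording shown above stays stable across versions.

```bash
# Confirm the single-copy marker is in the trainer log and report how many
# tensors were attached. Usage: single_copy_status [LOGFILE]
single_copy_status() {
  local log="${1:-trainer.log}" attached
  if ! grep -qF "Single-copy mode active" "$log" 2>/dev/null; then
    echo "single-copy marker not found in $log" >&2
    return 1
  fi
  # Pull the number out of "Attached N tensors"; empty if the line is absent.
  attached="$(grep -oE 'Attached [0-9]+ tensors' "$log" | head -n 1 | grep -oE '[0-9]+' || true)"
  echo "single-copy active, attached tensors: ${attached:-unknown}"
}
```
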

---

### Quick Copy-Paste (All-in-One)

```bash
# Kill everything and setup
pkill -9 -u $USER -f "vllm|grpo|python" 2>/dev/null; sleep 3
cd ~/atropos_stuff/atropos
rm -f vllm_bridge_config.json *.log

# Environment variables
export VLLM_ENABLE_SHARED_WEIGHTS=1 VLLM_SKIP_WEIGHT_DAEMON=1 NUM_INFERENCE_NODES=0 LOGDIR=.

# Start vLLM
CUDA_VISIBLE_DEVICES=0 python -u example_trainer/vllm_api_server.py \
    --model Qwen/Qwen2.5-14B-Instruct --tensor-parallel-size 1 --port 9001 > vllm.log 2>&1 &
echo "Waiting 90s for vLLM..."; sleep 90

# Start GSM8k environment
python environments/gsm8k_server.py serve --slurm False \
    --openai.model_name Qwen/Qwen2.5-14B-Instruct \
    --openai.base_url http://localhost:9001/v1 \
    --openai.server_type vllm --openai.api_key x \
    --env.tokenizer_name Qwen/Qwen2.5-14B-Instruct \
    --env.use_wandb False > gsm8k.log 2>&1 &
sleep 10

# Start trainer (same GPU!)
CUDA_VISIBLE_DEVICES=0 LOGDIR=. python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-14B-Instruct \
    --weight-bridge-mode shared_vllm \
    --training-steps 100 \
    2>&1 | tee trainer.log
```

---

## Alternative Modes

### Mode 1: Legacy (Checkpoint + Restart)

For simple setups or debugging. Saves checkpoints and restarts vLLM to load new weights.

```bash
python example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-3B-Instruct \
    --weight-bridge-mode none \
    --training-steps 100 \
    --vllm-restart-interval 10 \
    --batch-size 2 \
    --lr 1e-5
```
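
To get a feel for what the restarts cost, the command above can be turned into a rough estimate. This sketch assumes that `--vllm-restart-interval 10` means one vLLM restart every 10 training steps, and it uses the ~30-60 second sync latency quoted for legacy mode; both the helper name and that interpretation are assumptions, not measured behavior.

```bash
# Rough total restart time in seconds for a legacy-mode run.
# Usage: legacy_restart_overhead STEPS INTERVAL SECONDS_PER_RESTART
legacy_restart_overhead() {
  local steps="$1" interval="$2" secs="$3"
  echo $(( steps / interval * secs ))
}

legacy_restart_overhead 100 10 30   # prints 300 (lower bound)
legacy_restart_overhead 100 10 60   # prints 600 (upper bound)
```

Under these assumptions, a 100-step run spends roughly 5 to 10 minutes waiting on restarts, which is the overhead the single-copy and LoRA modes are meant to avoid.
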

### Mode 2: LoRA Adapters

Trains only adapter weights. Small checkpoints, lower memory.

```bash
python example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-3B-Instruct \
    --weight-bridge-mode lora_only \
    --lora-r 16 \
    --lora-alpha 32 \
    --training-steps 100 \
    --batch-size 2 \
    --lr 1e-4
```
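
The two LoRA flags are easier to choose with the underlying arithmetic in view. In the standard LoRA formulation, the adapter update is scaled by `alpha / r`, and adapting a `d_out x d_in` linear layer adds `r * (d_in + d_out)` trainable parameters. The sketch below works through both; whether `grpo.py` uses exactly this scaling is an assumption, the helper names are hypothetical, and the 4096 x 4096 layer is an illustrative size rather than a dimension of the Qwen models.

```bash
# Scaling factor applied to the adapter update in standard LoRA.
lora_scaling() {
  awk -v alpha="$1" -v r="$2" 'BEGIN { print alpha / r }'
}

# Trainable parameters added to one linear layer: r * (d_in + d_out).
lora_layer_params() {
  local r="$1" d_in="$2" d_out="$3"
  echo $(( r * (d_in + d_out) ))
}

lora_scaling 32 16             # prints 2
lora_layer_params 16 4096 4096 # prints 131072, versus 16777216 for the full layer
```
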

---

## Configuration Reference

### Environment Variables

| Variable | Required | Description | Example |
|----------|----------|-------------|---------|
| `VLLM_ENABLE_SHARED_WEIGHTS` | Yes (single-copy) | Enable vLLM patching for IPC | `1` |
| `VLLM_SKIP_WEIGHT_DAEMON` | Yes (single-copy) | Skip NCCL daemon (not needed) | `1` |
| `NUM_INFERENCE_NODES` | Yes | Number of vLLM nodes (0 = local) | `0` |
| `LOGDIR` | Recommended | Directory for vllm_bridge_config.json | `.` |
| `CUDA_VISIBLE_DEVICES` | Recommended | GPU allocation | `0` |

### Trainer CLI Options

| Option | Default | Description |
|--------|---------|-------------|
| `--model-name` | (required) | HuggingFace model ID |
| `--weight-bridge-mode` | `none` | `none`, `shared_vllm`, or `lora_only` |
| `--vllm-port` | `9001` | vLLM server port |
| `--training-steps` | `10` | Total optimization steps |
| `--batch-size` | `2` | Micro-batch size |
| `--lr` | `1e-5` | Learning rate |
| `--save-path` | `trained_model_checkpoints` | Checkpoint directory |

### vLLM Server Options

| Option | Description |
|--------|-------------|
| `--model` | HuggingFace model ID |
| `--tensor-parallel-size` | Number of GPUs (use 1 for single-copy) |
| `--port` | Server port (default: 9001) |
| `--dtype` | Model dtype (`bfloat16`, `float16`, `auto`) |

---

## FAQ & Troubleshooting

### Q: I get "Could not find vllm_bridge_config.json"

**A:** vLLM didn't export the IPC handles. Check:

1. `VLLM_ENABLE_SHARED_WEIGHTS=1` was set **before** starting vLLM
2. Look for export messages in vllm.log:
   ```bash
   grep "Exported" vllm.log
   ```

---

### Q: I get "CUDA out of memory" when starting the trainer

**A:** For single-copy mode, trainer and vLLM MUST be on the same GPU(s). Check:

```bash
# Both should use the same CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0 python ... vllm_api_server.py ...
CUDA_VISIBLE_DEVICES=0 python ... grpo.py ...
```

---

### Q: Trainer crashes with "Cannot copy out of meta tensor"

**A:** Some model buffers (like rotary embeddings) weren't initialized. This is a known issue being fixed. Update to the latest code.

---

### Q: Single-copy mode doesn't work with tensor-parallel > 1

**A:** Currently, single-copy mode only works with `--tensor-parallel-size 1`. If your model needs tensor parallelism to fit, switch to a smaller model that fits on a single GPU, or wait for multi-GPU single-copy support.

---

### Q: How do I check GPU memory usage?

**A:**
```bash
nvidia-smi
```

For single-copy mode with Qwen2.5-14B:
- GPU 0: ~28GB (shared between vLLM and trainer)
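
The ~28GB figure is what the weights alone come to at 2 bytes per parameter. The sketch below reproduces that arithmetic, taking the "14B" in the model name at face value and counting 1 GB as 10^9 bytes; the helper name is hypothetical, and activations, optimizer state, and vLLM's KV cache are not included, so the number `nvidia-smi` reports will be higher.

```bash
# Approximate size of the model weights in GB.
# Usage: weights_gb PARAMS_IN_BILLIONS BYTES_PER_PARAM
weights_gb() {
  echo $(( $1 * $2 ))
}

weights_gb 14 2   # 14B parameters in bfloat16/float16: prints 28
weights_gb 14 4   # the same model in float32: prints 56
```
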

---

### Q: How do I stop all processes?

**A:**
```bash
pkill -9 -u $USER -f "vllm|grpo|python|run-api"
```

---

## Files in This Directory

| File | Description |
|------|-------------|
| `grpo.py` | Main trainer script with all modes |
| `vllm_api_server.py` | Custom vLLM server with shared memory patches |
| `vllm_patching/` | vLLM patches for CUDA IPC support |
| `requirements.txt` | Python dependencies |
| `README.md` | This documentation |

### vllm_patching/ Directory

| File | Description |
|------|-------------|
| `__init__.py` | Module exports and patch application |
| `patched_gpu_runner.py` | Patches GPUModelRunner to export IPC handles |
| `distributed_utils.py` | Distributed training utilities |

---

## Performance Comparison

| Mode | Sync Latency | Memory (14B model) | Best For |
|------|--------------|--------------------|----------|
| **Legacy** | 30-60s | 2x model | Debugging |
| **Single-Copy** | 0 ms | 1x model (shared!) | Production |
| **LoRA** | 5-10s | 1x model + adapters | Memory-constrained |

---

## Checkpoint Locations

| Mode | Location | Size |
|------|----------|------|
| Legacy | `trained_model_checkpoints/step_N/` | ~28GB (14B model) |
| Single-Copy | `trained_model_checkpoints/step_N/` | ~28GB |
| LoRA | `trained_model_checkpoints/adapter_step_N/` | ~50MB |