# GRPO Example Trainer

This directory contains an example script (`grpo.py`) demonstrating how to integrate a custom training loop with the Atropos API for reinforcement learning using the GRPO (Group Relative Policy Optimization) algorithm.
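
For orientation, the core idea of GRPO is that each sampled completion is scored relative to the other completions for the same prompt. The sketch below shows that normalization in plain Python. The function name and the `eps` constant are illustrative choices, and `grpo.py` may compute its advantages differently.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one prompt's rewards against the group mean and std.

    `rewards` holds the scores of all completions sampled for a single prompt.
    `eps` (an illustrative choice) avoids division by zero when every
    completion receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four completions for one problem: two correct (1.0), two wrong (0.0).
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

With two correct and two incorrect completions, the correct ones receive an advantage close to +1 and the incorrect ones close to -1. A group where every completion scores the same yields all-zero advantages.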

## Training Modes

The trainer supports three weight synchronization modes:

| Mode | Description | Sync Latency | Best For |
|------|-------------|--------------|----------|
| **Legacy** (`none`) | Save checkpoints, restart vLLM | ~30-60 seconds | Simple setups, debugging |
| **Shared vLLM** (`shared_vllm`) | Direct shared memory updates via NCCL | ~0 ms | Production, maximum throughput |
| **LoRA** (`lora_only`) | Train adapters, hot-swap | ~1-5 seconds | Memory-constrained, fast iteration |

---

## Quick Start with GSM8k (Shared vLLM Mode)

This is the **recommended** production setup for maximum training throughput.

### Prerequisites

```bash
# Install dependencies
pip install -r example_trainer/requirements.txt

# Install GSM8k environment dependencies
pip install datasets latex2sympy2_extended math_verify
```

### Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    SHARED VLLM TRAINING ARCHITECTURE                        │
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────────────┐ │
│  │ GSM8k Env   │───▶│ Atropos API │◀───│ GRPO Trainer (GPU 2)            │ │
│  │ (problems)  │    │ (batching)  │    │ - Loads model for training      │ │
│  └─────────────┘    └─────────────┘    │ - Broadcasts weights via NCCL   │ │
│         │                              └─────────────────────────────────┘ │
│         │                                              │                    │
│         │                                              │ NCCL Broadcast     │
│         ▼                                              ▼                    │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │              vLLM Inference Server (GPUs 0-1)                        │   │
│  │         - Model weights in shared memory                             │   │
│  │         - Weight updater threads receive NCCL updates               │   │
│  │         - Generates rollouts for scoring                            │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Step-by-Step Guide (Tested & Working)

**IMPORTANT: GPU Allocation**
- vLLM runs on GPUs 0-1 (tensor-parallel)
- Trainer runs on GPU 2 (separate to avoid OOM)

---

#### Step 1: Kill Any Existing Processes

```bash
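# WARNING: the "python" pattern matches every Python process owned by $USER,
# not only the trainer stack. Narrow the pattern if you have other jobs running.
pkill -9 -u "$USER" -f "vllm|grpo|python|run-api" 2>/dev/null; sleep 3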
```

#### Step 2: Setup Directory

```bash
# Change to your Atropos checkout (adjust this path for your machine)
cd ~/atropos_stuff/atropos
rm -f vllm_bridge_config.json vllm.log trainer.log api.log gsm8k.log
```

#### Step 3: Set Environment Variables

```bash
export VLLM_ENABLE_SHARED_WEIGHTS=1
export NUM_INFERENCE_NODES=0
export MASTER_ADDR=localhost
export MASTER_PORT=29500
```

#### Step 4: Start Atropos API

```bash
python -m atroposlib.cli.run_api > api.log 2>&1 &
echo "Atropos API started"
sleep 3
```

#### Step 5: Start GSM8K Environment

```bash
python environments/gsm8k_server.py > gsm8k.log 2>&1 &
echo "GSM8K environment started"
sleep 3
```

#### Step 6: Start vLLM Server on GPUs 0-1

```bash
CUDA_VISIBLE_DEVICES=0,1 python -u example_trainer/vllm_api_server.py \
    --model Qwen/Qwen2.5-14B-Instruct \
    --tensor-parallel-size 2 \
    --port 9001 \
    --dtype bfloat16 \
    > vllm.log 2>&1 &
echo "vLLM starting on GPUs 0,1..."
```

#### Step 7: Wait for vLLM to Load

```bash
tail -f vllm.log
```

Wait until you see: `Uvicorn running on http://0.0.0.0:9001`

Then press **Ctrl+C** to stop tailing.
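
If you would rather script this wait than watch the log by hand, a small poller can block until the readiness line appears. This is a minimal sketch, not part of the repository: `wait_for_log_line` and its timeout defaults are hypothetical, and it relies only on the `Uvicorn running on` message quoted above.

```python
import time
from pathlib import Path


def wait_for_log_line(log_path, needle, timeout_s=600.0, poll_s=2.0):
    """Return True once `needle` appears in the log, False on timeout."""
    deadline = time.monotonic() + timeout_s
    path = Path(log_path)
    while True:
        # Re-read the whole file each time; startup logs are small enough.
        if path.exists() and needle in path.read_text(errors="replace"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Example call (blocks for up to 10 minutes):
#   wait_for_log_line("vllm.log", "Uvicorn running on")
```

The function checks the file at least once even with a zero timeout, so it can also serve as a one-shot readiness test.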

#### Step 8: Verify Shared Memory Setup

```bash
grep -E "thread|updater|Exported|Shared memory" vllm.log
```

You should see:
```
[vLLM Patch] ✓ Shared memory setup complete!
[vLLM Patch] ✓ Weight updater thread started (name: WeightUpdater_TP0)
[vLLM Patch] ✓ Weight updater thread started (name: WeightUpdater_TP1)
```
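
The same check can be automated. The sketch below assumes the log format shown above and one updater thread per tensor-parallel rank, named `WeightUpdater_TP<rank>`. The helper name `missing_updater_ranks` is hypothetical.

```python
import re

# Matches lines such as:
#   [vLLM Patch] ✓ Weight updater thread started (name: WeightUpdater_TP0)
UPDATER_RE = re.compile(
    r"Weight updater thread started \(name: WeightUpdater_TP(\d+)\)"
)


def missing_updater_ranks(log_text, tensor_parallel_size):
    """Return the TP ranks that have not logged a started weight updater."""
    started = {int(m.group(1)) for m in UPDATER_RE.finditer(log_text)}
    return sorted(set(range(tensor_parallel_size)) - started)


# Example call for the two-GPU setup in this guide:
#   with open("vllm.log") as f:
#       print(missing_updater_ranks(f.read(), tensor_parallel_size=2))
```

An empty list means every rank reported a running updater thread. Any rank in the list is one the trainer would wait on.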

#### Step 9: Start Trainer on GPU 2

```bash
CUDA_VISIBLE_DEVICES=2 python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-14B-Instruct \
    --weight-bridge-mode shared_vllm \
    --vllm-port 9001 \
    --lr 1e-6 \
    --batch-size 4 \
    --training-steps 100 \
    --use-shared-memory \
    2>&1 | tee trainer.log
```

#### Step 10: Monitor Training

```bash
tail -f trainer.log
```

You should see:
```
[Bridge] ✓ Gloo group created
[Bridge] ✓ NCCL group created
[Bridge] ✓ All ranks synchronized and ready
[Bridge] Mapped 195/339 params from vLLM to trainer
Step 1/100
```

---

### Quick Copy-Paste (All-in-One)

```bash
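# Kill everything and setup
# WARNING: the "python" pattern kills every Python process owned by $USER
pkill -9 -u "$USER" -f "vllm|grpo|python|run-api" 2>/dev/null; sleep 3
cd ~/atropos_stuff/atropos  # adjust to your checkout location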
rm -f vllm_bridge_config.json vllm.log trainer.log api.log gsm8k.log

# Environment variables
export VLLM_ENABLE_SHARED_WEIGHTS=1 NUM_INFERENCE_NODES=0 MASTER_ADDR=localhost MASTER_PORT=29500

# Start services
python -m atroposlib.cli.run_api > api.log 2>&1 &
sleep 3
python environments/gsm8k_server.py > gsm8k.log 2>&1 &
sleep 3
CUDA_VISIBLE_DEVICES=0,1 python -u example_trainer/vllm_api_server.py --model Qwen/Qwen2.5-14B-Instruct --tensor-parallel-size 2 --port 9001 --dtype bfloat16 > vllm.log 2>&1 &

echo "Waiting for vLLM to load... (check: tail -f vllm.log)"
echo "Once ready, run the trainer command below:"
echo ""
echo "CUDA_VISIBLE_DEVICES=2 python -u example_trainer/grpo.py --model-name Qwen/Qwen2.5-14B-Instruct --weight-bridge-mode shared_vllm --vllm-port 9001 --lr 1e-6 --batch-size 4 --training-steps 100 --use-shared-memory 2>&1 | tee trainer.log"
```

---

## How Shared vLLM Mode Works

### The Problem
Traditional RL training requires syncing model weights between the trainer and inference server. This is slow:
- Save checkpoint → Load into vLLM → Restart server = **30-60 seconds per sync**

### Two Solutions Available

#### Option 1: Broadcast Mode (`--use-shared-memory`)
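Keeps two copies of the model and syncs them with an NCCL broadcast after each optimizer step, with no checkpoint save or server restart. Use when the trainer runs on **different GPUs** from vLLM.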

```
Trainer (GPU 2)              NCCL               vLLM Workers (GPUs 0-1)
     │                         │                        │
     │ optimizer.step()        │                        │
     │ ─────────────────────────────────────────────►   │
     │   broadcast_weights()   │                        │ Thread receives
     │                         │                        │ weights via NCCL
     │                         │                        │ Copies to shared
     │                         │                        │ memory tensors
     │                         │                        │
     │ Next training step      │                        │ Ready for inference
```

- **Memory**: 2x model size (trainer copy + vLLM copy)
- **Sync Latency**: ~0ms (NCCL broadcast)
- **GPU Layout**: Trainer on different GPUs than vLLM

#### Option 2: Single-Copy Mode (`--single-copy`) ⭐ RECOMMENDED
True shared memory: only one copy of the model exists, and both processes use it. Use when the trainer runs on the **same GPUs** as vLLM.

```
┌────────────────────────────────────────────────────────────┐
│                    SAME GPU(s)                             │
│                                                            │
│     ┌──────────────────────────────────────────────────┐  │
│     │         SHARED MODEL TENSORS                      │  │
│     │      (only ONE copy in GPU memory!)               │  │
│     └──────────────────────────────────────────────────┘  │
│              ▲                           ▲                 │
│              │ Reads/Writes              │ Reads           │
│     ┌────────┴───────┐          ┌────────┴───────┐        │
│     │    Trainer     │          │     vLLM       │        │
│     │  (gradients)   │          │  (inference)   │        │
│     └────────────────┘          └────────────────┘        │
│              │                                             │
│              │ optimizer.step()                            │
│              │ (updates shared tensors in-place)           │
│              ▼                                             │
│     vLLM immediately sees new weights!                     │
└────────────────────────────────────────────────────────────┘
```

- **Memory**: 1x model size (truly shared via CUDA IPC!)
- **Sync Latency**: 0ms (same memory, no copy needed)
- **GPU Layout**: Trainer on SAME GPUs as vLLM (required!)
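
The property that makes single-copy mode work is that updates happen in place, so every handle to the memory observes them. The sketch below is only a CPU analogy in plain Python: two `memoryview` handles over one `bytearray` stand in for the trainer's and vLLM's views of the same tensors. It does not use CUDA IPC and is not how the repository implements sharing.

```python
# One buffer stands in for the GPU memory that holds the model weights.
buffer = bytearray(4 * 4)  # room for four float32 values

vllm_view = memoryview(buffer).cast("f")     # the inference side's handle
trainer_view = memoryview(buffer).cast("f")  # the trainer's handle, same memory

for i, w in enumerate((1.0, 2.0, 3.0, 4.0)):
    trainer_view[i] = w

# In-place SGD-style step: w <- w - lr * grad
lr, grads = 0.5, (2.0, 2.0, 2.0, 2.0)
for i, g in enumerate(grads):
    trainer_view[i] -= lr * g

print(vllm_view.tolist())  # the inference view reflects the update

# Counter-example: building a new list creates a separate copy and leaves
# the shared buffer untouched, so the inference side would not see it.
detached = [w * 10 for w in trainer_view]
```

The counter-example shows why the update must write into the existing storage: a step that produces a new object instead would leave the inference side holding stale weights.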

### When to Use Which

| Mode | Memory | Sync | Use When |
|------|--------|------|----------|
| **Broadcast** (`--use-shared-memory`) | 2x model | ~0ms NCCL | Trainer on different GPUs |
| **Single-Copy** (`--single-copy`) | 1x model | 0ms | Trainer on same GPUs, memory constrained |
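
To put rough numbers on the memory column, the arithmetic below estimates weight memory alone for both options. It assumes a nominal 14 billion parameters stored in bfloat16 (2 bytes each) and ignores gradients, optimizer state, activations, and vLLM's KV cache, so real usage will be higher.

```python
def weight_memory_gb(num_params, bytes_per_param=2, copies=1):
    """Memory for raw model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param * copies / 1e9


PARAMS_14B = 14e9  # nominal parameter count for a "14B" model (assumption)

broadcast_gb = weight_memory_gb(PARAMS_14B, copies=2)    # trainer + vLLM copies
single_copy_gb = weight_memory_gb(PARAMS_14B, copies=1)  # one shared copy

print(f"broadcast:   {broadcast_gb:.0f} GB of weights")
print(f"single-copy: {single_copy_gb:.0f} GB of weights")
```

Under these assumptions the broadcast layout holds about 56 GB of weights across all GPUs and the single-copy layout about 28 GB.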

### Single-Copy Mode Usage

```bash
# vLLM and Trainer on SAME GPUs (0,1)
CUDA_VISIBLE_DEVICES=0,1 python -u example_trainer/vllm_api_server.py \
    --model Qwen/Qwen2.5-14B-Instruct \
    --tensor-parallel-size 2 \
    --port 9001 \
    > vllm.log 2>&1 &

# Wait for vLLM to load...

# Trainer also on GPUs 0,1 - shares vLLM's tensors!
CUDA_VISIBLE_DEVICES=0,1 python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-14B-Instruct \
    --weight-bridge-mode shared_vllm \
    --single-copy \
    --training-steps 100 \
    2>&1 | tee trainer.log
```

---

## Alternative Modes

### Mode 1: Legacy (Checkpoint + Restart)

For simple setups or debugging. The trainer saves checkpoints and restarts vLLM to pick up the new weights; `--vllm-restart-interval` controls how often.

```bash
python example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-3B-Instruct \
    --weight-bridge-mode none \
    --training-steps 100 \
    --vllm-restart-interval 10 \
    --batch-size 2 \
    --lr 1e-5
```
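
To see what the restart interval costs, the arithmetic below combines the command above with the 30-60 second sync latency quoted for legacy mode. It assumes that `--vllm-restart-interval 10` means one restart every 10 training steps, which is a reading of the flag name rather than documented behavior.

```python
def restart_overhead_seconds(training_steps, restart_interval, sync_seconds):
    """Total time spent restarting vLLM, assuming one restart every
    `restart_interval` steps (an assumed reading of --vllm-restart-interval)."""
    restarts = training_steps // restart_interval
    return restarts * sync_seconds


low = restart_overhead_seconds(100, 10, 30)   # 30 s per restart
high = restart_overhead_seconds(100, 10, 60)  # 60 s per restart
print(f"{low / 60:.0f}-{high / 60:.0f} minutes of restart overhead")
```

For 100 steps this comes to roughly 5-10 minutes spent on restarts alone. That is acceptable for debugging but adds up over longer runs.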

### Mode 2: LoRA Adapters

Trains only adapter weights. Small checkpoints, lower memory.

```bash
python example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-3B-Instruct \
    --weight-bridge-mode lora_only \
    --lora-r 16 \
    --lora-alpha 32 \
    --training-steps 100 \
    --batch-size 2 \
    --lr 1e-4
```
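
The reason adapter checkpoints stay small follows from the LoRA parameter count. The sketch below uses the standard LoRA formulation, in which a frozen `d_in x d_out` weight gains two low-rank factors and the update is scaled by `alpha / r`. The 2048 x 2048 layer size is hypothetical, and the mapping of `--lora-r` and `--lora-alpha` onto these quantities is assumed from their names.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_in x d_out weight matrix.

    LoRA learns A (rank x d_in) and B (d_out x rank) and leaves W frozen.
    """
    return rank * (d_in + d_out)


def lora_scaling(alpha, rank):
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


# Hypothetical 2048 x 2048 projection with the flag values from the command above.
full = 2048 * 2048
adapter = lora_param_count(2048, 2048, rank=16)
print(f"full matrix: {full:,} params, adapter: {adapter:,} params")
print(f"scaling alpha/r = {lora_scaling(alpha=32, rank=16)}")
```

For this layer the adapter trains 65,536 parameters instead of 4,194,304, a 64x reduction, and the update is scaled by 2.0.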

---

## Configuration Reference

### Environment Variables

| Variable | Required | Description | Example |
|----------|----------|-------------|---------|
| `VLLM_ENABLE_SHARED_WEIGHTS` | Yes (shared mode) | Enable vLLM patching | `1` |
| `NUM_INFERENCE_NODES` | Yes | Number of vLLM nodes (0 = local) | `0` |
| `MASTER_ADDR` | Yes | Rendezvous address | `localhost` |
| `MASTER_PORT` | Yes | Rendezvous port | `29500` |
| `CUDA_VISIBLE_DEVICES` | Recommended | GPU allocation | `0,1` or `2` |
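
A missing variable usually shows up later as a hang or an empty bridge config, so it can help to check them before launching anything. The snippet below is a minimal preflight sketch. The helper name is hypothetical, and the variable list is taken from the rows marked as required in the table above.

```python
import os

# Variables the table above marks as required for shared vLLM mode.
REQUIRED_VARS = (
    "VLLM_ENABLE_SHARED_WEIGHTS",
    "NUM_INFERENCE_NODES",
    "MASTER_ADDR",
    "MASTER_PORT",
)


def missing_env_vars(env=None):
    """Return the required variables that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    print("Missing:", ", ".join(missing) if missing else "none")
```

Run it in the same shell you will start vLLM from, since the variables must be present in that process's environment.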

### Trainer CLI Options

| Option | Default | Description |
|--------|---------|-------------|
| `--model-name` | (required) | HuggingFace model ID |
| `--weight-bridge-mode` | `none` | `none`, `shared_vllm`, or `lora_only` |
| `--use-shared-memory` | `False` | Enable NCCL weight broadcasting |
| `--vllm-port` | `9001` | vLLM server port |
| `--training-steps` | `10` | Total optimization steps |
| `--batch-size` | `2` | Micro-batch size |
| `--lr` | `1e-5` | Learning rate |
| `--save-path` | `trained_model_checkpoints` | Checkpoint directory |

### vLLM Server Options

| Option | Description |
|--------|-------------|
| `--model` | HuggingFace model ID |
| `--tensor-parallel-size` | Number of GPUs for tensor parallelism |
| `--port` | Server port (default: 9001) |
| `--dtype` | Model dtype (`bfloat16`, `float16`, `auto`) |

---

## FAQ & Troubleshooting

### Q: The trainer is stuck at "Creating Gloo process group..."

**A:** This means the trainer is waiting for the vLLM weight updater threads to connect. Check if the threads started:

```bash
grep -E "thread|updater|ERROR" vllm.log
```

You should see:
```
[vLLM Patch] ✓ Weight updater thread started (name: WeightUpdater_TP0)
[vLLM Patch] ✓ Weight updater thread started (name: WeightUpdater_TP1)
```

If not, ensure `VLLM_ENABLE_SHARED_WEIGHTS=1` was set **before** starting vLLM.

---

### Q: I get "CUDA out of memory" when starting the trainer

**A:** The trainer is trying to load the model on the same GPUs as vLLM. Use separate GPUs:

```bash
# vLLM on GPUs 0-1
CUDA_VISIBLE_DEVICES=0,1 python -u example_trainer/vllm_api_server.py ...

# Trainer on GPU 2
CUDA_VISIBLE_DEVICES=2 python -u example_trainer/grpo.py ...
```

---

### Q: I see "daemonic processes are not allowed to have children"

**A:** This was a bug in older versions. The fix uses **threads** instead of **processes** for the weight updater. Make sure you have the latest `patched_gpu_runner.py`.

---

### Q: The `vllm_bridge_config.json` shows `param_mappings: {}`

**A:** The vLLM patches didn't run. Check:

1. `VLLM_ENABLE_SHARED_WEIGHTS=1` was set before starting vLLM
2. Look for `[vLLM Patch] ✓ Exported X params` in `vllm.log`

```bash
grep "Exported" vllm.log
```

---

### Q: How do I verify the NCCL connection is working?

**A:** Check the trainer log for these messages:

```
[Bridge] ✓ Gloo group created
[Bridge] ✓ NCCL group created
[Bridge] ✓ All ranks synchronized and ready
```

---

### Q: What's the difference between Gloo and NCCL?

**A:**
- **Gloo**: CPU-side collective-communication backend. Used here for coordination and synchronization barriers.
- **NCCL**: NVIDIA's collective-communication library for GPU tensors. Used here for broadcasting weight tensors.

Both are needed: Gloo for coordination, NCCL for fast tensor transfers.

---

### Q: How do I check GPU memory usage?

**A:**
```bash
nvidia-smi
```

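Expected for Qwen2.5-14B with shared mode:
- GPUs 0-1: ~168GB each (vLLM workers)
- GPU 2: ~29GB (trainer)

The vLLM figure mostly reflects memory that vLLM reserves up front for weights and KV cache, so it tracks the capacity of your GPUs rather than the model size alone. Expect a smaller number on GPUs with less memory.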
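
For a scriptable version of this check, `nvidia-smi` can emit per-GPU memory as CSV. The sketch below parses that output. The helper names are hypothetical, and the values are reported in MiB.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Map GPU index -> (used_mib, total_mib) from nvidia-smi CSV output."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        usage[int(index)] = (int(used), int(total))
    return usage


def read_gpu_memory():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, check=True, capture_output=True, text=True).stdout
    return parse_gpu_memory(out)


# Example call on a machine with NVIDIA GPUs:
#   for gpu, (used, total) in read_gpu_memory().items():
#       print(f"GPU {gpu}: {used} / {total} MiB")
```

Keeping the parser separate from the subprocess call makes it easy to test against saved output from another machine.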

---

### Q: How do I stop all processes?

**A:**
```bash
# WARNING: the "python" pattern also kills unrelated Python processes owned by $USER
pkill -9 -u "$USER" -f "vllm|grpo|python|run-api"
```

---

### Q: The training is slow / not progressing

**A:** Check if all services are running:

```bash
# pgrep -a prints the full command line and does not match itself
pgrep -u "$USER" -af "run_api|vllm|grpo|gsm8k"
```

Check logs for errors:
```bash
tail -20 api.log
tail -20 gsm8k.log
tail -20 vllm.log
tail -20 trainer.log
```
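
When the last 20 lines are not enough, scanning the full logs for common failure markers can narrow things down faster. This is a minimal sketch: the helper name and the marker strings are illustrative choices, not messages guaranteed to appear in these logs.

```python
from pathlib import Path

LOG_FILES = ("api.log", "gsm8k.log", "vllm.log", "trainer.log")
MARKERS = ("Traceback", "ERROR", "CUDA out of memory")  # illustrative markers


def find_error_lines(log_dir=".", log_files=LOG_FILES, markers=MARKERS):
    """Return {file name: [(line number, line), ...]} for lines with a marker."""
    hits = {}
    for name in log_files:
        path = Path(log_dir) / name
        if not path.exists():
            continue  # a missing log is skipped, not treated as an error
        lines = path.read_text(errors="replace").splitlines()
        matched = [
            (number, line)
            for number, line in enumerate(lines, start=1)
            if any(marker in line for marker in markers)
        ]
        if matched:
            hits[name] = matched
    return hits


if __name__ == "__main__":
    for name, matched in find_error_lines().items():
        for number, line in matched:
            print(f"{name}:{number}: {line}")
```

Files with no matching lines are left out of the result, so an empty dictionary means none of the markers were found.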

---

### Q: How do I use a smaller model for testing?

**A:** Use Qwen2.5-3B-Instruct with one GPU for vLLM and one for the trainer:

```bash
# vLLM on GPU 0
CUDA_VISIBLE_DEVICES=0 python -u example_trainer/vllm_api_server.py \
    --model Qwen/Qwen2.5-3B-Instruct \
    --port 9001 \
    > vllm.log 2>&1 &

# Trainer on GPU 1
CUDA_VISIBLE_DEVICES=1 python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-3B-Instruct \
    --weight-bridge-mode shared_vllm \
    --use-shared-memory \
    --training-steps 10 \
    2>&1 | tee trainer.log
```

---

## Files in This Directory

| File | Description |
|------|-------------|
| `grpo.py` | Main trainer script with all modes |
| `vllm_api_server.py` | Custom vLLM server with shared memory patches |
| `vllm_weight_bridge.py` | NCCL bridge for weight synchronization |
| `vllm_patching/` | vLLM patches for shared memory support |
| `requirements.txt` | Python dependencies |
| `README.md` | This documentation |

### vllm_patching/ Directory

| File | Description |
|------|-------------|
| `__init__.py` | Module exports |
| `patched_gpu_runner.py` | Patches GPUModelRunner for shared memory |
| `weight_updater.py` | Thread that receives NCCL weight broadcasts |
| `distributed_utils.py` | Process group initialization helpers |

---

## Performance Comparison

| Mode | Sync Latency | Memory (14B model) | Best For |
|------|--------------|-------------------|----------|
| **Legacy** | 30-60s | 2x model | Debugging |
| **Shared vLLM** | ~0ms | 2x model (broadcast) or 1x model (single-copy) | Production |
| **LoRA** | ~1-5s | 1x model + adapters | Memory-constrained |

---

## Checkpoint Locations

| Mode | Location | Size |
|------|----------|------|
| Legacy | `trained_model_checkpoints/step_N/` | ~28GB (14B model) |
| Shared vLLM | `trained_model_checkpoints/step_N/` | ~28GB |
| LoRA | `trained_model_checkpoints/adapter_step_N/` | ~50MB |

---

## Example Training Runs

### Quick Test (3B model, LoRA)
```bash
python example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-3B-Instruct \
    --weight-bridge-mode lora_only \
    --training-steps 5 \
    --batch-size 1
```

### Production (14B model, Shared vLLM)
```bash
# See Step-by-Step Guide above
CUDA_VISIBLE_DEVICES=2 python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-14B-Instruct \
    --weight-bridge-mode shared_vllm \
    --use-shared-memory \
    --training-steps 1000 \
    --batch-size 4 \
    --lr 1e-6
```

### Multi-GPU Training (72B model)
```bash
# vLLM on GPUs 0-3 (tensor parallel 4)
CUDA_VISIBLE_DEVICES=0,1,2,3 python -u example_trainer/vllm_api_server.py \
    --model Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 \
    --port 9001 \
    > vllm.log 2>&1 &

# Trainer on GPUs 4-5
CUDA_VISIBLE_DEVICES=4,5 python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-72B-Instruct \
    --weight-bridge-mode shared_vllm \
    --use-shared-memory \
    --training-steps 100 \
    2>&1 | tee trainer.log
```