mirror of https://github.com/NousResearch/atropos.git synced 2026-04-19 12:57:58 +00:00

Jai Suphavadeeprasit 3de03d6db3 single copy

2026-02-13 11:26:25 -05:00

GRPO Example Trainer

This directory contains an example script (grpo.py) demonstrating how to integrate a custom training loop with the Atropos API for reinforcement learning using the GRPO (Group Relative Policy Optimization) algorithm.
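For orientation, the "group relative" part of GRPO can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in grpo.py: the function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` value are all choices made here for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout relative to the other rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one GSM8k problem, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct rollouts get a positive advantage and incorrect ones a negative advantage; a group where every rollout scores the same contributes no learning signal.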

Training Modes

The trainer supports three weight synchronization modes:

Mode	Description	Sync Latency	Best For
Legacy (`none`)	Save checkpoints, restart vLLM	~30-60 seconds	Simple setups, debugging
Shared vLLM (`shared_vllm`)	Direct shared memory updates via NCCL	~0 ms	Production, maximum throughput
LoRA (`lora_only`)	Train adapters, hot-swap	~1-5 seconds	Memory-constrained, fast iteration

Quick Start with GSM8k (Shared vLLM Mode)

This is the recommended production setup for maximum training throughput.

Prerequisites

# Install dependencies
pip install -r example_trainer/requirements.txt

# Install GSM8k environment dependencies
pip install datasets latex2sympy2_extended math_verify

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                    SHARED VLLM TRAINING ARCHITECTURE                        │
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────────────┐ │
│  │ GSM8k Env   │───▶│ Atropos API │◀───│ GRPO Trainer (GPU 2)            │ │
│  │ (problems)  │    │ (batching)  │    │ - Loads model for training      │ │
│  └─────────────┘    └─────────────┘    │ - Broadcasts weights via NCCL   │ │
│         │                              └─────────────────────────────────┘ │
│         │                                              │                    │
│         │                                              │ NCCL Broadcast     │
│         ▼                                              ▼                    │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │              vLLM Inference Server (GPUs 0-1)                        │   │
│  │         - Model weights in shared memory                             │   │
│  │         - Weight updater threads receive NCCL updates               │   │
│  │         - Generates rollouts for scoring                            │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘

Step-by-Step Guide (Tested & Working)

IMPORTANT: GPU Allocation

vLLM runs on GPUs 0-1 (tensor-parallel)
Trainer runs on GPU 2 (separate to avoid OOM)

Step 1: Kill Any Existing Processes

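# Warning: the "python" pattern kills every Python process owned by $USER,
# including notebooks and unrelated jobs. Narrow the pattern on a shared machine.
pkill -9 -u $USER -f "vllm|grpo|python|run-api" 2>/dev/null; sleep 3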

Step 2: Setup Directory

cd ~/atropos_stuff/atropos
rm -f vllm_bridge_config.json vllm.log trainer.log api.log gsm8k.log

Step 3: Set Environment Variables

export VLLM_ENABLE_SHARED_WEIGHTS=1
export NUM_INFERENCE_NODES=0
export MASTER_ADDR=localhost
export MASTER_PORT=29500

Step 4: Start Atropos API

python -m atroposlib.cli.run_api > api.log 2>&1 &
echo "Atropos API started"
sleep 3

Step 5: Start GSM8K Environment

python environments/gsm8k_server.py > gsm8k.log 2>&1 &
echo "GSM8K environment started"
sleep 3

Step 6: Start vLLM Server on GPUs 0-1

CUDA_VISIBLE_DEVICES=0,1 python -u example_trainer/vllm_api_server.py \
    --model Qwen/Qwen2.5-14B-Instruct \
    --tensor-parallel-size 2 \
    --port 9001 \
    --dtype bfloat16 \
    > vllm.log 2>&1 &
echo "vLLM starting on GPUs 0,1..."

Step 7: Wait for vLLM to Load

tail -f vllm.log

Wait until you see: Uvicorn running on http://0.0.0.0:9001

Then press Ctrl+C to stop tailing.
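If you would rather script this wait than watch the log by hand, a small polling helper is enough. The sketch below is an illustration only: `wait_for_log_line` is a hypothetical helper, not part of this repository, and the timeout and poll interval are arbitrary defaults.

```python
import time
from pathlib import Path


def wait_for_log_line(log_path, marker, timeout_s=600.0, poll_s=2.0):
    """Return True once `marker` appears in the log, False if the timeout expires."""
    deadline = time.monotonic() + timeout_s
    path = Path(log_path)
    while True:
        # Re-read the whole file each time; startup logs are small enough for this.
        if path.exists() and marker in path.read_text(errors="replace"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

Calling `wait_for_log_line("vllm.log", "Uvicorn running on")` blocks until the readiness line shown above is written, or returns False after ten minutes.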

Step 8: Verify Shared Memory Setup

grep -E "thread|updater|Exported|Shared memory" vllm.log

You should see:

[vLLM Patch] ✓ Shared memory setup complete!
[vLLM Patch] ✓ Weight updater thread started (name: WeightUpdater_TP0)
[vLLM Patch] ✓ Weight updater thread started (name: WeightUpdater_TP1)
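The same check can be automated. The sketch below assumes, based only on the two log lines above, that each tensor-parallel rank logs one thread named `WeightUpdater_TP<rank>`; the helper name `missing_updater_ranks` is hypothetical.

```python
import re

UPDATER_RE = re.compile(
    r"Weight updater thread started \(name: WeightUpdater_TP(\d+)\)"
)


def missing_updater_ranks(log_text, tensor_parallel_size):
    """Return the TP ranks that never logged a started weight-updater thread."""
    started = {int(m.group(1)) for m in UPDATER_RE.finditer(log_text)}
    return sorted(set(range(tensor_parallel_size)) - started)


sample = """\
[vLLM Patch] ✓ Shared memory setup complete!
[vLLM Patch] ✓ Weight updater thread started (name: WeightUpdater_TP0)
[vLLM Patch] ✓ Weight updater thread started (name: WeightUpdater_TP1)
"""
print(missing_updater_ranks(sample, tensor_parallel_size=2))  # → []
```

A non-empty result names the ranks to look for in vllm.log before starting the trainer.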

Step 9: Start Trainer on GPU 2

CUDA_VISIBLE_DEVICES=2 python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-14B-Instruct \
    --weight-bridge-mode shared_vllm \
    --vllm-port 9001 \
    --lr 1e-6 \
    --batch-size 4 \
    --training-steps 100 \
    --use-shared-memory \
    2>&1 | tee trainer.log

Step 10: Monitor Training

tail -f trainer.log

You should see:

[Bridge] ✓ Gloo group created
[Bridge] ✓ NCCL group created
[Bridge] ✓ All ranks synchronized and ready
[Bridge] Mapped 195/339 params from vLLM to trainer
Step 1/100

Quick Copy-Paste (All-in-One)

# Kill everything and setup
pkill -9 -u $USER -f "vllm|grpo|python|run-api" 2>/dev/null; sleep 3
cd ~/atropos_stuff/atropos
rm -f vllm_bridge_config.json vllm.log trainer.log api.log gsm8k.log

# Environment variables
export VLLM_ENABLE_SHARED_WEIGHTS=1 NUM_INFERENCE_NODES=0 MASTER_ADDR=localhost MASTER_PORT=29500

# Start services
python -m atroposlib.cli.run_api > api.log 2>&1 &
sleep 3
python environments/gsm8k_server.py > gsm8k.log 2>&1 &
sleep 3
CUDA_VISIBLE_DEVICES=0,1 python -u example_trainer/vllm_api_server.py --model Qwen/Qwen2.5-14B-Instruct --tensor-parallel-size 2 --port 9001 --dtype bfloat16 > vllm.log 2>&1 &

echo "Waiting for vLLM to load... (check: tail -f vllm.log)"
echo "Once ready, run the trainer command below:"
echo ""
echo "CUDA_VISIBLE_DEVICES=2 python -u example_trainer/grpo.py --model-name Qwen/Qwen2.5-14B-Instruct --weight-bridge-mode shared_vllm --vllm-port 9001 --lr 1e-6 --batch-size 4 --training-steps 100 --use-shared-memory 2>&1 | tee trainer.log"

How Shared vLLM Mode Works

The Problem

Traditional RL training requires syncing model weights between the trainer and inference server. This is slow:

Save checkpoint → Load into vLLM → Restart server = 30-60 seconds per sync

Two Solutions Available

Option 1: Broadcast Mode (`--use-shared-memory`)

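The trainer and vLLM each hold their own copy of the model, and the trainer pushes updated weights to vLLM with an NCCL broadcast after each optimizer step. Use this when the trainer runs on different GPUs than vLLM.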

Trainer (GPU 2)              NCCL               vLLM Workers (GPUs 0-1)
     │                         │                        │
     │ optimizer.step()        │                        │
     │ ─────────────────────────────────────────────►   │
     │   broadcast_weights()   │                        │ Thread receives
     │                         │                        │ weights via NCCL
     │                         │                        │ Copies to shared
     │                         │                        │ memory tensors
     │                         │                        │
     │ Next training step      │                        │ Ready for inference

Memory: 2x model size (trainer copy + vLLM copy)
Sync Latency: ~0ms (NCCL broadcast)
GPU Layout: Trainer on different GPUs than vLLM
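The flow in the diagram can be imitated on the CPU to make the roles concrete. This is a simulation, not the bridge implementation: Python queues stand in for NCCL, lists stand in for GPU tensors, and `weight_updater` and `broadcast_weights` are names chosen for this sketch.

```python
import queue
import threading

# Stand-ins for the per-rank weight tensors of two tensor-parallel vLLM workers.
worker_weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
channels = [queue.Queue() for _ in worker_weights]  # plays the role of NCCL


def weight_updater(rank):
    """Receive each broadcast and copy it into the worker's existing buffer."""
    while True:
        new_weights = channels[rank].get()
        if new_weights is None:  # shutdown sentinel
            break
        worker_weights[rank][:] = new_weights  # in-place copy, same list object
        channels[rank].task_done()


def broadcast_weights(weights):
    """Send the trainer's weights to every worker and wait until all applied them."""
    for ch in channels:
        ch.put(list(weights))
    for ch in channels:
        ch.join()


threads = [
    threading.Thread(target=weight_updater, args=(rank,))
    for rank in range(len(channels))
]
for t in threads:
    t.start()

broadcast_weights([0.1, 0.2, 0.3])  # called after optimizer.step()
print(worker_weights)  # → [[0.1, 0.2, 0.3], [0.1, 0.2, 0.3]]

for ch in channels:
    ch.put(None)
for t in threads:
    t.join()
```

The point to notice is that each worker keeps its own buffer, which is why this mode costs two copies of the model.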

Option 2: Single-Copy Mode (`--single-copy`) ⭐ RECOMMENDED

True shared memory: the trainer and vLLM operate on a single copy of the model tensors. Use this when the trainer runs on the same GPUs as vLLM.

┌────────────────────────────────────────────────────────────┐
│                    SAME GPU(s)                             │
│                                                            │
│     ┌──────────────────────────────────────────────────┐  │
│     │         SHARED MODEL TENSORS                      │  │
│     │      (only ONE copy in GPU memory!)               │  │
│     └──────────────────────────────────────────────────┘  │
│              ▲                           ▲                 │
│              │ Reads/Writes              │ Reads           │
│     ┌────────┴───────┐          ┌────────┴───────┐        │
│     │    Trainer     │          │     vLLM       │        │
│     │  (gradients)   │          │  (inference)   │        │
│     └────────────────┘          └────────────────┘        │
│              │                                             │
│              │ optimizer.step()                            │
│              │ (updates shared tensors in-place)           │
│              ▼                                             │
│     vLLM immediately sees new weights!                     │
└────────────────────────────────────────────────────────────┘

Memory: 1x model size (truly shared via CUDA IPC!)
Sync Latency: 0ms (same memory, no copy needed)
GPU Layout: Trainer on SAME GPUs as vLLM (required!)
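Why an in-place update is enough can be shown with a CPU analogy. The sketch below is not CUDA IPC: it uses two `memoryview` handles over one `array` buffer to stand in for the trainer and vLLM sharing one set of tensors, and `sgd_step_in_place` is a hypothetical stand-in for `optimizer.step()`.

```python
from array import array

# One buffer standing in for the model tensors in GPU memory.
shared = array("f", [0.5, -1.0, 2.0])

trainer_view = memoryview(shared)  # read/write handle used by the "trainer"
vllm_view = memoryview(shared)     # handle used by "vLLM" for inference


def sgd_step_in_place(params, grads, lr):
    """Update params element by element so the underlying storage is reused."""
    for i, g in enumerate(grads):
        params[i] = params[i] - lr * g


sgd_step_in_place(trainer_view, grads=[1.0, 1.0, 1.0], lr=0.5)
print(vllm_view.tolist())  # → [0.0, -1.5, 1.5]
```

An update that allocated a new buffer instead of writing into the existing one would leave the second handle pointing at stale weights, so this mode depends on the optimizer updating tensors in place.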

When to Use Which

Mode	Memory	Sync	Use When
Broadcast (`--use-shared-memory`)	2x model	~0ms NCCL	Trainer on different GPUs
Single-Copy (`--single-copy`)	1x model	0ms	Trainer on same GPUs, memory constrained

Single-Copy Mode Usage

# vLLM and Trainer on SAME GPUs (0,1)
CUDA_VISIBLE_DEVICES=0,1 python -u example_trainer/vllm_api_server.py \
    --model Qwen/Qwen2.5-14B-Instruct \
    --tensor-parallel-size 2 \
    --port 9001 \
    > vllm.log 2>&1 &

# Wait for vLLM to load...

# Trainer also on GPUs 0,1 - shares vLLM's tensors!
CUDA_VISIBLE_DEVICES=0,1 python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-14B-Instruct \
    --weight-bridge-mode shared_vllm \
    --single-copy \
    --training-steps 100 \
    2>&1 | tee trainer.log

Alternative Modes

Mode 1: Legacy (Checkpoint + Restart)

For simple setups or debugging. The trainer saves checkpoints to disk and can restart vLLM so that it loads them, at a cost of roughly 30-60 seconds per sync.

python example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-3B-Instruct \
    --weight-bridge-mode none \
    --training-steps 100 \
    --vllm-restart-interval 10 \
    --batch-size 2 \
    --lr 1e-5
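To get a feel for what this mode costs, the arithmetic below estimates total restart time for the command above. It assumes that `--vllm-restart-interval 10` means one restart every 10 training steps, which this document does not spell out, and uses the 30-60 second figure from the Training Modes table.

```python
def restart_overhead_seconds(training_steps, restart_interval, sync_seconds):
    """Total restart time, assuming one restart every `restart_interval` steps."""
    restarts = training_steps // restart_interval
    return restarts * sync_seconds


# 100 steps, restart every 10 steps, 30-60 s per restart.
low = restart_overhead_seconds(100, 10, 30)
high = restart_overhead_seconds(100, 10, 60)
print(f"{low / 60:.0f}-{high / 60:.0f} minutes of sync overhead")
# → 5-10 minutes of sync overhead
```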

Mode 2: LoRA Adapters

Trains only adapter weights. Small checkpoints, lower memory.

python example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-3B-Instruct \
    --weight-bridge-mode lora_only \
    --lora-r 16 \
    --lora-alpha 32 \
    --training-steps 100 \
    --batch-size 2 \
    --lr 1e-4
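The small checkpoint size follows from the LoRA parameter count. The sketch below uses the standard LoRA formulation (a rank-r pair of matrices per adapted weight, scaled by alpha/r) with the `--lora-r` and `--lora-alpha` values from the command above; the 4096 x 4096 projection is a hypothetical layer size, not a dimension taken from Qwen2.5.

```python
def lora_extra_params(d_in, d_out, r):
    """Parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scale(alpha, r):
    """Standard LoRA scaling applied to the low-rank update B @ A."""
    return alpha / r


# Hypothetical 4096 x 4096 projection with r=16 and alpha=32.
full = 4096 * 4096
extra = lora_extra_params(4096, 4096, r=16)
print(extra, f"{extra / full:.2%}", lora_scale(32, 16))  # → 131072 0.78% 2.0
```

Under these assumptions the adapter trains under 1% of the parameters of the layer it wraps.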

Configuration Reference

Environment Variables

Variable	Required	Description	Example
`VLLM_ENABLE_SHARED_WEIGHTS`	Yes (shared mode)	Enable vLLM patching	`1`
`NUM_INFERENCE_NODES`	Yes	Number of vLLM nodes (0 = local)	`0`
`MASTER_ADDR`	Yes	Rendezvous address	`localhost`
`MASTER_PORT`	Yes	Rendezvous port	`29500`
`CUDA_VISIBLE_DEVICES`	Recommended	GPU allocation	`0,1` or `2`
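A launcher script can check the required variables before starting vLLM. This is a minimal sketch; `missing_env` is a hypothetical helper, and it treats an empty value as missing.

```python
import os

REQUIRED = (
    "VLLM_ENABLE_SHARED_WEIGHTS",
    "NUM_INFERENCE_NODES",
    "MASTER_ADDR",
    "MASTER_PORT",
)


def missing_env(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


example = {
    "VLLM_ENABLE_SHARED_WEIGHTS": "1",
    "NUM_INFERENCE_NODES": "0",
    "MASTER_ADDR": "localhost",
}
print(missing_env(example))  # → ['MASTER_PORT']
```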

Trainer CLI Options

Option	Default	Description
`--model-name`	(required)	HuggingFace model ID
`--weight-bridge-mode`	`none`	`none`, `shared_vllm`, or `lora_only`
`--use-shared-memory`	`False`	Enable NCCL weight broadcasting
`--vllm-port`	`9001`	vLLM server port
`--training-steps`	`10`	Total optimization steps
`--batch-size`	`2`	Micro-batch size
`--lr`	`1e-5`	Learning rate
`--save-path`	`trained_model_checkpoints`	Checkpoint directory
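For readers who want to wrap the trainer, the table maps directly onto an `argparse` parser. The code below is a reconstruction from the table only, not the parser in grpo.py, which may define further options (for example `--single-copy` and the LoRA flags used elsewhere in this document).

```python
import argparse


def build_parser():
    """Parser mirroring the options and defaults listed in the table."""
    p = argparse.ArgumentParser(description="Sketch of the trainer options")
    p.add_argument("--model-name", required=True, help="HuggingFace model ID")
    p.add_argument(
        "--weight-bridge-mode",
        default="none",
        choices=["none", "shared_vllm", "lora_only"],
    )
    p.add_argument("--use-shared-memory", action="store_true")
    p.add_argument("--vllm-port", type=int, default=9001)
    p.add_argument("--training-steps", type=int, default=10)
    p.add_argument("--batch-size", type=int, default=2)
    p.add_argument("--lr", type=float, default=1e-5)
    p.add_argument("--save-path", default="trained_model_checkpoints")
    return p


args = build_parser().parse_args(
    ["--model-name", "Qwen/Qwen2.5-3B-Instruct", "--weight-bridge-mode", "lora_only"]
)
print(args.weight_bridge_mode, args.training_steps, args.use_shared_memory)
# → lora_only 10 False
```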

vLLM Server Options

Option	Description
`--model`	HuggingFace model ID
`--tensor-parallel-size`	Number of GPUs for tensor parallelism
`--port`	Server port (default: 9001)
`--dtype`	Model dtype (`bfloat16`, `float16`, `auto`)

FAQ & Troubleshooting

Q: The trainer is stuck at "Creating Gloo process group..."

A: This means the trainer is waiting for the vLLM weight updater threads to connect. Check if the threads started:

grep -E "thread|updater|ERROR" vllm.log

You should see:

[vLLM Patch] ✓ Weight updater thread started (name: WeightUpdater_TP0)
[vLLM Patch] ✓ Weight updater thread started (name: WeightUpdater_TP1)

If not, ensure VLLM_ENABLE_SHARED_WEIGHTS=1 was set before starting vLLM.

Q: I get "CUDA out of memory" when starting the trainer

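A: In broadcast mode (`--use-shared-memory`) the trainer loads its own copy of the model, so placing it on the GPUs that vLLM already occupies exhausts memory. Put the trainer on separate GPUs: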

# vLLM on GPUs 0-1
CUDA_VISIBLE_DEVICES=0,1 python -u example_trainer/vllm_api_server.py ...

# Trainer on GPU 2
CUDA_VISIBLE_DEVICES=2 python -u example_trainer/grpo.py ...

Q: I see "daemonic processes are not allowed to have children"

A: This was a bug in older versions. The fix uses threads instead of processes for the weight updater. Make sure you have the latest patched_gpu_runner.py.

Q: The `vllm_bridge_config.json` shows `param_mappings: {}`

A: The vLLM patches didn't run. Check:

VLLM_ENABLE_SHARED_WEIGHTS=1 was set before starting vLLM
Look for [vLLM Patch] ✓ Exported X params in vllm.log

grep "Exported" vllm.log

Q: How do I verify the NCCL connection is working?

A: Check the trainer log for these messages:

[Bridge] ✓ Gloo group created
[Bridge] ✓ NCCL group created
[Bridge] ✓ All ranks synchronized and ready
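When the handshake stalls, it helps to know which stage was reached. The sketch below scans the trainer log for the three messages above in order; `first_missing_stage` is a hypothetical helper and the message strings are copied from this document.

```python
HANDSHAKE = (
    "Gloo group created",
    "NCCL group created",
    "All ranks synchronized and ready",
)


def first_missing_stage(log_text):
    """Return the first handshake message absent from the log, or None."""
    for stage in HANDSHAKE:
        if stage not in log_text:
            return stage
    return None


partial_log = "[Bridge] ✓ Gloo group created\n"
print(first_missing_stage(partial_log))  # → NCCL group created
```

A result of None means all three stages were logged.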

Q: What's the difference between Gloo and NCCL?

Gloo: CPU-based collective communication backend. Used here for synchronization barriers.
NCCL: NVIDIA's GPU collective communication library. Used here for broadcasting weight tensors.

Both are needed: Gloo for coordination, NCCL for fast tensor transfers.

Q: How do I check GPU memory usage?

nvidia-smi

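Expected for Qwen2.5-14B with shared mode. vLLM preallocates most of each GPU's memory for its KV cache, so the vLLM figure scales with the card's capacity rather than with model size: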

GPUs 0-1: ~168GB each (vLLM workers)
GPU 2: ~29GB (trainer)

Q: How do I stop all processes?

# Warning: the "python" pattern also kills unrelated Python processes owned by $USER.
pkill -9 -u $USER -f "vllm|grpo|python|run-api"

Q: The training is slow / not progressing

A: Check if all services are running:

ps aux | grep -E "(run_api|vllm|grpo|gsm8k)" | grep $USER

Check logs for errors:

tail -20 api.log
tail -20 gsm8k.log
tail -20 vllm.log
tail -20 trainer.log

Q: How do I use a smaller model for testing?

A: Use Qwen2.5-3B-Instruct with one GPU for vLLM and one for the trainer:

# vLLM on GPU 0
CUDA_VISIBLE_DEVICES=0 python -u example_trainer/vllm_api_server.py \
    --model Qwen/Qwen2.5-3B-Instruct \
    --port 9001 \
    > vllm.log 2>&1 &

# Trainer on GPU 1
CUDA_VISIBLE_DEVICES=1 python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-3B-Instruct \
    --weight-bridge-mode shared_vllm \
    --use-shared-memory \
    --training-steps 10 \
    2>&1 | tee trainer.log

Files in This Directory

File	Description
`grpo.py`	Main trainer script with all modes
`vllm_api_server.py`	Custom vLLM server with shared memory patches
`vllm_weight_bridge.py`	NCCL bridge for weight synchronization
`vllm_patching/`	vLLM patches for shared memory support
`requirements.txt`	Python dependencies
`README.md`	This documentation

vllm_patching/ Directory

File	Description
`__init__.py`	Module exports
`patched_gpu_runner.py`	Patches GPUModelRunner for shared memory
`weight_updater.py`	Thread that receives NCCL weight broadcasts
`distributed_utils.py`	Process group initialization helpers

Performance Comparison

Mode	Sync Latency	Memory (14B model)	Best For
Legacy	30-60s	2x model	Debugging
Shared vLLM	~0ms	1x model (shared) + trainer	Production
LoRA	5-10s	1x model + adapters	Memory-constrained

Checkpoint Locations

Mode	Location	Size
Legacy	`trained_model_checkpoints/step_N/`	~28GB (14B model)
Shared vLLM	`trained_model_checkpoints/step_N/`	~28GB
LoRA	`trained_model_checkpoints/adapter_step_N/`	~50MB
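The ~28GB figure for full checkpoints is consistent with simple arithmetic: parameters times bytes per parameter. The sketch below uses the nominal 14 billion parameter count and the bfloat16 dtype from the launch commands; `checkpoint_gb` is a hypothetical helper and it ignores optimizer state and file-format overhead.

```python
BYTES_PER_PARAM = {"bfloat16": 2, "float16": 2, "float32": 4}


def checkpoint_gb(num_params, dtype="bfloat16"):
    """Approximate weight-only checkpoint size in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


print(checkpoint_gb(14e9))  # → 28.0
```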

Example Training Runs

Quick Test (3B model, LoRA)

python example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-3B-Instruct \
    --weight-bridge-mode lora_only \
    --training-steps 5 \
    --batch-size 1

Production (14B model, Shared vLLM)

# See Step-by-Step Guide above
CUDA_VISIBLE_DEVICES=2 python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-14B-Instruct \
    --weight-bridge-mode shared_vllm \
    --use-shared-memory \
    --training-steps 1000 \
    --batch-size 4 \
    --lr 1e-6

Multi-GPU Training (72B model)

# vLLM on GPUs 0-3 (tensor parallel 4)
CUDA_VISIBLE_DEVICES=0,1,2,3 python -u example_trainer/vllm_api_server.py \
    --model Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 \
    --port 9001 \
    > vllm.log 2>&1 &

# Trainer on GPUs 4-5
CUDA_VISIBLE_DEVICES=4,5 python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-72B-Instruct \
    --weight-bridge-mode shared_vllm \
    --use-shared-memory \
    --training-steps 100 \
    2>&1 | tee trainer.log
