
GRPO Example Trainer

This directory contains an example script (grpo.py) demonstrating how to integrate a custom training loop with the Atropos API for reinforcement learning using the GRPO (Group Relative Policy Optimization) algorithm.

Training Modes

The trainer supports three weight synchronization modes:

| Mode | Description | Sync Latency | Best For |
|------|-------------|--------------|----------|
| Legacy (none) | Save checkpoints, restart vLLM | ~30-60 seconds | Simple setups, debugging |
| Single-Copy (shared_vllm) | Direct CUDA IPC - ONE model copy! | 0 ms | Production, memory efficiency |
| LoRA (lora_only) | Train adapters, hot-swap | ~1-5 seconds | Memory-constrained, fast iteration |

Quick Start with GSM8k (Single-Copy Mode)

This is the recommended production setup for maximum training throughput and memory efficiency.

Prerequisites

# Install dependencies
pip install -r example_trainer/requirements.txt

# Install GSM8k environment dependencies
pip install datasets latex2sympy2_extended math_verify
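
Before starting any servers, a quick import check confirms the GSM8k dependencies installed cleanly (the import names below are assumed to match the pip package names; adjust if yours differ):

# Optional sanity check for the GSM8k dependencies
python -c "import datasets, latex2sympy2_extended, math_verify; print('GSM8k deps OK')"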

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                    SINGLE-COPY TRAINING ARCHITECTURE                         │
│                                                                              │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────────────┐  │
│  │ GSM8k Env   │───▶│ Atropos API │◀───│ GRPO Trainer                    │  │
│  │ (problems)  │    │ (batching)  │    │ - Attached to vLLM's tensors    │  │
│  └─────────────┘    └─────────────┘    │ - optimizer.step() updates both │  │
│         │                              └─────────────────────────────────┘  │
│         │                                              │                     │
│         │                                              │ CUDA IPC            │
│         │                                              │ (same memory!)      │
│         ▼                                              ▼                     │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │              vLLM Inference Server (GPU 0)                           │    │
│  │         - Model weights in GPU memory                                │    │
│  │         - Trainer sees same tensors via IPC                         │    │
│  │         - Generates rollouts for scoring                            │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘

How Single-Copy Mode Works

┌────────────────────────────────────────────────────────────┐
│                    SAME GPU(s)                             │
│                                                            │
│     ┌──────────────────────────────────────────────────┐  │
│     │         SHARED MODEL TENSORS                      │  │
│     │      (only ONE copy in GPU memory!)               │  │
│     └──────────────────────────────────────────────────┘  │
│              ▲                           ▲                 │
│              │ Reads/Writes              │ Reads           │
│     ┌────────┴───────┐          ┌────────┴───────┐        │
│     │    Trainer     │          │     vLLM       │        │
│     │  (gradients)   │          │  (inference)   │        │
│     └────────────────┘          └────────────────┘        │
│              │                                             │
│              │ optimizer.step()                            │
│              │ (updates shared tensors in-place)           │
│              ▼                                             │
│     vLLM immediately sees new weights!                     │
└────────────────────────────────────────────────────────────┘
  • Memory: 1x model size (truly shared via CUDA IPC!)
  • Sync Latency: 0ms (same memory, no copy needed)
  • Requirement: Trainer and vLLM on SAME GPU(s)

Step-by-Step Guide

IMPORTANT: GPU Allocation

  • vLLM and Trainer run on the SAME GPU(s)
  • Use tensor-parallel-size 1 for single-copy mode (TP>1 not yet supported)

Step 1: Kill Any Existing Processes

pkill -9 -u $USER -f "vllm|grpo|python|run-api" 2>/dev/null; sleep 3

Step 2: Setup Directory

cd ~/atropos_stuff/atropos
rm -f vllm_bridge_config.json vllm.log trainer.log api.log gsm8k.log

Step 3: Set Environment Variables

export VLLM_ENABLE_SHARED_WEIGHTS=1
export VLLM_SKIP_WEIGHT_DAEMON=1
export NUM_INFERENCE_NODES=0
export LOGDIR=.
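
To confirm the variables are visible to the shell that will launch the servers, you can print them back:

env | grep -E "VLLM_ENABLE_SHARED_WEIGHTS|VLLM_SKIP_WEIGHT_DAEMON|NUM_INFERENCE_NODES|LOGDIR"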

Step 4: Start vLLM Server

CUDA_VISIBLE_DEVICES=0 python -u example_trainer/vllm_api_server.py \
    --model Qwen/Qwen2.5-14B-Instruct \
    --tensor-parallel-size 1 \
    --port 9001 \
    > vllm.log 2>&1 &
echo "vLLM starting on GPU 0..."

Step 5: Wait for vLLM to Load

tail -f vllm.log

Wait until you see: Uvicorn running on http://0.0.0.0:9001

Then press Ctrl+C to stop tailing.
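
If you prefer not to watch the log, you can poll the server instead. This sketch assumes the custom server exposes the OpenAI-compatible /v1/models route (the same /v1 base URL the GSM8K environment uses in Step 7):

# Poll until the server answers (Ctrl+C to abort)
until curl -sf http://localhost:9001/v1/models > /dev/null; do
    echo "Waiting for vLLM..."; sleep 10
done
echo "vLLM is up"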

Step 6: Verify IPC Handles Exported

grep -E "IPC|Exported|single_copy" vllm.log

You should see:

[vLLM Patch] Exported X IPC handles for single-copy mode
[vLLM Patch] ✓ Exported 339 params to vllm_bridge_config.json
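
If the grep shows nothing, check whether the bridge config file was actually written to $LOGDIR (the current directory in this guide):

test -s vllm_bridge_config.json && echo "bridge config present" \
    || echo "bridge config missing - was VLLM_ENABLE_SHARED_WEIGHTS=1 set before starting vLLM?"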

Step 7: Start GSM8K Environment

python environments/gsm8k_server.py serve \
    --slurm False \
    --openai.model_name Qwen/Qwen2.5-14B-Instruct \
    --openai.base_url http://localhost:9001/v1 \
    --openai.server_type vllm \
    --openai.api_key x \
    --env.tokenizer_name Qwen/Qwen2.5-14B-Instruct \
    --env.use_wandb False \
    > gsm8k.log 2>&1 &
echo "GSM8K environment started"
sleep 10
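
Before moving on, a quick (non-exhaustive) scan of the environment log for startup errors can save a confusing failure later:

grep -iE "error|traceback" gsm8k.log || echo "no errors in gsm8k.log so far"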

Step 8: Start Trainer (Same GPU as vLLM!)

CUDA_VISIBLE_DEVICES=0 LOGDIR=. python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-14B-Instruct \
    --weight-bridge-mode shared_vllm \
    --training-steps 100 \
    2>&1 | tee trainer.log

Step 9: Monitor Training

tail -f trainer.log

You should see:

[Setup] ✓ Attached 195 tensors to vLLM's shared memory
[Setup] ✓ Single-copy mode active - using vLLM's tensors directly!
[2/2] Starting training for 100 steps
Step 1/100
  [SINGLE-COPY] Weights updated in-place - step 1
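
In a second terminal you can also watch GPU memory while training runs; in single-copy mode it should stay near a single model copy (see the GPU memory FAQ below) rather than doubling:

# Refresh every 5 seconds (Ctrl+C to stop)
nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv -l 5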

Quick Copy-Paste (All-in-One)

# Kill everything and setup
pkill -9 -u $USER -f "vllm|grpo|python" 2>/dev/null; sleep 3
cd ~/atropos_stuff/atropos
rm -f vllm_bridge_config.json *.log

# Environment variables
export VLLM_ENABLE_SHARED_WEIGHTS=1 VLLM_SKIP_WEIGHT_DAEMON=1 NUM_INFERENCE_NODES=0 LOGDIR=.

# Start vLLM
CUDA_VISIBLE_DEVICES=0 python -u example_trainer/vllm_api_server.py \
    --model Qwen/Qwen2.5-14B-Instruct --tensor-parallel-size 1 --port 9001 > vllm.log 2>&1 &
echo "Waiting 90s for vLLM..."; sleep 90

# Start GSM8k environment
python environments/gsm8k_server.py serve --slurm False \
    --openai.model_name Qwen/Qwen2.5-14B-Instruct \
    --openai.base_url http://localhost:9001/v1 \
    --openai.server_type vllm --openai.api_key x \
    --env.tokenizer_name Qwen/Qwen2.5-14B-Instruct \
    --env.use_wandb False > gsm8k.log 2>&1 &
sleep 10

# Start trainer (same GPU!)
CUDA_VISIBLE_DEVICES=0 LOGDIR=. python -u example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-14B-Instruct \
    --weight-bridge-mode shared_vllm \
    --training-steps 100 \
    2>&1 | tee trainer.log

Alternative Modes

Mode 1: Legacy (Checkpoint + Restart)

For simple setups or debugging. Saves checkpoints and restarts vLLM to load new weights.

python example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-3B-Instruct \
    --weight-bridge-mode none \
    --training-steps 100 \
    --vllm-restart-interval 10 \
    --batch-size 2 \
    --lr 1e-5

Mode 2: LoRA Adapters

Trains only adapter weights. Small checkpoints, lower memory.

python example_trainer/grpo.py \
    --model-name Qwen/Qwen2.5-3B-Instruct \
    --weight-bridge-mode lora_only \
    --lora-r 16 \
    --lora-alpha 32 \
    --training-steps 100 \
    --batch-size 2 \
    --lr 1e-4
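
Each LoRA checkpoint contains only the adapter weights. After a few saves you can inspect them (assuming the default --save-path; see Checkpoint Locations below):

ls -la trained_model_checkpoints/adapter_step_*/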

Configuration Reference

Environment Variables

| Variable | Required | Description | Example |
|----------|----------|-------------|---------|
| VLLM_ENABLE_SHARED_WEIGHTS | Yes (single-copy) | Enable vLLM patching for IPC | 1 |
| VLLM_SKIP_WEIGHT_DAEMON | Yes (single-copy) | Skip NCCL daemon (not needed) | 1 |
| NUM_INFERENCE_NODES | Yes | Number of vLLM nodes (0 = local) | 0 |
| LOGDIR | Recommended | Directory for vllm_bridge_config.json | . |
| CUDA_VISIBLE_DEVICES | Recommended | GPU allocation | 0 |

Trainer CLI Options

| Option | Default | Description |
|--------|---------|-------------|
| --model-name | (required) | HuggingFace model ID |
| --weight-bridge-mode | none | none, shared_vllm, or lora_only |
| --vllm-port | 9001 | vLLM server port |
| --training-steps | 10 | Total optimization steps |
| --batch-size | 2 | Micro-batch size |
| --lr | 1e-5 | Learning rate |
| --save-path | trained_model_checkpoints | Checkpoint directory |

vLLM Server Options

| Option | Description |
|--------|-------------|
| --model | HuggingFace model ID |
| --tensor-parallel-size | Number of GPUs (use 1 for single-copy) |
| --port | Server port (default: 9001) |
| --dtype | Model dtype (bfloat16, float16, auto) |

FAQ & Troubleshooting

Q: I get "Could not find vllm_bridge_config.json"

A: vLLM didn't export the IPC handles. Check:

  1. VLLM_ENABLE_SHARED_WEIGHTS=1 was set before starting vLLM
  2. Look for export messages in vllm.log:
grep "Exported" vllm.log

Q: I get "CUDA out of memory" when starting the trainer

A: For single-copy mode, trainer and vLLM MUST be on the same GPU(s). Check:

# Both should use the same CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0 python ... vllm_api_server.py ...
CUDA_VISIBLE_DEVICES=0 python ... grpo.py ...
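
You can also confirm that both processes actually landed on the same physical GPU by listing the compute processes nvidia-smi sees:

nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv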

Q: Trainer crashes with "Cannot copy out of meta tensor"

A: Some model buffers (like rotary embeddings) weren't initialized. This is a known issue being fixed. Update to the latest code.


Q: Single-copy mode doesn't work with tensor-parallel > 1

A: Currently, single-copy mode only works with tensor-parallel-size 1. For larger models that need tensor parallelism, use a single GPU with a smaller model, or wait for multi-GPU single-copy support.


Q: How do I check GPU memory usage?

A:

nvidia-smi

For single-copy mode with Qwen2.5-14B:

  • GPU 0: ~28GB (shared between vLLM and trainer)

Q: How do I stop all processes?

A:

pkill -9 -u $USER -f "vllm|grpo|python|run-api"

Files in This Directory

| File | Description |
|------|-------------|
| grpo.py | Main trainer script with all modes |
| vllm_api_server.py | Custom vLLM server with shared memory patches |
| vllm_patching/ | vLLM patches for CUDA IPC support |
| requirements.txt | Python dependencies |
| README.md | This documentation |

vllm_patching/ Directory

| File | Description |
|------|-------------|
| __init__.py | Module exports and patch application |
| patched_gpu_runner.py | Patches GPUModelRunner to export IPC handles |
| distributed_utils.py | Distributed training utilities |

Performance Comparison

| Mode | Sync Latency | Memory (14B model) | Best For |
|------|--------------|--------------------|----------|
| Legacy | 30-60 s | 2x model | Debugging |
| Single-Copy | 0 ms | 1x model (shared!) | Production |
| LoRA | 5-10 s | 1x model + adapters | Memory-constrained |

Checkpoint Locations

| Mode | Location | Size |
|------|----------|------|
| Legacy | trained_model_checkpoints/step_N/ | ~28 GB (14B model) |
| Single-Copy | trained_model_checkpoints/step_N/ | ~28 GB |
| LoRA | trained_model_checkpoints/adapter_step_N/ | ~50 MB |
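
To check how much disk the checkpoints are taking in any mode (assuming the default --save-path):

du -sh trained_model_checkpoints/*/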