keep debugging flags for future use

2026-04-19 12:57:58 +00:00 · 2026-01-19 14:19:22 -05:00 · 2026-01-19 14:19:22 -05:00 · f87455a712
commit f87455a712
parent d4589e1107
2 changed files with 314 additions and 88 deletions
--- a/example_trainer/README.md
+++ b/example_trainer/README.md
@ -188,7 +188,7 @@ You should see:
 [vLLM Patch] ✓ Exported 339 params to vllm_bridge_config.json
 ```

-#### Step 7: Start GSM8K Environment
+#### Step 7: Start an Environment (THE EXAMPLE HERE IS GSM8K in this case)

 ```bash
 python environments/gsm8k_server.py serve \
@ -475,15 +475,6 @@ CUDA_VISIBLE_DEVICES=0 LOGDIR=. python -u example_trainer/grpo.py \
 │  └─────────────────────────────────────────────────────────────────────────────┘   │
 └─────────────────────────────────────────────────────────────────────────────────────┘
 ```
-
-**Key Points:**
- ✅ Simple to understand
- ✅ Works on any setup
- ✅ Good for debugging
- ⚠️ 30-60 second sync latency
- ⚠️ 2x GPU memory (trainer + vLLM)
- ⚠️ Large checkpoint files (~28GB each)
-
 ---

 ## Mode Comparison Summary
@ -498,7 +489,7 @@ CUDA_VISIBLE_DEVICES=0 LOGDIR=. python -u example_trainer/grpo.py \
 │ GPU Memory     │ 1x model      │ 2x model       │ 2x model                       │
 │ Disk Space     │ 28GB/ckpt     │ 50MB/adapter   │ 28GB/ckpt                      │
 │ Complexity     │ Medium        │ Medium         │ Simple                         │
-│ Same GPU?      │ Required ⚠️   │ Optional       │ Optional                       │
+│ Same GPU?      │ Required      │ Optional       │ Optional                       │
 │ Best For       │ Production    │ Experiments    │ Debugging                      │
 └────────────────┴───────────────┴────────────────┴────────────────────────────────┘
 ```
@ -820,3 +811,177 @@ pkill -9 -u $USER -f "vllm|grpo|python|run-api"
 | Legacy | `trained_model_checkpoints/step_N/` | ~28GB (14B model) |
 | Single-Copy | `trained_model_checkpoints/step_N/` | ~28GB |
 | LoRA | `trained_model_checkpoints/adapter_step_N/` | ~50MB |
+
+---
+
+## Feature Availability Matrix
+
+### What's Available 
+
+| Feature | Status | Notes |
+|---------|--------|-------|
+| **Single-Copy Mode** |  Working | True shared memory via CUDA IPC |
+| **LoRA Mode** |  Working | Hot-swap adapters without restart |
+| **Legacy Mode** |  Working | Checkpoint-based, restart vLLM |
+| **Qwen Models** |  Working | Qwen2, Qwen2.5 (0.5B - 72B) |
+| **Llama Models** |  Working | Llama-2, Llama-3, Llama-3.1 |
+| **Mistral Models** |  Working | Mistral-7B, Mixtral |
+| **Single GPU** |  Working | All modes supported |
+| **bfloat16/float16** |  Working | Configurable via `--dtype` |
+| **Gradient Checkpointing** |  Available | Reduces memory usage |
+| **Wandb Logging** |  Working | Via `--use-wandb` flag |
+| **Custom Environments** |  Working | Extend `BaseEnv` class |
+
+### What's NOT Available 
+
+| Feature | Mode | Status | Reason / Workaround |
+|---------|------|--------|---------------------|
+| **Multi-GPU (TP > 1)** | Single-Copy |  Not Supported | CUDA IPC handles are per-device; sharding complicates sharing |
+| **Multi-GPU (TP > 1)** | LoRA |  Supported | vLLM handles TP, trainer only swaps adapters |
+| **Multi-GPU (TP > 1)** | Legacy |  Supported | Standard vLLM with TP supported |
+| **Pipeline Parallel** | Single-Copy |  Not Supported | Would need cross-device IPC |
+| **Pipeline Parallel** | LoRA/Legacy |  Via vLLM | Use `--pipeline-parallel-size` flag |
+| **Data Parallel** | Single-Copy |  Not Supported | Shared tensors can't be safely updated by multiple trainers |
+| **Data Parallel** | LoRA/Legacy |  Manual | Run multiple trainer instances (see docs below) |
+| **Multi-Node** | Single-Copy |  Not Supported | CUDA IPC is single-node only |
+| **Multi-Node** | LoRA/Legacy |  Via vLLM | vLLM supports distributed inference |
+| **DeepSpeed/FSDP** | All |  Not Integrated | Would require custom integration with trainer |
+| **Quantized Models** | Single-Copy |  Not Supported | IPC handles may not work with quantized tensors |
+| **Quantized Models** | LoRA/Legacy |  Supported | Standard vLLM quantization (GPTQ, AWQ, etc.) |
+| **Encoder-Decoder** | All |  Not Supported | Architecture not supported by vLLM |
+
+### Multi-GPU Support Summary
+
+| Mode | Tensor Parallel | Pipeline Parallel | Data Parallel |
+|------|-----------------|-------------------|---------------|
+| **Single-Copy** |  TP=1 only |  Not Supported |  Not Supported |
+| **LoRA** |  Supported |  Via vLLM |  Multiple Trainers |
+| **Legacy** |  Supported |  Via vLLM |  Multiple Trainers |
+
+> **Key Point**: The multi-GPU limitation is **ONLY for single-copy mode** due to CUDA IPC constraints. 
+> LoRA and Legacy modes work with standard vLLM which fully supports tensor parallelism.
+
+#### Pipeline Parallel (PP)
+
+vLLM supports pipeline parallelism via `--pipeline-parallel-size`. For LoRA/Legacy modes:
+
+```bash
+# LoRA/Legacy with Pipeline Parallel (2 GPUs for PP)
+python -u example_trainer/vllm_api_server.py \
+    --model Qwen/Qwen2.5-14B-Instruct \
+    --tensor-parallel-size 1 \
+    --pipeline-parallel-size 2 \
+    --port 9001
+```
+
+**Note**: PP requires the model to be split across GPUs by layers. Performance may vary.
+
+#### Data Parallel (DP)
+
+Data parallelism means running **multiple trainer instances** against the same vLLM server. Each trainer processes different batches:
+
+```bash
+# Terminal 1: First trainer instance
+CUDA_VISIBLE_DEVICES=4 python -u example_trainer/grpo.py \
+    --model-name Qwen/Qwen2.5-7B-Instruct \
+    --weight-bridge-mode lora_only \
+    --trainer-rank 0 \
+    --world-size 2 \
+    > trainer_0.log 2>&1 &
+
+# Terminal 2: Second trainer instance
+CUDA_VISIBLE_DEVICES=5 python -u example_trainer/grpo.py \
+    --model-name Qwen/Qwen2.5-7B-Instruct \
+    --weight-bridge-mode lora_only \
+    --trainer-rank 1 \
+    --world-size 2 \
+    > trainer_1.log 2>&1 &
+```
+
+**Note**: DP requires gradient synchronization between trainers. Currently, trainers operate independently - true distributed DP would need additional coordination.
+
+### GPU Support
+
+| GPU Type | Single-Copy | LoRA | Legacy | Notes |
+|----------|-------------|------|--------|-------|
+| **NVIDIA A100** | YES | YES | YES | Recommended |
+| **NVIDIA H100** | YES | YES | YES | Recommended |
+| **NVIDIA B200** | YES | YES | YES | Recommended |
+| **NVIDIA RTX 4090** | YES | YES | YES | Consumer, works well |
+| **NVIDIA RTX 3090** | YES | YES | YES | Consumer, works well |
+| **NVIDIA V100** | ? | YES | YES | Old, may have IPC issues |
+
+### Memory Requirements (Approximate)
+
+| Model Size | Single-Copy | LoRA | Legacy |
+|------------|-------------|------|--------|
+| 0.5B - 1B | 4-6 GB | 4-6 GB | 8-12 GB |
+| 3B | 8-12 GB | 8-12 GB | 16-24 GB |
+| 7B | 16-20 GB | 16-20 GB | 32-40 GB |
+| 14B | 32-40 GB | 32-40 GB | 64-80 GB |
+| 32B | 70-80 GB | 70-80 GB | 140+ GB |
+| 70B+ | Single GPU impossible | 80+ GB | 160+ GB |
+
+> **Note**: Single-copy mode uses ~50% less memory than legacy because there's only ONE model copy.
+
+---
+
+## Known Limitations
+
+### Single-Copy Mode Specific
+
+| Limitation | Reason | Workaround |
+|------------|--------|------------|
+| **Same GPU Required** | CUDA IPC only works within same physical device | Use same `CUDA_VISIBLE_DEVICES` for trainer and vLLM |
+| **TP=1 Only** | Trainer expects unsharded model; IPC per-device | Use LoRA mode for TP > 1 |
+| **Custom Server Required** | Standard `vllm serve` doesn't export IPC handles | Use `vllm_api_server.py` |
+| **Single Node Only** | CUDA IPC is node-local | Use LoRA/Legacy for multi-node |
+
+### LoRA Mode Specific
+
+| Limitation | Reason | Workaround |
+|------------|--------|------------|
+| **~5s Swap Latency** | Adapter weights need to be loaded | Acceptable for most use cases |
+| **vLLM LoRA Support Required** | Model must support LoRA in vLLM | Check vLLM documentation |
+
+### General Limitations
+
+| Limitation | Reason | Workaround |
+|------------|--------|------------|
+| **GSM8k Needs `server_type=vllm`** | Default `openai` type lacks state tracking | Use `--openai.server_type vllm` |
+| **Decoder-Only Models Only** | vLLM architecture constraint | Use different framework for encoder-decoder |
+| **Custom vLLM Server Required** | Standard `vllm serve` lacks IPC patches | Use `vllm_api_server.py` for all modes |
+
+---
+
+## Future Work
+
+### High Priority 
+
+| Feature | Description |
+|---------|-------------|
+| **Multi-GPU Single-Copy** | Support `tensor-parallel-size > 1` with sharded IPC |
+| **Automatic Server Type Detection** | Auto-detect correct `server_type` for environments |
+| **Checkpoint Resume** | Resume training from checkpoints seamlessly |
+
+### Medium Priority 
+
+| Feature | Description | Difficulty |
+|---------|-------------|------------|
+| **DeepSpeed Integration** | ZeRO optimization for larger models | Hard |
+| **Quantization Support** | Test and document GPTQ/AWQ in single-copy | Medium |
+| **Multi-Node Training** | Distributed training across nodes | Hard |
+| **Streaming Weights** | Stream weight updates instead of full sync | Medium |
+| **Mixed Precision Training** | Support fp8/int8 training | Medium |
+
+
+## Contributing
+
+We welcome contributions! Priority areas:
+
+1. **Multi-GPU single-copy support** - The biggest missing feature
+2. **Better documentation** - More examples, tutorials
+3. **Environment implementations** - New RL environments
+4. **Bug fixes** - Especially edge cases in IPC handling
+
+See the main repository CONTRIBUTING.md for guidelines.