diff --git a/example_trainer/README.md b/example_trainer/README.md
index d5e19643..a6820614 100644
--- a/example_trainer/README.md
+++ b/example_trainer/README.md
@@ -180,7 +180,7 @@ python -m example_trainer.vllm_api_server --model ... --enable-lora --enforce-ea
 # 3. Wait for vLLM health endpoint to return 200
 while ! curl -s http://localhost:9001/health > /dev/null; do sleep 1; done
 
-# 4. Start environment (MUST use --openai.server_type vllm for logprobs)
+# 4. Start environment (use --openai.server_type vllm for logprobs)
 python environments/gsm8k_server.py serve \
   --env.group_size 4 \
   --env.batch_size 16 \
@@ -290,10 +290,10 @@ environment uses the `/generate` path and includes token-level
 `inference_logprobs` in the trajectory payload consumed by the trainer.
 
 ```bash
-# CORRECT - gets logprobs for training (REQUIRED!)
+# gets logprobs for training
 --openai.server_type vllm
 
-# WRONG for this trainer path - missing rollout inference_logprobs
+# does NOT return rollout inference_logprobs — trainer will error
 --openai.server_type openai
 ```
 
@@ -304,25 +304,20 @@ environment uses the `/generate` path and includes token-level
 4. Trainer extracts and aligns logprobs with training labels
 5. GRPO loss uses these rollout logprobs in importance-ratio terms
 
-### 2. Clipping Is Essential
-
-Keep clipping enabled to avoid unstable policy updates:
+### 2. Clipping
 
 ```bash
 --clip-eps 0.2 # Limits importance sampling ratio to [0.8, 1.2]
 ```
 
-**Why this matters:**
-- **PPO Clipping** (ε): Clips the importance sampling ratio to `[1-ε, 1+ε]`
-  - Prevents catastrophically large policy updates
-  - Takes pessimistic bound (conservative update)
-
 **Symptoms of missing/misconfigured clipping:**
 - Accuracy drops dramatically (e.g., 59% → 7%)
 - Loss goes to very negative values (< -10)
 - Model outputs become repetitive/degenerate
 - `mean_ratio` diverges far from 1.0
 
+For background on clipping and importance sampling, see https://fengyao.notion.site/off-policy-rl
+
 ### 3. Use LR Warmup for Stability
 
 Use a short linear warmup when training from fresh runs or small batch settings: