remove KL

2026-04-30 17:40:36 +00:00 · 2026-02-27 15:55:16 -05:00
commit d2ea8cd612
parent dbf6026165
11 changed files with 48 additions and 156 deletions
--- a/environments/community/mcp_tool_calling/GRPO_README.md
+++ b/environments/community/mcp_tool_calling/GRPO_README.md
@@ -81,14 +81,13 @@ atropos-grpo \
  --gradient-accumulation-steps 64 \
  --warmup-steps 5 \
  --training-steps 30 \
-  --kl-coef 0.0 \
  --clip-eps 0.2
 ```

 ## Objective Notes

 - GRPO uses rollout/inference logprobs (`pi_old`) for importance-ratio computation.
-- The optional KL-like term is sampled-token regularization against rollout policy logprobs, not a separate frozen-reference-model KL.
+- The trainer currently uses clipped importance-ratio updates without a separate frozen-reference-model KL term.

 ## Outputs