mirror of https://github.com/NousResearch/atropos.git
synced 2026-04-30 17:40:36 +00:00
remove KL
parent dbf6026165
commit d2ea8cd612
11 changed files with 48 additions and 156 deletions
@@ -81,14 +81,13 @@ atropos-grpo \
--gradient-accumulation-steps 64 \
--warmup-steps 5 \
--training-steps 30 \
- --kl-coef 0.0 \
--clip-eps 0.2
```
## Objective Notes
- GRPO uses rollout/inference logprobs (`pi_old`) for importance-ratio computation.
-- The optional KL-like term is sampled-token regularization against rollout policy logprobs, not a separate frozen-reference-model KL.
+- The trainer currently uses clipped importance-ratio updates without a separate frozen-reference-model KL term.
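
As a rough illustration of the notes above, a clipped importance-ratio objective can be sketched as follows. This is a minimal sketch and not the repository's trainer code: the function name `clipped_surrogate_loss` and its arguments are hypothetical, the per-token advantages are assumed to be precomputed, and the default `clip_eps=0.2` only mirrors the `--clip-eps 0.2` flag in the command above. The ratio is formed from the current policy's logprobs and the rollout/inference logprobs (`pi_old`), and no KL term is added.

```python
import math


def clipped_surrogate_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Hypothetical helper: mean clipped importance-ratio loss over sampled tokens.

    new_logprobs: logprobs of the sampled tokens under the current policy
    old_logprobs: logprobs of the same tokens under the rollout policy (pi_old)
    advantages:   per-token advantages (assumed to be precomputed elsewhere)
    """
    if not (len(new_logprobs) == len(old_logprobs) == len(advantages)):
        raise ValueError("all inputs must have the same length")
    if not new_logprobs:
        raise ValueError("inputs must be non-empty")

    total = 0.0
    for lp_new, lp_old, adv in zip(new_logprobs, old_logprobs, advantages):
        # Importance ratio pi_new / pi_old, computed from log-probabilities.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - clip_eps, 1 + clip_eps].
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic choice between the unclipped and clipped objectives.
        total += min(ratio * adv, clipped_ratio * adv)

    # Negate because the objective is maximized while a loss is minimized.
    return -total / len(new_logprobs)
```

With identical current and rollout logprobs the ratio is 1 for every token, so the loss reduces to the negative mean advantage. Clipping only changes the result when the ratio leaves the `[1 - clip_eps, 1 + clip_eps]` band in the direction that would otherwise increase the objective.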
## Outputs