Update README.md (#118)

This commit is contained in:
fuder.eth 2025-05-28 03:24:12 +03:00 committed by GitHub
parent f21154ff49
commit 1862b193ee
No known key found for this signature in database
GPG key ID: B5690EEEBB952194

View file

@ -23,7 +23,7 @@ This design pattern allows the agent to be trained on manageable segments that f
## The Problem: Ultra-Long Episode Sequences
Reinforcement learning for language models involves training on sequences of observations, thoughts, and actions. When an agent engages in detailed "thinking" steps (e.g., within `<think>...</think>` blocks), the token length of a single turn can be substantial. Over an entire episode, the total sequence length can easily exceed the maximum sequence length (e.g., 4k, 8k, or even 16k tokens) that contemporary LLMs can process in a single forward pass for training (ie, the maximum sequence length in the trainer). Even a short environment like Blackjack can blow this up in only a couple of steps if it happens to use very long thinking blocks. We could probably accomodate this for Blackjack by increasing the maximum sequence length, at the cost of more resources for the trainer, but this isn't a great solution in general. Many environments can take hundreds or thousands of steps to complete - so a way of managing this is necessary. Blackjack is just used to demonstrate how this works - it can be much more simply implemented without thinking blocks as in `blackjack_env_no_thinking`, but even WITHOUT thinking blocks there's plenty of environments that will run far beyond any reasonable maximum sequence length anyway. Not to mention agents that might make use of RAG or some kind of external planning that will further bloat the token count.
Reinforcement learning for language models involves training on sequences of observations, thoughts, and actions. When an agent engages in detailed "thinking" steps (e.g., within `<think>...</think>` blocks), the token length of a single turn can be substantial. Over an entire episode, the total sequence length can easily exceed the maximum sequence length (e.g., 4k, 8k, or even 16k tokens) that contemporary LLMs can process in a single forward pass for training (ie, the maximum sequence length in the trainer). Even a short environment like Blackjack can blow this up in only a couple of steps if it happens to use very long thinking blocks. We could probably accommodate this for Blackjack by increasing the maximum sequence length, at the cost of more resources for the trainer, but this isn't a great solution in general. Many environments can take hundreds or thousands of steps to complete - so a way of managing this is necessary. Blackjack is just used to demonstrate how this works - it can be much more simply implemented without thinking blocks as in `blackjack_env_no_thinking`, but even WITHOUT thinking blocks there's plenty of environments that will run far beyond any reasonable maximum sequence length anyway. Not to mention agents that might make use of RAG or some kind of external planning that will further bloat the token count.
Standard RL training loops assume that an entire episode or a significant, coherent sub-segment (rollout) can be fed into the policy network (ie, the LLM being trained). When episodes are orders of magnitude longer than the `seq_len`, a naive approach of simply truncating or randomly sampling segments breaks the coherence needed for effective policy evaluation and improvement, especially for algorithms like GRPO (**Group Relative Policy Optimization**) that rely on comparing alternative actions from a common state.
@ -41,7 +41,7 @@ Key components of this approach:
\[ A(s_t, a_i) = R_i + \gamma V(s'_{i}) - V(s_t) \]
(In `_collect_trajectory`, `gamma` is effectively 1, and \(R_i\) is represented by `alt_combined_rewards[i]`, \(V(s'_{i})\) by `alt_value_next[i]`, and \(V(s_t)\) by `value_t`).
Note: This has nothing to do with GPRO's internal advantage calculations! Don't get it mixed up, this is just used to help provide some credit for intermediate actions where immediate action rewards aren't available (as well as selecting the next canoncial action). Supplementing the actually winning trajectory scores (as in, the canonical trajectory) with the final outcome and a discount factor to assign credit to earlier actions would be an obvious improvement, which has been left off to keep things a little simpler (and will be explored more in another environment with longer trajectories and more sparse rewards where it might matter more to training)
Note: This has nothing to do with GPRO's internal advantage calculations! Don't get it mixed up, this is just used to help provide some credit for intermediate actions where immediate action rewards aren't available (as well as selecting the next canonical action). Supplementing the actually winning trajectory scores (as in, the canonical trajectory) with the final outcome and a discount factor to assign credit to earlier actions would be an obvious improvement, which has been left off to keep things a little simpler (and will be explored more in another environment with longer trajectories and more sparse rewards where it might matter more to training)
* **Choosing the Path (`select_best_index`)**: The `select_best_index` function is then used to pick the alternative with the highest calculated advantage. This chosen alternative's action is what is actually "played" in the environment, advancing the episode to the next state `s_{t+1}`. The other `G-1` alternatives serve as counterfactual data for training. So, we end up with a "canonical" trajectory through the environment. For more comprehensive exploration of alternatives, we'd need to use some more comprehensive form of search like MCTS, which is overkill for something like Blackjack (but we'll demo in some other more complex environments to be added)