linting

2026-04-24 17:04:55 +00:00 · 2025-05-14 14:01:31 -07:00 · 2025-05-14 14:01:31 -07:00 · 67cfd961c5
commit 67cfd961c5
parent 826de9e283
6 changed files with 111 additions and 85 deletions
--- a/environments/game_environments/gymnasium/README.md
+++ b/environments/game_environments/gymnasium/README.md
@@ -33,7 +33,7 @@ Key components of this approach:
        \[ A(s_t, a_i) = R_i + \gamma V(s'_{i}) - V(s_t) \]
        (In `_collect_trajectory`, `gamma` is effectively 1, and \(R_i\) is represented by `alt_combined_rewards[i]`, \(V(s'_{i})\) by `alt_value_next[i]`, and \(V(s_t)\) by `value_t`).

-    Note: This has nothing to do with GRPO's internal advantage calculations! Don't get it mixed up, this is just used to help provide some credit for intermediate actions where immediate action rewards aren't available (as well as selecting the next canonical action). Supplementing the actually winning trajectory scores (as in, the canonical trajectory) with the final outcome and a discount factor to assign credit to earlier actions would be an obvious improvement, which has been left off to keep things a little simpler (and will be explored more in another environment with longer trajectories and more sparse rewards where it might matter more to training) 
+    Note: This has nothing to do with GRPO's internal advantage calculations! Don't get it mixed up, this is just used to help provide some credit for intermediate actions where immediate action rewards aren't available (as well as selecting the next canonical action). Supplementing the actually winning trajectory scores (as in, the canonical trajectory) with the final outcome and a discount factor to assign credit to earlier actions would be an obvious improvement, which has been left off to keep things a little simpler (and will be explored more in another environment with longer trajectories and more sparse rewards where it might matter more to training)

    *   **Choosing the Path (`select_best_index`)**: The `select_best_index` function is then used to pick the alternative with the highest calculated advantage. This chosen alternative's action is what is actually "played" in the environment, advancing the episode to the next state `s_{t+1}`. The other `G-1` alternatives serve as counterfactual data for training. So, we end up with a "canonical" trajectory through the environment. For more comprehensive exploration of alternatives, we'd need to use some more comprehensive form of search like MCTS, which is overkill for something like Blackjack (but we'll demo in some other more complex environments to be added)

@@ -62,7 +62,7 @@ The GRPO trainer typically computes a loss using these advantages. For example,
 \[ L = -\sum_{j=1}^{M} \sum_{k=1}^{K_j} \left( \frac{\pi_{\theta}(a_{jk} | s_j)}{\pi_{\theta_{\text{old}}}(a_{jk} | s_j)} A_{jk}^{\text{GRPO}} \right) \]
 (often with a KL divergence penalty for stability, ensuring the new policy \(\pi_{\theta}\) doesn't deviate too drastically from the old policy \(\pi_{\theta_{\text{old}}}\)). The `ratio = torch.exp(logp - logp.detach())` and `loss = -reward * ratio` (where `reward` is the \(A_{jk}^{\text{GRPO}}\) advantage) in a typical trainer snippet would align with this principle.

-The `blackjack_env_thinking` environment's design is compatible with GRPO's core requirements for input data BUT allowing it to be used across long trajectories. We don't get a nice, well defined reward at every step of every environment - but we want to keep that nice, objective, outcome-oriented RLVR-style reward structure, even in reward-sparse environments. 
+The `blackjack_env_thinking` environment's design is compatible with GRPO's core requirements for input data BUT allowing it to be used across long trajectories. We don't get a nice, well defined reward at every step of every environment - but we want to keep that nice, objective, outcome-oriented RLVR-style reward structure, even in reward-sparse environments.

 1.  **Alternative Generation**: From a state \(s_t\), the environment generates `G` alternative continuations (thoughts and actions \(a_1, ..., a_G\)).
 2.  **Value-Informed Scoring (within the environment)**: For each alternative \(a_i\), the environment itself calculates a "score" using its internal value estimation: \(S_i = R_i + \gamma V_{\text{env}}(s'_{i}) - V_{\text{env}}(s_t)\). This score, \(S_i\), represents a local, value-informed assessment of that alternative's quality.
@@ -102,4 +102,3 @@ In the `blackjack_env_no_thinking` environment:
 *   The entire sequence of (observation, action, LLM response) usually fits within the model's `seq_len`. Blackjack is at most a few turns, so this is ok if you JUST want to train on actions, not additional long chains of thought.
 *   `collect_trajectory` returns a single `ScoredDataItem` representing the full episode. The "score" is simply the final game outcome (e.g., +1 for a win) and some bonuses for formatting and correct tool calling.
 *   The trainer can then process these entire episodes using the normal GRPO method (ie, we're just sending the full alternative trajectories and their scores to be compared, similar to the single-step bandit problems people are commonly using for RLVR). The complexity of per-step alternative generation for windowing and local value estimation isn't needed for fitting within `seq_len`.
-