diff --git a/environments/community/mcp_tool_calling/GRPO_README.md b/environments/community/mcp_tool_calling/GRPO_README.md
index f176c75f..0a7b14f5 100644
--- a/environments/community/mcp_tool_calling/GRPO_README.md
+++ b/environments/community/mcp_tool_calling/GRPO_README.md
@@ -100,77 +100,3 @@ atropos-grpo \
 - If vLLM health checks time out, inspect `vllm.log`, `trainer.log`, and `env.log`.
 - If targeted shared-layer runs lose gradients, ensure non-reentrant checkpointing is enabled in shared mode.
 - If environment workers time out at 600s, reduce env concurrency (`--env.max_num_workers_per_node`) and batch pressure.
-# GRPO Example Trainer
-
-This directory contains an example script (`grpo.py`) demonstrating how to integrate a custom training loop with the Atropos API for reinforcement learning using the GRPO (Group Relative Policy Optimization) algorithm.
-
-**Note: Example trainer does not support multimodal training out of the box. As other trainers add support for Atropos, we will list them in the main readme, some of which may support multimodal RL - please check the main repo readme for any updates.**
-
-This example uses `vLLM` for efficient inference during the (simulated) data generation phase and `transformers` for the training phase.
-
-**Note:** This script is intended as a *reference example* for API integration and basic training setup. It is not optimized for large-scale, efficient training.
-
-## Prerequisites
-
-1. **Python:** Python 3.8 or higher is recommended.
-2. **Atropos API Server:** The Atropos API server must be running and accessible (defaults to `http://localhost:8000` in the script).
-3. **Python Packages:** You need to install the required Python libraries:
-    * `torch` (with CUDA support recommended)
-    * `transformers`
-    * `vllm`
-    * `pydantic`
-    * `numpy`
-    * `requests`
-    * `tenacity`
-    * `wandb` (optional, for logging)
-
-## Setup
-
-1. **Clone the Repository:** Ensure you have the repository containing this example.
-2. **Install Dependencies:** `pip install -r requirements.txt`
-3. **Ensure Atropos API is Running:** `run-api` in a new window
-4. **Run an env:** `python environments/gsm8k_server.py serve --slurm False`
-
-## Configuration
-
-The training configuration is managed within the `grpo.py` script using the `TrainingConfig` Pydantic model (found near the top of the file).
-
-Key parameters you might want to adjust include:
-
-* `model_name`: The Hugging Face model identifier to use for training (e.g., `"gpt2"`, `"Qwen/Qwen2.5-1.5B-Instruct"`).
-* `training_steps`: The total number of optimization steps to perform.
-* `batch_size` / `gradient_accumulation_steps`: Control the effective batch size.
-* `lr`: Learning rate.
-* `save_path`: Directory where model checkpoints will be saved.
-* `vllm_port`: The port used by the vLLM server instance launched by this script.
-* `vllm_restart_interval`: How often (in steps) to save a checkpoint and restart the vLLM server with the new weights.
-* `use_wandb`: Set to `True` to enable logging to Weights & Biases.
-* `wandb_project`: Your W&B project name (required if `use_wandb=True`).
-* `wandb_group`: Optional W&B group name.
-
-**API Endpoints:** The script currently assumes the Atropos API is available at `http://localhost:8000/register` and `http://localhost:8000/batch`. If your API runs elsewhere, you'll need to modify the `register_trainer` and `get_batch` functions accordingly.
-
-## Running the Example
-
-Once the prerequisites are met and configuration is set:
-
-1. Navigate to the root directory of the project in your terminal.
-2. Run the script:
-
-    ```bash
-    python example_trainer/grpo.py
-    ```
-
-## Output
-
-* **Logs:** Training progress, loss, logp, and vLLM status will be printed to the console.
-* **Checkpoints:** Model checkpoints will be saved periodically in the directory specified by `save_path` (default: `./trained_model_checkpoints`). A `final_model` directory will be created upon completion.
-* **WandB:** If `use_wandb` is `True`, logs will be sent to Weights & Biases. A link to the run page will be printed in the console.
-* `temp.json`: Contains the raw data from the last fetched batch (used for debugging/manual inspection).
-
-```bash
-# Install dependencies
-pip install -e ".[example_trainer]"
-
-# Run the trainer directly (basic test)
-python example_trainer/grpo.py
diff --git a/example_trainer/README.md b/example_trainer/README.md
index 1ad3944b..d5e19643 100644
--- a/example_trainer/README.md
+++ b/example_trainer/README.md
@@ -169,7 +169,7 @@ python -m example_trainer.grpo \
 ### Startup Order
 
 ```bash
-# CRITICAL: Follow this exact order!
+# Follow this startup order
 
 # 1. Start API first
 run-api --port 8002
@@ -306,7 +306,7 @@ environment uses the `/generate` path and includes token-level
 
 ### 2. Clipping Is Essential
 
-**CRITICAL:** Keep clipping enabled to avoid unstable policy updates:
+Keep clipping enabled to avoid unstable policy updates:
 
 ```bash
 --clip-eps 0.2  # Limits importance sampling ratio to [0.8, 1.2]
@@ -869,7 +869,7 @@ If your model has `N` layers:
   "tokens": [[tok1, tok2, ...], ...],            # group_size sequences
   "masks": [[mask1, mask2, ...], ...],           # -100 for prompt, token_id for generated
   "scores": [score1, score2, ...],               # rewards
-  "inference_logprobs": [[lp1, lp2, ...], ...],  # CRITICAL for GRPO!
+  "inference_logprobs": [[lp1, lp2, ...], ...],  # required for this GRPO trainer
   "generation_params": {"temperature": 1.0},
   ...
 }
diff --git a/example_trainer/training.py b/example_trainer/training.py
index fadcfcb5..92e18f2e 100644
--- a/example_trainer/training.py
+++ b/example_trainer/training.py
@@ -74,8 +74,8 @@ def compute_grpo_loss(
     """
     Compute GRPO (Group Relative Policy Optimization) loss for a single micro-batch.
 
-    This implements proper GRPO/PPO with:
-    - Importance sampling ratio: policy(a|s) / policy_old(a|s)
+    This implements GRPO/PPO-style clipped ratio training with:
+    - Importance sampling ratio from current logprobs vs rollout inference_logprobs
     - PPO-style clipping to prevent large updates
 
     The loss encourages the model to:
@@ -89,7 +89,7 @@ def compute_grpo_loss(
         advantages: Advantage values [batch, 1]
         temperatures: Temperature values [batch, 1, 1]
         gradient_accumulation_steps: Number of accumulation steps (for scaling)
-        inference_logprobs: Logprobs from inference (π_old), aligned with labels [batch, seq_len]
+        inference_logprobs: Rollout logprobs from inference, aligned with labels [batch, seq_len]
         clip_eps: PPO clipping epsilon. Clips ratio to [1-eps, 1+eps]
 
     Returns:
@@ -163,7 +163,8 @@ def compute_grpo_loss(
                 f" [DEBUG] Logprob gap: ref={ref_at_generated:.3f}, train={train_at_generated:.3f}"
             )
 
-    # Compute importance sampling ratio: policy(a|s) / policy_old(a|s) = exp(log policy - log policy_old)
+    # Compute importance ratio from current training logprobs and rollout inference_logprobs.
+    # ratio = exp(current_logprob - rollout_inference_logprob)
     log_ratio = logp_per_token - ref_logprobs
     ratio = torch.exp(log_ratio)
 
@@ -217,7 +218,7 @@ def compute_grpo_loss(
             " 2. Ensure vLLM is returning logprobs in /generate response\n"
             " 3. Check that gsm8k_server is configured correctly\n"
             "\n"
-            "Without inference logprobs, training will cause reward hacking."
+            "This trainer path requires inference_logprobs and aborts without them."
         )
 
    # === Compute Additional Metrics ===
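For readers skimming the patch, the sketch below illustrates the clipped-ratio objective that the updated `compute_grpo_loss` docstring and comments describe: the importance ratio is `exp(current_logprob - rollout_inference_logprob)`, clipped to `[1-eps, 1+eps]` (with `--clip-eps 0.2`, that is `[0.8, 1.2]`). It is a minimal illustration, not the repository's implementation; the function name, argument shapes, and masking convention are assumptions.

```python
import torch


def clipped_ratio_loss(
    logp_per_token: torch.Tensor,      # [batch, seq_len] current-policy logprobs of the sampled tokens
    inference_logprobs: torch.Tensor,  # [batch, seq_len] rollout logprobs returned at generation time
    advantages: torch.Tensor,          # [batch, 1] advantage per sequence, broadcast over tokens
    token_mask: torch.Tensor,          # [batch, seq_len] 1.0 for generated tokens, 0.0 for prompt/padding
    clip_eps: float = 0.2,
) -> torch.Tensor:
    """PPO-style clipped surrogate on the ratio exp(current logprob - rollout logprob)."""
    log_ratio = logp_per_token - inference_logprobs
    ratio = torch.exp(log_ratio)

    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Pessimistic (minimum) objective, negated to form a loss, averaged over generated tokens only.
    per_token_loss = -torch.min(unclipped, clipped)
    return (per_token_loss * token_mask).sum() / token_mask.sum().clamp_min(1.0)
```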
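As a usage sketch under the same assumptions, a batch item in the format shown in the `example_trainer/README.md` hunk could be mapped onto those tensors roughly as follows; the group-mean centering of scores is one common GRPO choice and is not necessarily what `grpo.py` computes.

```python
import torch

# Hypothetical batch item following the data format shown in the README hunk above.
item = {
    "tokens": [[101, 42, 7, 9], [101, 42, 8, 2]],
    "masks": [[-100, -100, 7, 9], [-100, -100, 8, 2]],  # -100 marks prompt positions
    "scores": [1.0, 0.0],
    "inference_logprobs": [[0.0, 0.0, -0.7, -1.2], [0.0, 0.0, -0.9, -2.1]],
}

masks = torch.tensor(item["masks"])
token_mask = (masks != -100).float()                     # 1.0 only where tokens were generated
inference_logprobs = torch.tensor(item["inference_logprobs"])

scores = torch.tensor(item["scores"])
advantages = (scores - scores.mean()).unsqueeze(-1)      # group-relative advantage, [group_size, 1]

# logp_per_token would come from a forward pass of the training model over item["tokens"]; with it:
# loss = clipped_ratio_loss(logp_per_token, inference_logprobs, advantages, token_mask)
```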