## Quick Start

### Option A: Unified End-to-End Launcher

```bash
python -m environments.dataset_environment.launch_local_dataset_run
```

This single command spins up:

1. The Trajectory Handler API server (`uvicorn atroposlib.api.server:app`)
2. The DatasetEnv in serve mode (connected to the API)
3. The example GRPO trainer (via `example_trainer.grpo.train`)

### Option B: Manual Steps

1. **Start the API server**

   ```bash
   uvicorn atroposlib.api.server:app --host 127.0.0.1 --port 8000
   ```

2. **Launch the Dataset Environment**

   - **Using CLI flags** (these flags override any config file settings):

     ```bash
     python -m environments.dataset_environment.dataset_env serve \
       --group_size 4 \
       --max_num_workers 2 \
       --rollout_server_url http://127.0.0.1:8000 \
       --tokenizer_name Qwen/Qwen2.5-1.5B-Instruct \
       --use_wandb --wandb_name dataset_env_local_test \
       --max_token_length 512 \
       --ensure_scores_are_not_same \
       --dataset_name HuggingFaceH4/testing_self_instruct_process_essays \
       --split train[:100] \
       --prompt_field prompt --answer_field answer \
       --reward_functions length \
       --max_tokens 128 --temperature 0.7 \
       --model_name Qwen/Qwen2.5-1.5B-Instruct \
       --base_url http://127.0.0.1:9001 \
       --slurm --testing
     ```

   - **Using YAML config files**: Place a dataset config under `environments/dataset_environment/configs/<name>.yaml`:

     ```yaml
     # Example: environments/dataset_environment/configs/gsm8k.yaml
     dataset:
       dataset_name: "gsm8k"
       dataset_config: "main"
       split: "train"
       prompt_field: "question"
       answer_field: "answer"
       system_prompt: "You are a mathematical problem solver..."
     generation:
       temperature: 0.7
       top_p: 0.95
     reward_functions:
       - type: "accuracy"
         weight: 1.0
     ```

     Then run the local test server:

     ```bash
     # Will look for environments/dataset_environment/configs/gsm8k.yaml
     python environments/dataset_environment/dataset_local_server.py --config gsm8k
     ```

3. **Launch the Trainer**

   ```bash
   python -m example_trainer.grpo
   ```

## Configuration Files Directory

Dataset-environment-specific configurations live in `environments/dataset_environment/configs/`. Shared configurations (such as agent configs) may still reside in the project's root `configs/` directory.

- `environments/dataset_environment/configs/` holds dataset-specific configs (used by `dataset_local_server.py`).
- Any `.yaml` in this directory can be referenced via the `--config` flag of the local server script.

## Reward Function Registry & Customization

Reward functions are managed by a centralized registry (see `atroposlib/envs/reward_fns/reward_function.py`). Built-in types include:

- `accuracy`: exact match against ground truth (`tolerance`, `split_on_think_tag`)
- `format`: checks for specific tags (`preferred_tags`)
- `reasoning_steps`: quality of step-by-step reasoning
- `repetition_penalty`: penalizes repetition
- `cosine_scaled`: semantic similarity scaled from embeddings
- `crossword_format`: crossword-specific penalty
- `r1`: combined accuracy + format

To preview all available functions:

```python
from atroposlib.envs.reward_fns import registry

print(registry.list())
```

To add your own reward function, see "Creating Custom Reward Functions" below.
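Under the hood this is a conventional decorator-based registry. The following is a minimal self-contained sketch of the pattern for orientation only; the real class in `reward_function.py` differs in its details, and the name-derivation rule (`MyCustomReward` becomes `my_custom`) is an assumption inferred from the examples in this README:

```python
import re
from typing import Dict, List, Type


class ToyRegistry:
    """Illustrative decorator-based registry (not atroposlib's actual class)."""

    def __init__(self) -> None:
        self._classes: Dict[str, Type] = {}

    def register(self, cls: Type) -> Type:
        # Assumed naming rule: strip a trailing "Reward" and snake_case
        # the rest, e.g. MyCustomReward -> "my_custom".
        stem = re.sub(r"Reward$", "", cls.__name__)
        key = re.sub(r"(?<!^)(?=[A-Z])", "_", stem).lower()
        self._classes[key] = cls
        return cls

    def list(self) -> List[str]:
        return sorted(self._classes)


toy_registry = ToyRegistry()


@toy_registry.register
class MyCustomReward:
    pass


print(toy_registry.list())  # ['my_custom']
```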
## Dataset Environments

Dataset environments load data from HuggingFace datasets and evaluate LLM responses against ground truth. They're ideal for academic benchmarks and datasets with clear evaluation criteria. An example configuration (for GSM8K) is shown under Option B in the Quick Start above.

## Reward Functions

The system features a flexible reward function architecture for evaluating model outputs.

### Basic Usage

In your environment config, specify reward functions:

```yaml
reward_functions:
  - type: "accuracy"
    weight: 1.0
  - type: "format"
    weight: 0.5
```

### Combining Reward Functions

Combine multiple reward functions with weights:

```yaml
reward_functions:
  - type: "combined"
    params:
      normalization: "sum"
      rewards:
        - type: "accuracy"
          weight: 1.5
        - type: "format"
          weight: 0.5
```

### Available Reward Functions

#### `accuracy`

Evaluates whether completions match ground-truth answers.

```yaml
type: "accuracy"
weight: 1.0
params:
  tolerance: 1e-6
  split_on_think_tag: true
  max_boxed_threshold: 6
```

#### `format`

Checks whether completions include specific XML-style tags.

```yaml
type: "format"
weight: 1.0
params:
  preferred_tags: ["think", "reasoning"]
  require_all_tags: false
  case_sensitive: false
```

#### `reasoning_steps`

Evaluates step-by-step reasoning quality.

```yaml
type: "reasoning_steps"
weight: 1.0
params:
  min_words: 10
  min_steps: 3
  base_score: 0.1
```

#### `repetition_penalty`

Penalizes repetitive content.

```yaml
type: "repetition_penalty"
weight: 0.5
params:
  threshold: 0.05
  min_words: 10
  min_sentences: 2
```

#### `cosine_scaled`

Measures semantic similarity between completions and solutions.

```yaml
type: "cosine_scaled"
weight: 0.8
params:
  model_name: "sentence-transformers/all-MiniLM-L6-v2"
  scale_factor: 1.0
  min_reward: -1.0
  max_reward: 1.0
```

#### `crossword_format`

Game-specific reward for crossword puzzles.

```yaml
type: "crossword_format"
weight: 1.0
params:
  reward_value: 1.0
  penalize_invalid_chars: true
```

#### `r1`

Combined reward using both reasoning format and accuracy.

```yaml
type: "r1"
weight: 1.0
params:
  format_weight: 0.5
  accuracy_weight: 1.0
```
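Each reward function's score is scaled by its `weight`, and the results are summed into a single score per completion (the `combined` type additionally applies the configured `normalization`). Below is a minimal sketch of that weighted sum, using a hypothetical helper rather than atroposlib's actual aggregation code:

```python
from typing import List


def combine_rewards(per_fn_scores: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted sum of reward scores: one combined score per completion.

    per_fn_scores[i][j] is reward function i's score for completion j.
    """
    num_completions = len(per_fn_scores[0])
    return [
        sum(weight * scores[j] for scores, weight in zip(per_fn_scores, weights))
        for j in range(num_completions)
    ]


# accuracy scores (weight 1.5) and format scores (weight 0.5)
# across two completions:
print(combine_rewards([[1.0, 0.0], [1.0, 1.0]], [1.5, 0.5]))  # [2.0, 0.5]
```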
### Creating Custom Reward Functions

To create a custom reward function:

1. Create a new file in `atroposlib/envs/reward_fns/my_reward.py`.
2. Define your reward function class, subclassing `RewardFunction` and registering it:

   ```python
   from typing import Any, List

   from atroposlib.envs.reward_fns import registry, RewardFunction


   @registry.register
   class MyCustomReward(RewardFunction):
       def __init__(self, custom_param=1.0, weight=1.0, **kwargs):
           super().__init__(weight=weight, **kwargs)
           self.custom_param = custom_param

       def compute(self, completions: List[Any], **kwargs) -> List[float]:
           rewards = []
           for completion in completions:
               content = self.get_content(completion)
               # Implement your reward logic
               reward = 1.0 if "good answer" in content else 0.0
               rewards.append(reward)
           return rewards
   ```

3. Use it in your config:

   ```yaml
   reward_functions:
     - type: "my_custom"
       weight: 1.0
       params:
         custom_param: 2.0
   ```

## Dataset Environment Debugger

The dataset environment debugger lets you run a dataset environment locally with a Hugging Face model, providing enhanced visibility into reward function performance and model responses.

```bash
# Run with default settings
python -m atroposlib.cli.dataset_env_debugger --env gsm8k_debug --agent nous_hermes_8b

# List available environments and agents
python -m atroposlib.cli.dataset_env_debugger --list-configs

# Interactive mode with debugging information
python -m atroposlib.cli.dataset_env_debugger --env gsm8k_debug --agent nous_hermes_8b --interactive --debug

# Run with custom generation parameters
python -m atroposlib.cli.dataset_env_debugger --env gsm8k_debug --agent nous_hermes_8b --temperature 0.5 --top-p 0.95

# Run with detailed logging
python -m atroposlib.cli.dataset_env_debugger --env gsm8k_debug --agent nous_hermes_8b --verbose
```

## Environment Overview

This environment demonstrates how to use a standard dataset (e.g., from Hugging Face Datasets) as a source for generating prompts and evaluating LLM responses. It allows for testing and training models on established benchmarks or on custom datasets where prompts and expected answers/ground truth are available.

**Demonstrates:**

- Loading and processing data from Hugging Face Datasets.
- Configuring system prompts and prompt/answer fields.
- Applying various reward functions (accuracy, format, semantic similarity, etc.) to evaluate generations.
- Integrating with the `atroposlib` framework for data collection and scoring.

**Training Goal:**

- Train LLMs to follow instructions and generate responses that align with the format and content specified by the dataset and reward functions.
- Improve performance on specific tasks defined by datasets (e.g., math problem solving, code generation, question answering).

## Local Testing

To test this environment locally, you can run the provided local server, which simulates the interaction flow without requiring the full distributed setup.

First, ensure you have the necessary dependencies installed. Then run the local server script from the root of the repository:

```bash
python environments/dataset_environment/dataset_local_server.py --config-path path/to/your/dataset_config.yaml
```

Replace `path/to/your/dataset_config.yaml` with the actual path to your environment configuration file (e.g., `environments/dataset_environment/configs/gsm8k.yaml`). The server will load the dataset specified in the config, process items, and simulate generating responses.
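When debugging reward behavior locally, it can also help to call a reward function directly before wiring it into a config. Below is a minimal sketch reusing the `MyCustomReward` class defined above, assuming `get_content` passes plain strings through unchanged (in practice completions may be richer message objects):

```python
from atroposlib.envs.reward_fns.my_reward import MyCustomReward

# Instantiate with the same parameters you would set in YAML.
reward_fn = MyCustomReward(custom_param=2.0, weight=1.0)

# Score two toy completions directly via compute().
scores = reward_fn.compute(["this is a good answer", "this is off-topic"])
print(scores)  # expected: [1.0, 0.0]
```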