## Quick Start

### Option A: Unified End-to-End Launcher

```bash
python -m environments.dataset_environment.launch_local_dataset_run
```

This single command spins up:

1. The Trajectory Handler API server (`uvicorn atroposlib.api.server:app`)
2. The DatasetEnv in serve mode (connected to the API)
3. The example GRPO trainer (via `example_trainer.grpo.train`)
### Option B: Manual Steps

1. **Start the API server**

   ```bash
   uvicorn atroposlib.api.server:app --host 127.0.0.1 --port 8000
   ```
2. **Launch the Dataset Environment**

   - **Using CLI flags** (these override any config file settings):

     ```bash
     python -m environments.dataset_environment.dataset_env serve \
       --group_size 4 \
       --max_num_workers 2 \
       --rollout_server_url http://127.0.0.1:8000 \
       --tokenizer_name Qwen/Qwen2.5-1.5B-Instruct \
       --use_wandb --wandb_name dataset_env_local_test \
       --max_token_length 512 \
       --ensure_scores_are_not_same \
       --dataset_name HuggingFaceH4/testing_self_instruct_process_essays \
       --split train[:100] \
       --prompt_field prompt --answer_field answer \
       --reward_functions length \
       --max_tokens 128 --temperature 0.7 \
       --model_name Qwen/Qwen2.5-1.5B-Instruct \
       --base_url http://127.0.0.1:9001 \
       --slurm --testing
     ```
   - **Using YAML config files**:

     Place a dataset config under `environments/dataset_environment/configs/<name>.yaml`:

     ```yaml
     # Example: environments/dataset_environment/configs/gsm8k.yaml
     dataset:
       dataset_name: "gsm8k"
       dataset_config: "main"
       split: "train"
       prompt_field: "question"
       answer_field: "answer"
       system_prompt: "You are a mathematical problem solver..."

     generation:
       temperature: 0.7
       top_p: 0.95

     reward_functions:
       - type: "accuracy"
         weight: 1.0
     ```

     Then run the local test server:

     ```bash
     # Will look for environments/dataset_environment/configs/gsm8k.yaml
     python environments/dataset_environment/dataset_local_server.py --config gsm8k
     ```
3. **Launch the Trainer**

   ```bash
   python -m example_trainer.grpo
   ```
## Configuration Files Directory

Dataset-environment-specific configurations now live in `environments/dataset_environment/configs/`.
Shared configurations (such as agents) might still reside in the project's root `configs/` directory.

- `environments/dataset_environment/configs/` holds dataset-specific configs (used by `dataset_local_server.py`).
- You can reference any `<name>.yaml` within this directory via the `--config` flag of the local server script.
## Reward Function Registry & Customization

Reward functions are managed by a centralized registry (see `atroposlib/envs/reward_fns/reward_function.py`). Built-in types include:

- `accuracy`: exact match against ground truth (`tolerance`, `split_on_think_tag`)
- `format`: checks for specific tags (`preferred_tags`)
- `reasoning_steps`: quality of step-by-step reasoning
- `repetition_penalty`: penalizes repetition
- `cosine_scaled`: semantic similarity scaled from embeddings
- `crossword_format`: crossword-specific penalty
- `r1`: combined accuracy + format

To preview all available functions:

```python
from atroposlib.envs.reward_fns import registry

print(registry.list())
```
### Creating Custom Reward Functions

1. Create a new file under `atroposlib/envs/reward_fns/my_reward.py`.
2. Subclass `RewardFunction` and register it:

   ```python
   from atroposlib.envs.reward_fns import registry, RewardFunction


   @registry.register
   class MyCustomReward(RewardFunction):
       def __init__(self, custom_param=1.0, weight=1.0, **kwargs):
           super().__init__(weight=weight, **kwargs)
           self.custom_param = custom_param

       def compute(self, completions, **kwargs):
           return [1.0 if "good answer" in self.get_content(c) else 0.0 for c in completions]
   ```

3. Reference it in your YAML config:

   ```yaml
   reward_functions:
     - type: "my_custom"
       weight: 1.0
       params:
         custom_param: 2.0
   ```
### Dataset Environments

Dataset environments load data from Hugging Face datasets and evaluate LLM responses against ground truth. They're ideal for academic benchmarks and datasets with clear evaluation criteria.

Example configuration:
```yaml
dataset:
  dataset_name: "gsm8k"
  dataset_config: "main"
  split: "train"
  prompt_field: "question"
  answer_field: "answer"
  system_prompt: "You are a mathematical problem solver..."

reward_functions:
  - type: "accuracy"
    weight: 1.0
```
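To sanity-check that the configured fields exist before wiring up the environment, you can peek at the dataset directly with the `datasets` library (a quick illustration only; the environment does this loading for you):

```python
from datasets import load_dataset

# gsm8k's "main" config exposes "question" and "answer" columns.
ds = load_dataset("gsm8k", "main", split="train")
example = ds[0]
print(example["question"])  # maps to prompt_field
print(example["answer"])    # maps to answer_field
```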
## Reward Functions

The system features a flexible reward function architecture for evaluating model outputs.

### Basic Usage

In your environment config, specify reward functions:
```yaml
reward_functions:
  - type: "accuracy"
    weight: 1.0
  - type: "format"
    weight: 0.5
```
### Combining Reward Functions

Combine multiple reward functions with weights:

```yaml
reward_functions:
  - type: "combined"
    params:
      normalization: "sum"
      rewards:
        - type: "accuracy"
          weight: 1.5
        - type: "format"
          weight: 0.5
```
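Conceptually, each child reward is computed per completion and then aggregated by weight. Here is a minimal sketch of how a weighted `sum` combination behaves (an illustration of the assumed semantics, not the library's actual implementation; `combine_rewards` is a hypothetical helper):

```python
def combine_rewards(scores_per_fn, weights, normalization="sum"):
    # scores_per_fn: one list of per-completion scores for each child reward.
    # zip(*scores_per_fn) regroups the scores by completion.
    totals = [
        sum(w * s for w, s in zip(weights, per_completion))
        for per_completion in zip(*scores_per_fn)
    ]
    if normalization == "sum":
        # Normalize by total weight so results stay in the children's range.
        totals = [t / sum(weights) for t in totals]
    return totals


# accuracy scores [1.0, 0.0] weighted 1.5, format scores [1.0, 1.0] weighted 0.5
print(combine_rewards([[1.0, 0.0], [1.0, 1.0]], [1.5, 0.5]))  # [1.0, 0.25]
```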
### Available Reward Functions

#### `accuracy`

Evaluates whether completions match ground-truth answers.

```yaml
type: "accuracy"
weight: 1.0
params:
  tolerance: 1e-6
  split_on_think_tag: true
  max_boxed_threshold: 6
```
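As a rough sketch of what a numeric `tolerance` implies for exact-match scoring (an illustration of the assumed semantics; the library's actual answer extraction is more involved):

```python
def numeric_match(predicted: str, ground_truth: str, tolerance: float = 1e-6) -> float:
    # Numeric answers match within the tolerance; otherwise fall back to string equality.
    try:
        return 1.0 if abs(float(predicted) - float(ground_truth)) <= tolerance else 0.0
    except ValueError:
        return 1.0 if predicted.strip() == ground_truth.strip() else 0.0


print(numeric_match("42.0000001", "42"))  # 1.0: difference is within 1e-6
print(numeric_match("42.1", "42"))        # 0.0
```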
#### `format`

Checks whether completions include specific XML-style tags.

```yaml
type: "format"
weight: 1.0
params:
  preferred_tags: ["think", "reasoning"]
  require_all_tags: false
  case_sensitive: false
```
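The tag check amounts to looking for matched tag pairs in the completion; a minimal sketch under those assumed semantics (`has_tag` is a hypothetical helper, not library API):

```python
import re


def has_tag(text: str, tag: str, case_sensitive: bool = False) -> bool:
    # Matches an opening/closing pair like <think>...</think>, spanning newlines.
    flags = re.DOTALL if case_sensitive else re.DOTALL | re.IGNORECASE
    return re.search(rf"<{tag}>.*?</{tag}>", text, flags) is not None


text = "<think>First, add 2 and 2.</think> The answer is 4."
print(has_tag(text, "think"))      # True
print(has_tag(text, "reasoning"))  # False, acceptable when require_all_tags is false
```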
#### `reasoning_steps`

Evaluates the quality of step-by-step reasoning.

```yaml
type: "reasoning_steps"
weight: 1.0
params:
  min_words: 10
  min_steps: 3
  base_score: 0.1
```
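One plausible way to count reasoning steps is to look for enumerated lines; a hypothetical heuristic sketch (the library's actual step detection may differ):

```python
import re


def count_steps(text: str) -> int:
    # Counts lines beginning with "1.", "2)", "Step 3", etc.
    return len(re.findall(r"(?m)^\s*(?:\d+[.)]|Step\s+\d+)", text))


response = "Step 1: rewrite the equation.\nStep 2: isolate x.\nStep 3: check the result."
print(count_steps(response))  # 3, which satisfies min_steps: 3
```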
#### `repetition_penalty`

Penalizes repetitive content.

```yaml
type: "repetition_penalty"
weight: 0.5
params:
  threshold: 0.05
  min_words: 10
  min_sentences: 2
```
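A common way to quantify repetition is the fraction of duplicated word n-grams; a hypothetical sketch of that idea (the library's metric may differ):

```python
def repetition_ratio(text: str, n: int = 3) -> float:
    # Fraction of word trigrams that are repeats; 0.0 means no repetition.
    words = text.lower().split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)


print(repetition_ratio("the cat sat on the mat"))                     # 0.0
print(repetition_ratio("the cat sat the cat sat the cat sat again"))  # 0.5, above threshold: 0.05
```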
#### `cosine_scaled`

Measures semantic similarity between completions and solutions.

```yaml
type: "cosine_scaled"
weight: 0.8
params:
  model_name: "sentence-transformers/all-MiniLM-L6-v2"
  scale_factor: 1.0
  min_reward: -1.0
  max_reward: 1.0
```
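Under the hood this is embedding cosine similarity; a minimal sketch with `sentence-transformers` showing the scale-and-clamp idea (an illustration of the assumed behavior, not the library's exact code):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(["the answer is 4", "4 is the answer"], convert_to_tensor=True)

similarity = util.cos_sim(embeddings[0], embeddings[1]).item()  # in [-1, 1]
reward = max(-1.0, min(1.0, similarity * 1.0))  # scale_factor, clamped to [min_reward, max_reward]
print(reward)
```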
#### `crossword_format`

Game-specific reward for crossword puzzles.

```yaml
type: "crossword_format"
weight: 1.0
params:
  reward_value: 1.0
  penalize_invalid_chars: true
```
#### `r1`

Combined reward using both reasoning format and accuracy.

```yaml
type: "r1"
weight: 1.0
params:
  format_weight: 0.5
  accuracy_weight: 1.0
```
### Creating Custom Reward Functions

To create a custom reward function:

1. Create a new file in `atroposlib/envs/reward_fns/my_reward.py`

2. Define your reward function class:

   ```python
   from typing import Any, List

   from atroposlib.envs.reward_fns import registry, RewardFunction


   @registry.register
   class MyCustomReward(RewardFunction):
       def __init__(self, custom_param=1.0, weight=1.0, **kwargs):
           super().__init__(weight=weight, **kwargs)
           self.custom_param = custom_param

       def compute(self, completions: List[Any], **kwargs) -> List[float]:
           rewards = []
           for completion in completions:
               content = self.get_content(completion)
               # Implement your reward logic
               reward = 1.0 if "good answer" in content else 0.0
               rewards.append(reward)
           return rewards
   ```
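   As a quick smoke test you can call the class directly; this minimal sketch assumes `get_content` passes plain strings through unchanged (an assumption about the base class, not confirmed behavior):

   ```python
   # Hypothetical smoke test for the class defined above.
   reward_fn = MyCustomReward(custom_param=2.0)
   print(reward_fn.compute(["a good answer", "something else"]))  # expected: [1.0, 0.0]
   ```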
3. Use it in your config:

   ```yaml
   reward_functions:
     - type: "my_custom"
       weight: 1.0
       params:
         custom_param: 2.0
   ```
### Dataset Environment Debugger

The dataset environment debugger allows you to run a dataset environment locally with a Hugging Face model, providing enhanced visibility into reward function performance and model responses.

```bash
# Run with default settings
python -m atroposlib.cli.dataset_env_debugger --env gsm8k_debug --agent nous_hermes_8b

# List available environments and agents
python -m atroposlib.cli.dataset_env_debugger --list-configs

# Interactive mode with debugging information
python -m atroposlib.cli.dataset_env_debugger --env gsm8k_debug --agent nous_hermes_8b --interactive --debug

# Run with custom generation parameters
python -m atroposlib.cli.dataset_env_debugger --env gsm8k_debug --agent nous_hermes_8b --temperature 0.5 --top-p 0.95

# Run with detailed logging
python -m atroposlib.cli.dataset_env_debugger --env gsm8k_debug --agent nous_hermes_8b --verbose
```
## Environment Overview

This environment demonstrates how to use a standard dataset (e.g., from Hugging Face Datasets) as a source for generating prompts and evaluating LLM responses. It allows for testing and training models on established benchmarks or custom datasets where prompts and expected answers/ground truth are available.

**Demonstrates:**

- Loading and processing data from Hugging Face Datasets.
- Configuring system prompts and prompt/answer fields.
- Applying various reward functions (accuracy, format, semantic similarity, etc.) to evaluate generations.
- Integrating with the `atroposlib` framework for data collection and scoring.

**Training Goal:**

- To train LLMs to follow instructions and generate responses that align with the format and content specified by the dataset and reward functions.
- To improve performance on specific tasks defined by datasets (e.g., math problem solving, code generation, question answering).
## Local Testing

To test this environment locally, you can run the provided local server. This server simulates the interaction flow without needing the full distributed setup.

First, ensure you have the necessary dependencies installed.

Then, run the local server script from the root of the repository:

```bash
python environments/dataset_environment/dataset_local_server.py --config-path path/to/your/dataset_config.yaml
```

Replace `path/to/your/dataset_config.yaml` with the actual path to your environment configuration file (e.g., `configs/envs/gsm8k.yaml`). The server will load the dataset specified in the config, process items, and simulate generating responses.