add readme section

This commit is contained in:
teknium 2025-07-27 02:46:51 +00:00
parent 31b0c6f66d
commit a0979eb08e

---
### Arena-Hard Environment (`arena_hard_environment.py`)
A high-quality benchmark environment implementing the Arena-Hard evaluation pipeline with Claude Sonnet 4 as judge, designed to train and evaluate models against challenging real-world user queries from Chatbot Arena.
**Based on:** [Arena-Hard-Auto v0.1](https://lmsys.org/blog/2024-04-19-arena-hard/) by LMSYS ORG
- **Citation:** Li, T., Chiang, W. L., Frick, E., Dunlap, L., Zhu, B., Gonzalez, J. E., & Stoica, I. (2024). From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline. LMSYS ORG Blog.
**Key Features:**
- **Claude Sonnet 4 Judge**: Uses state-of-the-art Claude Sonnet 4 model for robust response evaluation
- **Dual-Round Judging**: Implements the Arena-Hard methodology with two judgment rounds to reduce position bias
- **Thinking Mode Support**: Full `<think></think>` tag parsing and validation for advanced reasoning
- **GPT-4 Baseline Comparison**: Evaluates model responses against high-quality GPT-4-0314 baseline responses
- **Real-World Queries**: 500 challenging prompts extracted from 200K+ user queries in Chatbot Arena
- **Comprehensive Metrics**: Win rates, category breakdowns, and Arena-Hard compatible scoring
**Input Format:**
Each training/evaluation item contains:
- `uid`: Unique identifier for prompt-baseline pairing
- `prompt`: The user query from Arena-Hard dataset
- `answer`: GPT-4-0314 baseline response (for comparison)
- `category`: Optional category classification
- `cluster`: Optional topic cluster assignment
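For illustration, a single item with the fields above might look like this (all values are hypothetical):

```python
# Hypothetical example of one training/evaluation item
# (field names follow the input format above; values are made up).
item = {
    "uid": "arena-hard-0042",          # pairs this prompt with its baseline
    "prompt": "Explain the CAP theorem with a concrete example.",
    "answer": "GPT-4-0314 baseline response text...",  # comparison target
    "category": "computer-science",    # optional
    "cluster": "distributed-systems",  # optional
}
```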
**Dataset Schema:**
- **Training Prompts**: `NousResearch/arena-hard-v1-prompts` (HuggingFace) or local JSONL
- **Training Baselines**: `NousResearch/gpt-4-0314-baseline-arenahard` (HuggingFace) or local JSONL
- **Evaluation Prompts**: Same as training by default, configurable separately
- **Evaluation Baselines**: Same as training by default, configurable separately
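Since each dataset option accepts either a HuggingFace dataset id or a local JSONL path, loading can be sketched roughly as follows (the helper name is illustrative, not the environment's actual API):

```python
import json
from pathlib import Path

def load_items(source: str, split: str = "train"):
    """Load prompt/baseline items from a local JSONL file or a
    HuggingFace Hub dataset id, mirroring the schema options above."""
    if source.endswith(".jsonl"):
        # One JSON object per line; skip blank lines.
        with Path(source).open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
    # Hub dataset ids such as "NousResearch/arena-hard-v1-prompts".
    from datasets import load_dataset  # only needed for Hub datasets
    return list(load_dataset(source, split=split))
```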
**System Prompt (Thinking Mode, built-in default):**
```
You are a deep thinking AI assistant. Before providing your response, you should think through the problem carefully. Use <think></think> tags to enclose your internal reasoning and thought process, then provide your final response after the thinking tags.
```
**System Prompt (Non-Thinking Mode):**
- Uses `custom_system_prompt` if provided, otherwise no system prompt
**Judge System Prompt:**
```
Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user's instructions and answers the user's question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by providing a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: "[[A>B]]" if assistant A is better, "[[B>A]]" if assistant B is better, and "[[A=B]]" for a tie.
```
**Evaluation Methodology:**
1. **Model Response Generation**: Generate a response to the Arena-Hard prompt using the configured temperature and token limits
2. **Thinking Validation**: If thinking mode enabled, validate exactly one `<think></think>` pair and extract content after tags
3. **Dual-Round Judging**:
- Round 1: Judge model response (A) vs GPT-4 baseline (B)
- Round 2: Judge GPT-4 baseline (A) vs model response (B)
4. **Score Combination**: Average the two judgment scores using Arena-Hard logic
5. **Arena Score Conversion**: Convert the combined score from the [-1, 1] range to the [0, 1] winrate format
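Steps 4 and 5 above can be sketched as simple arithmetic (a simplified reading of the Arena-Hard combination logic; both round scores are assumed to already be normalized to the model's point of view):

```python
def combine_rounds(round1: float, round2: float) -> float:
    """Average the two judgment rounds. Round 1 places the model as
    assistant A, round 2 swaps positions; each score is in [-1, 1]
    from the model's point of view."""
    return (round1 + round2) / 2.0

def to_winrate(score: float) -> float:
    """Map a combined score in [-1, 1] to the [0, 1] Arena-Hard
    winrate format used for evaluation metrics."""
    return (score + 1.0) / 2.0
```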
**Reward Function:**
- **Training**: Scores range from -1.0 to 1.0 based on combined judgment results
- 1.0: Model response clearly better than baseline
- 0.0: Tie between model and baseline
- -1.0: Baseline clearly better than model response
- **Invalid Thinking**: Automatic 0.0 score for malformed `<think></think>` tags
- **Evaluation**: Converted to Arena-Hard winrate format (0.0 to 1.0)
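The strict thinking check (exactly one `<think></think>` pair, answer extracted after the tags, 0.0 reward on failure) can be sketched as follows; the helper name is illustrative:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def validate_thinking(text: str):
    """Return (is_valid, final_answer). Valid means exactly one
    <think></think> pair; the answer is everything after the close tag.
    Invalid responses would receive an automatic 0.0 score."""
    if text.count("<think>") != 1 or text.count("</think>") != 1:
        return False, ""
    match = THINK_RE.search(text)
    if match is None:  # tags present but out of order
        return False, ""
    return True, text[match.end():].strip()
```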
**Configuration Options (`ArenaHardConfig`):**
**Thinking Mode:**
- `thinking_mode`: Enable `<think></think>` reasoning mode (default: False)
- `custom_thinking_prompt`: Custom thinking prompt (default: uses built-in prompt)
- `custom_system_prompt`: Additional system prompt to append (default: None)
**Judge Settings:**
- `judge_temperature`: Temperature for Claude Sonnet 4 judge (default: 0.2)
- `judge_max_tokens`: Max tokens for judge responses (default: 4096)
**Model Generation:**
- `eval_temperature`: Temperature for evaluation completions (default: 0.6)
- `rollout_temperature`: Temperature for training rollouts (default: 1.0)
- `eval_max_tokens`: Max tokens for evaluation (default: 40960)
- `train_max_tokens`: Max tokens for training (default: 16384)
**Dataset Configuration:**
- `train_prompt_dataset`: Training prompts dataset/path (default: "NousResearch/arena-hard-v1-prompts")
- `train_baseline_dataset`: Training baselines dataset/path (default: "NousResearch/gpt-4-0314-baseline-arenahard")
- `eval_prompt_dataset`: Evaluation prompts dataset/path (default: same as training)
- `eval_baseline_dataset`: Evaluation baselines dataset/path (default: same as training)
- `train_split`/`eval_split`: Dataset splits to use (default: "train")
**Reliability:**
- `max_retries`: Maximum API call retries (default: 3)
- `retry_delay`: Delay between retries in seconds (default: 1.0)
- `min_response_length`: Minimum valid response length (default: 10)
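The section below mentions retry logic with exponential backoff; combined with the `max_retries` and `retry_delay` settings above, it might look roughly like this (a generic sketch, not the environment's actual code):

```python
import time

def call_with_retries(fn, max_retries: int = 3, retry_delay: float = 1.0):
    """Call fn(), retrying on exceptions with exponential backoff:
    sleeps retry_delay, 2*retry_delay, 4*retry_delay, ... between tries."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries; surface the last error
            time.sleep(retry_delay * (2 ** attempt))
```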
**Usage Examples:**
**Training:**
```bash
# Basic training with thinking mode
python arena_hard_environment.py serve \
--env.thinking_mode=True \
--env.rollout_temperature=1.0 \
--env.group_size=8
# Training without thinking mode
python arena_hard_environment.py serve \
--env.thinking_mode=False \
--env.custom_system_prompt="You are a helpful assistant." \
--env.eval_temperature=0.0
# Training with custom datasets
python arena_hard_environment.py serve \
--env.train_prompt_dataset="/path/to/custom_prompts.jsonl" \
--env.train_baseline_dataset="/path/to/custom_baselines.jsonl"
```
**Evaluation:**
```bash
# Evaluate model performance
python arena_hard_environment.py evaluate \
--env.thinking_mode=True \
--env.eval_temperature=0.0 \
--env.judge_temperature=0.0
# Evaluate with debug logging
python arena_hard_environment.py evaluate \
--env.full_debug=True \
--env.eval_max_tokens=8192
```
**Evaluation Metrics:**
- `eval/overall_winrate`: Overall Arena-Hard winrate (0.0 to 1.0)
- `eval/winrate_{category}`: Per-category winrates when available
- `eval/win_count`/`eval/tie_count`/`eval/loss_count`: Raw judgment counts
- `eval/win_rate`/`eval/tie_rate`/`eval/loss_rate`: Judgment proportions
- `eval/total_samples`: Number of evaluation samples processed
**Training Metrics:**
- `train/winrate`: Training winrate based on judgment outcomes
- `train/win_rate`/`train/tie_rate`/`train/loss_rate`: Training judgment distributions
- `train/total_judgments`: Total judgments made during training
- `config/thinking_mode`: Whether thinking mode is enabled (1.0/0.0)
**Dependencies:**
- `openai` (for Claude Sonnet 4 API via Anthropic's OpenAI-compatible endpoint)
- `datasets` (for HuggingFace dataset loading)
- `tiktoken` (for tokenization)
- `wandb` (for metrics tracking)
- `tqdm` (for progress bars)
**Environment Variables Required:**
- `ANTHROPIC_API_KEY`: API key for Claude Sonnet 4 judge
**Key Implementation Details:**
- **Position Bias Reduction**: Dual-round judging with position swapping
- **Robust Parsing**: Multiple regex patterns for judgment extraction ([[A>B]], [[B>A]], [[A=B]])
- **Thinking Validation**: Strict validation of thinking tag format and content extraction
- **Error Handling**: Comprehensive retry logic with exponential backoff
- **Arena-Hard Compatibility**: Scores and metrics match original Arena-Hard methodology
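Extraction of the `[[A>B]]` / `[[B>A]]` / `[[A=B]]` verdicts mentioned above can be sketched with a single pattern; the score values mirror the reward scale from assistant A's point of view, and the function name is illustrative:

```python
import re

# Verdict markers from the judge prompt, scored from assistant A's side.
VERDICT_SCORES = {"A>B": 1.0, "A=B": 0.0, "B>A": -1.0}
VERDICT_RE = re.compile(r"\[\[(A>B|A=B|B>A)\]\]")

def parse_verdict(judgment: str):
    """Return the score for the last [[...]] verdict in the judge's
    response, or None if no verdict marker is found."""
    matches = VERDICT_RE.findall(judgment)
    return VERDICT_SCORES[matches[-1]] if matches else None
```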
---
### SWE-RL Environment (`swe_rl_env.py`)
Software Engineering Reinforcement Learning environment for training models to fix bugs based on issue descriptions and code context.