mirror of https://github.com/NousResearch/atropos.git (synced 2026-04-19 12:57:58 +00:00)

add answer format environment for rejection sampling
parent 24dd0a71b4, commit 8e1d160eef
2 changed files with 4731 additions and 0 deletions

473 environments/answer_format_environment/README.md (new file)
# Answer Format Environment

A comprehensive training environment for teaching language models to generate responses in specific structured formats. This environment focuses on **format adherence** rather than answer correctness, using randomized format requirements and corresponding parsers to train models on structured response generation.

## 🎯 Overview

The Answer Format Environment trains models to:
- Generate responses in 150+ different structured formats
- Follow strict thinking tag discipline (`<think></think>`)
- Parse and validate format compliance
- Handle multiple dataset types with appropriate format selection
- Maintain equivalent training ratios across formats (optional)

**Key Philosophy**: This environment scores based on format compliance (1.0 for correct format, 0.0 for incorrect), not answer accuracy. It teaches models to follow formatting instructions precisely.

## ✨ Key Features

### 🔄 **Randomized Format Selection**
- 150+ supported answer formats across multiple categories
- Weighted format selection (70% simple, 30% complex)
- Dataset type-aware format filtering (generic, math_only, code_only)
- Dynamic compositor system for complex structured responses

### 🧠 **Thinking Tag Validation**
- Enforces exactly one `<think></think>` section per response
- All reasoning must be contained within thinking tags
- Strict validation prevents multiple thinking sections
- Answer must appear after `</think>` in specified format

### 📊 **Comprehensive Data Management**
- Multi-dataset support with automatic shuffling
- Configurable train/eval splits
- Extensive data dumping with group-level statistics
- Failed rollout tracking and analysis
- WandB integration with detailed metrics

### ⚖️ **Equivalent Ratio Enforcement**
- Optional system to ensure balanced training across formats
- Pauses formats after reaching success threshold
- Prevents format bias in training data
- Comprehensive monitoring and status reporting

### 🔍 **Advanced Monitoring**
- Format success rate tracking
- Group-level performance statistics
- Failure case analysis and logging
- Real-time format balance monitoring

## 📋 Supported Format Categories

### **Basic Structured Data**
```json
{"answer": "content"} // JSON
answer: content // YAML
answer = "content" // TOML
```

### **XML/HTML Tags**
```xml
<answer>content</answer> // XML
<answer>Final Answer: content</answer> // XML with prefix
<output>content</output> // Output tags
<result>content</result> // Result tags
```

### **LaTeX Formats**
```latex
\boxed{content} // Text-friendly boxed
$\boxed{expression}$ // Math-only boxed
\begin{align} expression \end{align} // Math alignment
$\text{answer}$ // Text in math mode
```

### **Natural Language**
```
The answer is: content
Final answer: content
In conclusion: content
Therefore: content
```

### **Programming Formats**
```python
print("answer") // Python print
console.log("answer") // JavaScript console
# answer // Python comment
return "answer" // Return statement
```

### **Complex Multi-Tag Formats**
```xml
<restatement>...</restatement>
<reasoning>...</reasoning>
<solution>...</solution>
<explanation>...</explanation>
```

### **Dynamic Compositor Formats**
Randomly combines 3-6 components in XML, JSON, YAML, or TOML:
- Analysis components (problem_analysis, requirements_analysis, etc.)
- Reasoning components (logical_reasoning, step_by_step, etc.)
- Planning components (approach, methodology, etc.)
- Technical components (implementation, code_structure, etc.)
- Evaluation components (validation, testing, etc.)
- Output components (final_answer, conclusion, etc.)
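
Each format above is paired with a parser that checks compliance. As a rough illustration (a sketch, not the environment's actual parser), a compliance check for the JSON format might look like this:

```python
import json


def check_json_format(response: str) -> bool:
    """Illustrative check: the text after </think> must be a JSON object
    with a single "answer" key. Sketches the format-compliance idea only."""
    parts = response.split("</think>")
    if len(parts) != 2:  # exactly one thinking section required
        return False
    answer_text = parts[1].strip()
    try:
        parsed = json.loads(answer_text)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(parsed) == {"answer"}


print(check_json_format('<think>reasoning</think>\n{"answer": "42"}'))  # True
print(check_json_format('<think>reasoning</think>\nThe answer is 42'))  # False
```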
## 🚀 Quick Start

### Basic Configuration

```python
from atropos.environments.answer_format_environment import AnswerFormatEnv, AnswerFormatEnvConfig

# Simple configuration
config = AnswerFormatEnvConfig(
    dataset_configs=[
        {
            "name": "your_dataset",
            "split": "train",
            "sample_size": 1000,
            "prompt_field": "question",
            "answer_field": "answer",
            "dataset_type": "generic"
        }
    ],
    debug_logging=True,
    dump_rollouts=True,
    eval_set_percentage=0.1
)

# Initialize environment
env = AnswerFormatEnv(config, server_configs)
```

### Multi-Dataset Configuration

```python
config = AnswerFormatEnvConfig(
    dataset_configs=[
        {
            "name": "teknium/OpenHermes-2.5",
            "split": "train",
            "sample_size": 5000,
            "prompt_field": "conversations",
            "answer_field": "conversations",
            "metadata_fields": ["source"],
            "dataset_type": "generic"
        },
        {
            "name": "gsm8k",
            "split": "train",
            "sample_size": 2000,
            "prompt_field": "question",
            "answer_field": "answer",
            "dataset_type": "math_only"
        },
        {
            "name": "NousResearch/AcademicMCQA",
            "split": "train",
            "sample_size": 5000,
            "prompt_field": "prompt",
            "answer_field": "ground_truth",
            "metadata_fields": ["answer", "options"],
            "dataset_type": "generic"
        }
    ],
    ensure_equivalent_ratios=True,
    format_group_threshold=50,
    dump_failed_rollouts=True
)
```

## ⚙️ Configuration Options

### **Dataset Configuration**
```python
dataset_configs: List[Dict[str, Any]] = [
    {
        "name": str,                  # Dataset name or HuggingFace path
        "split": str,                 # Dataset split ("train", "test", etc.)
        "sample_size": int,           # Number of samples to use
        "prompt_field": str,          # Field containing prompts/questions
        "answer_field": str,          # Field containing answers
        "metadata_fields": List[str], # Additional fields to preserve
        "dataset_type": str           # "generic", "math_only", or "code_only"
    }
]
```

### **Core Settings**
```python
debug_logging: bool = True                 # Enable detailed logging
dump_rollouts: bool = True                 # Save rollouts to JSONL
dump_failed_rollouts: bool = True          # Save failed rollouts separately
rollout_save_score_threshold: float = 0.0  # Minimum score to save rollouts
eval_set_percentage: float = 0.1           # Evaluation set percentage
```

### **Format Control**
```python
supported_formats: Optional[List[AnswerFormat]] = None  # Filter to specific formats
ensure_equivalent_ratios: bool = False                  # Enable ratio enforcement
format_group_threshold: int = 50                        # Success threshold per format
```

## 📊 Dataset Types & Format Selection

### **Generic Datasets** (`dataset_type: "generic"`)
- **Available Formats**: All basic formats (JSON, XML, natural language, brackets, etc.)
- **Use Case**: General conversation, QA, instruction following, MCQA
- **Examples**: OpenHermes, Alpaca, AcademicMCQA, general chat datasets

### **Math-Only Datasets** (`dataset_type: "math_only"`)
- **Available Formats**: Generic formats + LaTeX math expressions
- **Additional Formats**: `$\boxed{}$`, `\begin{align}`, `$\text{}$`, etc.
- **Use Case**: Mathematical problem solving
- **Examples**: GSM8K, MATH, mathematical reasoning datasets

### **Code-Only Datasets** (`dataset_type: "code_only"`)
- **Available Formats**: Generic formats + programming-specific formats
- **Additional Formats**: `print()`, `console.log()`, comments, return statements
- **Use Case**: Code generation, programming problems
- **Examples**: HumanEval, MBPP, coding datasets
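
Conceptually, the dataset type simply widens the pool of formats the environment may sample from. A minimal sketch, with hypothetical format names standing in for the real `AnswerFormat` values:

```python
import random

# Hypothetical format names, for illustration only.
GENERIC_FORMATS = ["json", "xml", "natural_language_answer", "brackets"]
MATH_FORMATS = ["latex_boxed_math", "latex_align", "latex_text"]
CODE_FORMATS = ["python_print", "console_log", "python_comment", "return_statement"]


def eligible_formats(dataset_type: str) -> list[str]:
    """Generic formats are always eligible; math/code datasets add extras."""
    formats = list(GENERIC_FORMATS)
    if dataset_type == "math_only":
        formats += MATH_FORMATS
    elif dataset_type == "code_only":
        formats += CODE_FORMATS
    return formats


print(random.choice(eligible_formats("math_only")))
```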
## 🎲 Dynamic Compositor System

The dynamic compositor creates complex structured responses by randomly combining components:

### **Component Categories**
- **Analysis**: `problem_analysis`, `requirements_analysis`, `context_analysis`
- **Reasoning**: `logical_reasoning`, `step_by_step`, `causal_reasoning`
- **Planning**: `approach`, `methodology`, `strategy`
- **Technical**: `implementation`, `code_structure`, `algorithm`
- **Evaluation**: `validation`, `testing`, `verification`
- **Output**: `final_answer`, `conclusion`, `summary`

### **Output Formats**
- **XML**: `<component_name>content</component_name>`
- **JSON**: `{"component_name": "content"}`
- **YAML**: `component_name: content`
- **TOML**: `component_name = "content"`

### **Example Dynamic Format**
```xml
<problem_analysis>Understanding the requirements...</problem_analysis>
<logical_reasoning>Step by step analysis...</logical_reasoning>
<approach>My methodology will be...</approach>
<implementation>Here's the solution...</implementation>
<final_answer>42</final_answer>
```
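
A minimal sketch of how such a compositor could be assembled, using component names from the lists above (the helpers are illustrative, not the environment's API):

```python
import random

# A slice of the component catalog above; the full catalog is larger.
COMPONENTS = {
    "analysis": ["problem_analysis", "requirements_analysis", "context_analysis"],
    "reasoning": ["logical_reasoning", "step_by_step", "causal_reasoning"],
    "output": ["final_answer", "conclusion", "summary"],
}


def pick_components(min_parts: int = 3, max_parts: int = 6) -> list[str]:
    """Randomly select 3-6 component names across categories."""
    pool = [name for names in COMPONENTS.values() for name in names]
    count = random.randint(min_parts, min(max_parts, len(pool)))
    return random.sample(pool, count)


def render_xml_template(components: list[str]) -> str:
    """Render the expected response skeleton as XML tags."""
    return "\n".join(f"<{name}>...</{name}>" for name in components)


print(render_xml_template(pick_components()))
```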
## 📈 Monitoring & Analytics

### **WandB Metrics**
- `train/percent_correct`: Overall format compliance rate
- `train/format_success_rate_{format}`: Per-format success rates
- `train/format_usage_count_{format}`: Usage frequency per format
- `train/equivalent_ratio_paused_formats`: Number of paused formats
- `train/group_success_rate`: Percentage of successful groups
- `train/failed_groups_count`: Number of completely failed groups

### **Group-Level Statistics**
```
Format: json_confidence | Group average score: 0.7500 | 12/16 correct (75.0%)
Format: latex_boxed_math | Group average score: 0.0000 | 0/16 correct (0.0%) (All failures!)
Format: natural_language_answer | Group average score: 1.0000 | 16/16 correct (100.0%) (Perfect group!)
```

### **Data Dumps**
- **Regular Rollouts**: `answer_format_rollouts_{uuid}_{batch}.jsonl`
- **Failed Rollouts**: `answer_format_failed_rollouts_{uuid}_{batch}.jsonl`
- **Metadata**: Format type, scores, conversation history, timestamps
- **Batch Size**: 100 groups per file

## ⚖️ Equivalent Ratio Enforcement

Optional system to ensure balanced training across all formats:

### **How It Works**
1. Tracks successful groups per format
2. Pauses formats that reach threshold (default: 50 successful groups)
3. Continues training other formats until they catch up
4. Resumes paused formats when balance is restored
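
A sketch of what this bookkeeping might reduce to (a hypothetical class; the actual pause/resume rules live inside the environment):

```python
from collections import defaultdict


class RatioTracker:
    """Illustrative pause/resume bookkeeping for format balancing."""

    def __init__(self, threshold: int = 50):
        self.threshold = threshold
        self.successes: defaultdict[str, int] = defaultdict(int)

    def record_success(self, fmt: str) -> None:
        self.successes[fmt] += 1

    def is_paused(self, fmt: str) -> bool:
        # Pause once a format hits the threshold and is still ahead of the
        # slowest format; it resumes when the others catch up.
        count = self.successes[fmt]
        slowest = min(self.successes.values(), default=0)
        return count >= self.threshold and count > slowest


tracker = RatioTracker(threshold=50)
tracker.record_success("json")
print(tracker.is_paused("json"))  # False until 50 successful groups
```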
### **Configuration**
```python
ensure_equivalent_ratios=True,  # Enable the system
format_group_threshold=50,      # Success threshold per format
```

### **Monitoring**
```python
status = env.get_equivalent_ratio_status()
print(f"Paused formats: {status['paused_formats']}")
print(f"Active formats: {status['active_formats']}")
print(f"Progress: {status['format_progress']}")
```

## 🔧 Advanced Usage

### **Custom Format Filtering**
```python
from atropos.environments.answer_format_environment import AnswerFormat

config = AnswerFormatEnvConfig(
    supported_formats=[
        AnswerFormat.JSON,
        AnswerFormat.XML,
        AnswerFormat.LATEX_BOXED,
        AnswerFormat.NATURAL_LANGUAGE_ANSWER
    ]
)
```

### **Evaluation Mode**
```python
# Run evaluation on held-out set
eval_score = await env.evaluate()
print(f"Evaluation format compliance: {eval_score}")
```

### **Supported Dataset Formats**

#### **OpenHermes Conversations**
```python
{
    "name": "teknium/OpenHermes-2.5",
    "prompt_field": "conversations",
    "answer_field": "conversations"
}
```

#### **GSM8K Math Problems**
```python
{
    "name": "gsm8k",
    "prompt_field": "question",
    "answer_field": "answer"
}
```

#### **AcademicMCQA Multiple Choice**
```python
{
    "name": "NousResearch/AcademicMCQA",
    "prompt_field": "prompt",
    "answer_field": "ground_truth",
    "metadata_fields": ["answer", "options"]
}
```

## 📝 Response Format Requirements

### **Thinking Tags**
- **Required**: Exactly one `<think>` opening and one `</think>` closing tag
- **Content**: All reasoning must be inside thinking tags
- **Placement**: Answer must appear after `</think>` in specified format
- **Validation**: No additional thinking tags allowed after first `</think>`

### **Example Valid Response**
```
<think>
Let me analyze this problem step by step.
First, I need to understand what's being asked...
The solution involves calculating...
</think>

{"answer": "42"}
```

### **Example Invalid Response**
```
<think>Some reasoning</think>
{"answer": "42"}
<think>More reasoning</think> // ❌ Additional thinking tags not allowed
```
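
A minimal sketch of how this validation can be implemented (regex-based; the environment's actual checks may differ in detail):

```python
import re


def validate_thinking_tags(response: str) -> bool:
    """Require exactly one <think>...</think> pair with content after it."""
    opens = len(re.findall(r"<think>", response))
    closes = len(re.findall(r"</think>", response))
    if opens != 1 or closes != 1:
        return False
    after = response.split("</think>", 1)[1]
    # Something must follow the thinking section, and no stray tags may appear.
    return after.strip() != "" and "<think>" not in after


print(validate_thinking_tags('<think>Some reasoning</think>\n{"answer": "42"}'))  # True
print(validate_thinking_tags('<think>a</think>ok<think>b</think>'))               # False
```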
## 🚨 Common Issues & Solutions

### **Format Validation Failures**
- **Issue**: Response doesn't match expected format
- **Solution**: Check regex patterns and ensure exact format compliance
- **Debug**: Enable `debug_logging=True` for detailed validation info

### **Thinking Tag Violations**
- **Issue**: Multiple thinking sections or missing tags
- **Solution**: Ensure exactly one `<think></think>` section per response
- **Debug**: Check thinking tag validation logs

### **Dataset Loading Errors**
- **Issue**: Dataset not found or field missing
- **Solution**: Verify dataset name, split, and field names
- **Debug**: Check dataset configuration and field mappings

### **Memory Issues with Large Datasets**
- **Issue**: Out of memory with large sample sizes
- **Solution**: Reduce `sample_size` or use streaming datasets
- **Debug**: Monitor memory usage and adjust batch sizes

## 🔍 Debugging & Development

### **Enable Debug Logging**
```python
config = AnswerFormatEnvConfig(debug_logging=True)
```

### **Check Format Status**
```python
# Get equivalent ratio status
status = env.get_equivalent_ratio_status()

# Check format success rates
for format_name, success_rate in env.format_success_rates.items():
    print(f"{format_name}: {success_rate:.2%}")
```

### **Analyze Failed Rollouts**
```python
# Failed rollouts are automatically saved when dump_failed_rollouts=True
# Check the datadumps directory for analysis files
```
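
Because the dumps are plain JSONL, they can be inspected with a few lines of standard-library Python. The file name and the `format` field below are assumptions; check your dump's actual schema:

```python
import json
from collections import Counter

# Hypothetical file name; substitute one of your actual dump files.
path = "datadumps/answer_format_failed_rollouts_example_0001.jsonl"

failures_by_format: Counter = Counter()
with open(path) as f:
    for line in f:
        record = json.loads(line)
        # "format" is an assumed field name based on the metadata list above.
        failures_by_format[record.get("format", "unknown")] += 1

for fmt, count in failures_by_format.most_common(10):
    print(f"{fmt}: {count} failed records")
```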
## 📚 Technical Details

### **Scoring System**
- **Format Compliance**: 1.0 for correct format, 0.0 for incorrect
- **No Answer Accuracy**: Content correctness is not evaluated
- **Group Scoring**: Average of all rollouts in a group
- **Success Threshold**: Groups with >0.0 average are considered successful
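
Numerically, group scoring is a plain average over a group's rollout scores, which is where the group-level statistics shown earlier come from:

```python
def group_score(rollout_scores: list[float]) -> float:
    """Average format-compliance score across a group's rollouts."""
    return sum(rollout_scores) / len(rollout_scores)


scores = [1.0] * 12 + [0.0] * 4  # 12/16 rollouts format-compliant
avg = group_score(scores)
print(f"Group average score: {avg:.4f}")  # 0.7500
print(f"Successful group: {avg > 0.0}")   # True
```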
### **Format Validation**
- **Regex Patterns**: Each format has specific validation patterns
- **Exact Matching**: Strict compliance required for scoring
- **Content Extraction**: Validated content is extracted for consistency

### **Dynamic Format Generation**
- **Component Selection**: Random selection of 3-6 components
- **Format Templates**: XML, JSON, YAML, TOML output formats
- **Validation Storage**: Components stored for precise validation

### **Special Dataset Handling**
- **GSM8K**: Extracts numerical answers from `####`-separated format (see the sketch after this list)
- **AcademicMCQA**: Uses ground truth letters (A, B, C, D) as answers
- **OpenHermes**: Extracts from conversation format with role-based parsing
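
For example, GSM8K reference answers end in `#### <number>`, so the extraction for that dataset amounts to (an illustrative helper, not the environment's exact code):

```python
def extract_gsm8k_answer(answer_field: str) -> str:
    """GSM8K answers end with '#### <number>'; keep only the final number."""
    return answer_field.split("####")[-1].strip()


print(extract_gsm8k_answer("She has 3 * 4 = 12 apples.\n#### 12"))  # "12"
```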
## 🤝 Contributing

### **Adding New Formats**
1. Add enum value to `AnswerFormat`
2. Add system prompt instruction
3. Add validation regex pattern
4. Add content extraction logic
5. Test with sample responses
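
In outline, a new format touches an enum value, a prompt instruction, a validation pattern, and an extractor. A hypothetical sketch (names are illustrative, not the environment's internals):

```python
import re
from enum import Enum


# 1. Hypothetical mirror of the AnswerFormat enum, for illustration.
class AnswerFormat(Enum):
    DOUBLE_BRACKETS = "double_brackets"


# 2. System prompt instruction for the new format.
FORMAT_INSTRUCTIONS = {
    AnswerFormat.DOUBLE_BRACKETS: "Provide your final answer as [[answer]].",
}

# 3. Validation regex pattern.
DOUBLE_BRACKETS_RE = re.compile(r"\[\[(.+?)\]\]", re.DOTALL)


# 4. Content extraction logic.
def extract_double_brackets(text: str) -> str | None:
    match = DOUBLE_BRACKETS_RE.search(text)
    return match.group(1).strip() if match else None


# 5. Test with sample responses.
assert extract_double_brackets("[[42]]") == "42"
assert extract_double_brackets("no brackets here") is None
```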
### **Adding New Dataset Types**
1. Define dataset type in configuration
2. Add format filtering logic
3. Update format selection methods
4. Test with representative datasets

### **Adding New Datasets**
1. Add special handling in `setup()` method
2. Define field mappings in configuration
3. Add metadata extraction logic
4. Test dataset loading and processing

## 📄 License

This environment is part of the Atropos training framework. See the main repository for license information.

---

**Need Help?** Check the debug logs, enable verbose logging, or review the comprehensive monitoring metrics for troubleshooting guidance.