# Answer Format Environment

A comprehensive training environment for teaching language models to generate responses in specific structured formats. This environment focuses on **format adherence** rather than answer correctness, using randomized format requirements and corresponding parsers to train models on structured response generation.

## 🎯 Overview

The Answer Format Environment trains models to:

- Generate responses in 150+ different structured formats
- Follow strict thinking tag discipline (`<think></think>`)
- Parse and validate format compliance
- Handle multiple dataset types with appropriate format selection
- Maintain equivalent training ratios across formats (optional)

**Key Philosophy**: This environment scores based on format compliance (1.0 for correct format, 0.0 for incorrect), not answer accuracy. It teaches models to follow formatting instructions precisely.

## ✨ Key Features

### 🔄 **Randomized Format Selection**
- 150+ supported answer formats across multiple categories
- Weighted format selection (70% simple, 30% complex)
- Dataset type-aware format filtering (generic, math_only, code_only)
- Dynamic compositor system for complex structured responses

### 🧠 **Thinking Tag Validation**
- Enforces exactly one `<think>...</think>` section per response
- All reasoning must be contained within thinking tags
- Strict validation prevents multiple thinking sections
- Answer must appear after `</think>` in the specified format

### 📊 **Comprehensive Data Management**
- Multi-dataset support with automatic shuffling
- Configurable train/eval splits
- Extensive data dumping with group-level statistics
- Failed rollout tracking and analysis
- WandB integration with detailed metrics

### ⚖️ **Equivalent Ratio Enforcement**
- Optional system to ensure balanced training across formats
- Pauses formats after reaching a success threshold
- Prevents format bias in training data
- Comprehensive monitoring and status reporting

### 🔍 **Advanced Monitoring**
- Format success rate tracking
- Group-level performance statistics
- Failure case analysis and logging
- Real-time format balance monitoring

## 📋 Supported Format Categories

### **Basic Structured Data**
```json
{"answer": "content"}   // JSON
answer: content         // YAML
answer = "content"      // TOML
```

### **XML/HTML Tags**
```xml
<answer>content</answer>                // XML
Final Answer: <answer>content</answer>  // XML with prefix
<output>content</output>                // Output tags
<result>content</result>                // Result tags
```

### **LaTeX Formats**
```latex
\boxed{content}                       // Text-friendly boxed
$\boxed{expression}$                  // Math-only boxed
\begin{align} expression \end{align}  // Math alignment
$\text{answer}$                       // Text in math mode
```

### **Natural Language**
```
The answer is: content
Final answer: content
In conclusion: content
Therefore: content
```

### **Programming Formats**
```python
print("answer")        // Python print
console.log("answer")  // JavaScript console
# answer               // Python comment
return "answer"        // Return statement
```

### **Complex Multi-Tag Formats**
```xml
...
...
...
...
```

### **Dynamic Compositor Formats**
Randomly combines 3-6 components in XML, JSON, YAML, or TOML:

- Analysis components (problem_analysis, requirements_analysis, etc.)
- Reasoning components (logical_reasoning, step_by_step, etc.)
- Planning components (approach, methodology, etc.)
- Technical components (implementation, code_structure, etc.)
- Evaluation components (validation, testing, etc.)
- Output components (final_answer, conclusion, etc.)
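The scoring rule behind all of these formats (1.0 for exact compliance, 0.0 otherwise, with exactly one thinking section) can be sketched as a small standalone validator. This is an illustrative sketch, not the environment's actual implementation, and it covers only the JSON answer format; the function name is hypothetical.

```python
import json
import re


def score_response(response: str) -> float:
    """Toy format-compliance scorer: 1.0 for a valid response, else 0.0.

    A valid response has exactly one <think>...</think> section, and the
    text after </think> must be a JSON object of the form {"answer": ...}.
    """
    # Exactly one opening and one closing thinking tag.
    if response.count("<think>") != 1 or response.count("</think>") != 1:
        return 0.0

    match = re.search(r"<think>(.*?)</think>(.*)", response, re.DOTALL)
    if match is None:
        return 0.0

    answer_part = match.group(2).strip()

    # The answer must parse as JSON with an "answer" key and nothing else.
    try:
        parsed = json.loads(answer_part)
    except json.JSONDecodeError:
        return 0.0
    if not (isinstance(parsed, dict) and set(parsed) == {"answer"}):
        return 0.0
    return 1.0
```

Scoring is all-or-nothing by design: a response that reasons correctly but emits `The answer is 42` when JSON was requested still scores 0.0.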
## 🚀 Quick Start

### Basic Configuration
```python
from atropos.environments.answer_format_environment import AnswerFormatEnv, AnswerFormatEnvConfig

# Simple configuration
config = AnswerFormatEnvConfig(
    dataset_configs=[
        {
            "name": "your_dataset",
            "split": "train",
            "sample_size": 1000,
            "prompt_field": "question",
            "answer_field": "answer",
            "dataset_type": "generic"
        }
    ],
    debug_logging=True,
    dump_rollouts=True,
    eval_set_percentage=0.1
)

# Initialize environment
env = AnswerFormatEnv(config, server_configs)
```

### Multi-Dataset Configuration
```python
config = AnswerFormatEnvConfig(
    dataset_configs=[
        {
            "name": "teknium/OpenHermes-2.5",
            "split": "train",
            "sample_size": 5000,
            "prompt_field": "conversations",
            "answer_field": "conversations",
            "metadata_fields": ["source"],
            "dataset_type": "generic"
        },
        {
            "name": "gsm8k",
            "split": "train",
            "sample_size": 2000,
            "prompt_field": "question",
            "answer_field": "answer",
            "dataset_type": "math_only"
        },
        {
            "name": "NousResearch/AcademicMCQA",
            "split": "train",
            "sample_size": 5000,
            "prompt_field": "prompt",
            "answer_field": "ground_truth",
            "metadata_fields": ["answer", "options"],
            "dataset_type": "generic"
        }
    ],
    ensure_equivalent_ratios=True,
    format_group_threshold=50,
    dump_failed_rollouts=True
)
```

## ⚙️ Configuration Options

### **Dataset Configuration**
```python
dataset_configs: List[Dict[str, Any]] = [
    {
        "name": str,                   # Dataset name or HuggingFace path
        "split": str,                  # Dataset split ("train", "test", etc.)
        "sample_size": int,            # Number of samples to use
        "prompt_field": str,           # Field containing prompts/questions
        "answer_field": str,           # Field containing answers
        "metadata_fields": List[str],  # Additional fields to preserve
        "dataset_type": str            # "generic", "math_only", or "code_only"
    }
]
```

### **Core Settings**
```python
debug_logging: bool = True                 # Enable detailed logging
dump_rollouts: bool = True                 # Save rollouts to JSONL
dump_failed_rollouts: bool = True          # Save failed rollouts separately
rollout_save_score_threshold: float = 0.0  # Minimum score to save rollouts
eval_set_percentage: float = 0.1           # Evaluation set percentage
```

### **Format Control**
```python
supported_formats: Optional[List[AnswerFormat]] = None  # Filter to specific formats
ensure_equivalent_ratios: bool = False                  # Enable ratio enforcement
format_group_threshold: int = 50                        # Success threshold per format
```

## 📊 Dataset Types & Format Selection

### **Generic Datasets** (`dataset_type: "generic"`)
- **Available Formats**: All basic formats (JSON, XML, natural language, brackets, etc.)
- **Use Case**: General conversation, QA, instruction following, MCQA
- **Examples**: OpenHermes, Alpaca, AcademicMCQA, general chat datasets

### **Math-Only Datasets** (`dataset_type: "math_only"`)
- **Available Formats**: Generic formats + LaTeX math expressions
- **Additional Formats**: `$\boxed{}$`, `\begin{align}`, `$\text{}$`, etc.
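Dataset-type-aware filtering amounts to a simple lookup from dataset type to candidate formats. The sketch below is illustrative only; the format names are hypothetical, not the environment's actual `AnswerFormat` enum values.

```python
# Hypothetical sketch of dataset-type-aware format filtering.
GENERIC_FORMATS = ["json", "xml", "natural_language", "brackets"]
MATH_FORMATS = ["latex_boxed_math", "latex_align", "latex_text"]
CODE_FORMATS = ["python_print", "console_log", "python_comment", "return_statement"]


def formats_for(dataset_type: str) -> list[str]:
    """Return the candidate answer formats for a dataset type."""
    if dataset_type == "math_only":
        # Math datasets get the generic formats plus LaTeX math expressions.
        return GENERIC_FORMATS + MATH_FORMATS
    if dataset_type == "code_only":
        # Code datasets get the generic formats plus programming formats.
        return GENERIC_FORMATS + CODE_FORMATS
    return GENERIC_FORMATS
```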
- **Use Case**: Mathematical problem solving
- **Examples**: GSM8K, MATH, mathematical reasoning datasets

### **Code-Only Datasets** (`dataset_type: "code_only"`)
- **Available Formats**: Generic formats + programming-specific formats
- **Additional Formats**: `print()`, `console.log()`, comments, return statements
- **Use Case**: Code generation, programming problems
- **Examples**: HumanEval, MBPP, coding datasets

## 🎲 Dynamic Compositor System

The dynamic compositor creates complex structured responses by randomly combining components:

### **Component Categories**
- **Analysis**: `problem_analysis`, `requirements_analysis`, `context_analysis`
- **Reasoning**: `logical_reasoning`, `step_by_step`, `causal_reasoning`
- **Planning**: `approach`, `methodology`, `strategy`
- **Technical**: `implementation`, `code_structure`, `algorithm`
- **Evaluation**: `validation`, `testing`, `verification`
- **Output**: `final_answer`, `conclusion`, `summary`

### **Output Formats**
- **XML**: `<component_name>content</component_name>`
- **JSON**: `{"component_name": "content"}`
- **YAML**: `component_name: content`
- **TOML**: `component_name = "content"`

### **Example Dynamic Format**
```xml
<problem_analysis>Understanding the requirements...</problem_analysis>
<step_by_step>Step by step analysis...</step_by_step>
<methodology>My methodology will be...</methodology>
<implementation>Here's the solution...</implementation>
<final_answer>42</final_answer>
```

## 📈 Monitoring & Analytics

### **WandB Metrics**
- `train/percent_correct`: Overall format compliance rate
- `train/format_success_rate_{format}`: Per-format success rates
- `train/format_usage_count_{format}`: Usage frequency per format
- `train/equivalent_ratio_paused_formats`: Number of paused formats
- `train/group_success_rate`: Percentage of successful groups
- `train/failed_groups_count`: Number of completely failed groups

### **Group-Level Statistics**
```
Format: json_confidence | Group average score: 0.7500 | 12/16 correct (75.0%)
Format: latex_boxed_math | Group average score: 0.0000 | 0/16 correct (0.0%) (All failures!)
Format: natural_language_answer | Group average score: 1.0000 | 16/16 correct (100.0%) (Perfect group!)
```

### **Data Dumps**
- **Regular Rollouts**: `answer_format_rollouts_{uuid}_{batch}.jsonl`
- **Failed Rollouts**: `answer_format_failed_rollouts_{uuid}_{batch}.jsonl`
- **Metadata**: Format type, scores, conversation history, timestamps
- **Batch Size**: 100 groups per file

## ⚖️ Equivalent Ratio Enforcement

Optional system to ensure balanced training across all formats:

### **How It Works**
1. Tracks successful groups per format
2. Pauses formats that reach the threshold (default: 50 successful groups)
3. Continues training other formats until they catch up
4. Resumes paused formats when balance is restored

### **Configuration**
```python
ensure_equivalent_ratios=True,  # Enable the system
format_group_threshold=50,      # Success threshold per format
```

### **Monitoring**
```python
status = env.get_equivalent_ratio_status()
print(f"Paused formats: {status['paused_formats']}")
print(f"Active formats: {status['active_formats']}")
print(f"Progress: {status['format_progress']}")
```

## 🔧 Advanced Usage

### **Custom Format Filtering**
```python
from atropos.environments.answer_format_environment import AnswerFormat

config = AnswerFormatEnvConfig(
    supported_formats=[
        AnswerFormat.JSON,
        AnswerFormat.XML,
        AnswerFormat.LATEX_BOXED,
        AnswerFormat.NATURAL_LANGUAGE_ANSWER
    ]
)
```

### **Evaluation Mode**
```python
# Run evaluation on held-out set
eval_score = await env.evaluate()
print(f"Evaluation format compliance: {eval_score}")
```

### **Supported Dataset Formats**

#### **OpenHermes Conversations**
```python
{
    "name": "teknium/OpenHermes-2.5",
    "prompt_field": "conversations",
    "answer_field": "conversations"
}
```

#### **GSM8K Math Problems**
```python
{
    "name": "gsm8k",
    "prompt_field": "question",
    "answer_field": "answer"
}
```

#### **AcademicMCQA Multiple Choice**
```python
{
    "name": "NousResearch/AcademicMCQA",
    "prompt_field": "prompt",
    "answer_field": "ground_truth",
    "metadata_fields": ["answer", "options"]
}
```

## 📝 Response Format Requirements

### **Thinking Tags**
- **Required**: Exactly one `<think>` opening and one `</think>` closing tag
- **Content**: All reasoning must be inside thinking tags
- **Placement**: Answer must appear after `</think>` in the specified format
- **Validation**: No additional thinking tags allowed after the first `</think>`

### **Example Valid Response**
```
<think>
Let me analyze this problem step by step.
First, I need to understand what's being asked...
The solution involves calculating...
</think>

{"answer": "42"}
```

### **Example Invalid Response**
```
<think>
Some reasoning
</think>

{"answer": "42"}

<think>  // ❌ Additional thinking tags not allowed
More reasoning
</think>
```

## 🚨 Common Issues & Solutions

### **Format Validation Failures**
- **Issue**: Response doesn't match the expected format
- **Solution**: Check regex patterns and ensure exact format compliance
- **Debug**: Enable `debug_logging=True` for detailed validation info

### **Thinking Tag Violations**
- **Issue**: Multiple thinking sections or missing tags
- **Solution**: Ensure exactly one `<think>...</think>` section per response
- **Debug**: Check the thinking tag validation logs

### **Dataset Loading Errors**
- **Issue**: Dataset not found or field missing
- **Solution**: Verify dataset name, split, and field names
- **Debug**: Check dataset configuration and field mappings

### **Memory Issues with Large Datasets**
- **Issue**: Out of memory with large sample sizes
- **Solution**: Reduce `sample_size` or use streaming datasets
- **Debug**: Monitor memory usage and adjust batch sizes

## 🔍 Debugging & Development

### **Enable Debug Logging**
```python
config = AnswerFormatEnvConfig(debug_logging=True)
```

### **Check Format Status**
```python
# Get equivalent ratio status
status = env.get_equivalent_ratio_status()

# Check format success rates
for format_name, success_rate in env.format_success_rates.items():
    print(f"{format_name}: {success_rate:.2%}")
```

### **Analyze Failed Rollouts**
```python
# Failed rollouts are automatically saved when dump_failed_rollouts=True
# Check the datadumps directory for analysis files
```

## 📚 Technical Details

### **Scoring System**
- **Format Compliance**: 1.0 for correct format, 0.0 for incorrect
- **No Answer Accuracy**: Content correctness is not evaluated
- **Group Scoring**: Average of all rollouts in a group
- **Success Threshold**: Groups with a >0.0 average are considered successful

### **Format Validation**
- **Regex Patterns**: Each format has specific validation patterns
- **Exact Matching**: Strict compliance required for scoring
- **Content Extraction**: Validated content is extracted for consistency

### **Dynamic Format Generation**
- **Component Selection**: Random selection of 3-6 components
- **Format Templates**: XML, JSON, YAML, TOML output formats
- **Validation Storage**: Components stored for precise validation

### **Special Dataset Handling**
- **GSM8K**: Extracts numerical answers from the `####`-separated format
- **AcademicMCQA**: Uses ground-truth letters (A, B, C, D) as answers
- **OpenHermes**: Extracts from conversation format with role-based parsing

## 🤝 Contributing

### **Adding New Formats**
1. Add an enum value to `AnswerFormat`
2. Add a system prompt instruction
3. Add a validation regex pattern
4. Add content extraction logic
5. Test with sample responses

### **Adding New Dataset Types**
1. Define the dataset type in configuration
2. Add format filtering logic
3. Update format selection methods
4. Test with representative datasets

### **Adding New Datasets**
1. Add special handling in the `setup()` method
2. Define field mappings in configuration
3. Add metadata extraction logic
4. Test dataset loading and processing

## 📄 License

This environment is part of the Atropos training framework. See the main repository for license information.

---

**Need Help?** Check the debug logs, enable verbose logging, or review the comprehensive monitoring metrics for troubleshooting guidance.