diff --git a/environments/answer_format_environment/README.md b/environments/answer_format_environment/README.md
new file mode 100644
index 00000000..b592adfb
--- /dev/null
+++ b/environments/answer_format_environment/README.md
@@ -0,0 +1,473 @@
+# Answer Format Environment
+
+A comprehensive training environment for teaching language models to generate responses in specific structured formats. This environment focuses on **format adherence** rather than answer correctness, using randomized format requirements and corresponding parsers to train models on structured response generation.
+
+## 🎯 Overview
+
+The Answer Format Environment trains models to:
+- Generate responses in 150+ different structured formats
+- Follow strict thinking tag discipline (`<think></think>`)
+- Parse and validate format compliance
+- Handle multiple dataset types with appropriate format selection
+- Maintain equivalent training ratios across formats (optional)
+
+**Key Philosophy**: This environment scores based on format compliance (1.0 for correct format, 0.0 for incorrect), not answer accuracy. It teaches models to follow formatting instructions precisely.
+
+## ✨ Key Features
+
+### 🔄 **Randomized Format Selection**
+- 150+ supported answer formats across multiple categories
+- Weighted format selection (70% simple, 30% complex; see the sketch after this list)
+- Dataset type-aware format filtering (generic, math_only, code_only)
+- Dynamic compositor system for complex structured responses
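+
+A minimal sketch of the weighted selection above. The 70/30 split matches the stated weights, but the pool names and their contents are illustrative, not the environment's actual constants:
+
+```python
+import random
+
+# Illustrative pools; the real environment defines 150+ formats
+SIMPLE_FORMATS = ["json", "xml", "natural_language_answer"]
+COMPLEX_FORMATS = ["multi_tag", "dynamic_compositor"]
+
+
+def pick_format() -> str:
+    # 70% chance of drawing from the simple pool, 30% from the complex pool
+    pool = random.choices([SIMPLE_FORMATS, COMPLEX_FORMATS], weights=[0.7, 0.3])[0]
+    return random.choice(pool)
+```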
+
+### 🧠 **Thinking Tag Validation**
+- Enforces exactly one `<think>...</think>` section per response
+- All reasoning must be contained within thinking tags
+- Strict validation prevents multiple thinking sections
+- Answer must appear after `</think>` in the specified format
+
+### 📊 **Comprehensive Data Management**
+- Multi-dataset support with automatic shuffling
+- Configurable train/eval splits
+- Extensive data dumping with group-level statistics
+- Failed rollout tracking and analysis
+- WandB integration with detailed metrics
+
+### ⚖️ **Equivalent Ratio Enforcement**
+- Optional system to ensure balanced training across formats
+- Pauses formats after reaching success threshold
+- Prevents format bias in training data
+- Comprehensive monitoring and status reporting
+
+### 🔍 **Advanced Monitoring**
+- Format success rate tracking
+- Group-level performance statistics
+- Failure case analysis and logging
+- Real-time format balance monitoring
+
+## 📋 Supported Format Categories
+
+### **Basic Structured Data**
+```json
+{"answer": "content"} // JSON
+answer: content // YAML
+answer = "content" // TOML
+```
+
+### **XML/HTML Tags**
+```xml
+<answer>content</answer>                 // XML
+Final Answer: <answer>content</answer>   // XML with prefix
+<output>content</output>                 // Output tags
+<result>content</result>                 // Result tags
+```
+
+### **LaTeX Formats**
+```latex
+\boxed{content} // Text-friendly boxed
+$\boxed{expression}$ // Math-only boxed
+\begin{align} expression \end{align} // Math alignment
+$\text{answer}$ // Text in math mode
+```
+
+### **Natural Language**
+```
+The answer is: content
+Final answer: content
+In conclusion: content
+Therefore: content
+```
+
+### **Programming Formats**
+```python
+print("answer") // Python print
+console.log("answer") // JavaScript console
+# answer // Python comment
+return "answer" // Return statement
+```
+
+### **Complex Multi-Tag Formats**
+```xml
+<analysis>...</analysis>
+<reasoning>...</reasoning>
+<plan>...</plan>
+<answer>...</answer>
+```
+
+### **Dynamic Compositor Formats**
+Randomly combines 3-6 components in XML, JSON, YAML, or TOML:
+- Analysis components (problem_analysis, requirements_analysis, etc.)
+- Reasoning components (logical_reasoning, step_by_step, etc.)
+- Planning components (approach, methodology, etc.)
+- Technical components (implementation, code_structure, etc.)
+- Evaluation components (validation, testing, etc.)
+- Output components (final_answer, conclusion, etc.)
+
+## 🚀 Quick Start
+
+### Basic Configuration
+
+```python
+from atropos.environments.answer_format_environment import AnswerFormatEnv, AnswerFormatEnvConfig
+
+# Simple configuration
+config = AnswerFormatEnvConfig(
+ dataset_configs=[
+ {
+ "name": "your_dataset",
+ "split": "train",
+ "sample_size": 1000,
+ "prompt_field": "question",
+ "answer_field": "answer",
+ "dataset_type": "generic"
+ }
+ ],
+ debug_logging=True,
+ dump_rollouts=True,
+ eval_set_percentage=0.1
+)
+
+# Initialize environment
+env = AnswerFormatEnv(config, server_configs)
+```
+
+### Multi-Dataset Configuration
+
+```python
+config = AnswerFormatEnvConfig(
+ dataset_configs=[
+ {
+ "name": "teknium/OpenHermes-2.5",
+ "split": "train",
+ "sample_size": 5000,
+ "prompt_field": "conversations",
+ "answer_field": "conversations",
+ "metadata_fields": ["source"],
+ "dataset_type": "generic"
+ },
+ {
+ "name": "gsm8k",
+ "split": "train",
+ "sample_size": 2000,
+ "prompt_field": "question",
+ "answer_field": "answer",
+ "dataset_type": "math_only"
+ },
+ {
+ "name": "NousResearch/AcademicMCQA",
+ "split": "train",
+ "sample_size": 5000,
+ "prompt_field": "prompt",
+ "answer_field": "ground_truth",
+ "metadata_fields": ["answer", "options"],
+ "dataset_type": "generic"
+ }
+ ],
+ ensure_equivalent_ratios=True,
+ format_group_threshold=50,
+ dump_failed_rollouts=True
+)
+```
+
+## ⚙️ Configuration Options
+
+### **Dataset Configuration**
+```python
+dataset_configs: List[Dict[str, Any]] = [
+ {
+ "name": str, # Dataset name or HuggingFace path
+ "split": str, # Dataset split ("train", "test", etc.)
+ "sample_size": int, # Number of samples to use
+ "prompt_field": str, # Field containing prompts/questions
+ "answer_field": str, # Field containing answers
+ "metadata_fields": List[str], # Additional fields to preserve
+ "dataset_type": str # "generic", "math_only", or "code_only"
+ }
+]
+```
+
+### **Core Settings**
+```python
+debug_logging: bool = True # Enable detailed logging
+dump_rollouts: bool = True # Save rollouts to JSONL
+dump_failed_rollouts: bool = True # Save failed rollouts separately
+rollout_save_score_threshold: float = 0.0 # Minimum score to save rollouts
+eval_set_percentage: float = 0.1 # Evaluation set percentage
+```
+
+### **Format Control**
+```python
+supported_formats: Optional[List[AnswerFormat]] = None # Filter to specific formats
+ensure_equivalent_ratios: bool = False # Enable ratio enforcement
+format_group_threshold: int = 50 # Success threshold per format
+```
+
+## 📊 Dataset Types & Format Selection
+
+### **Generic Datasets** (`dataset_type: "generic"`)
+- **Available Formats**: All basic formats (JSON, XML, natural language, brackets, etc.)
+- **Use Case**: General conversation, QA, instruction following, MCQA
+- **Examples**: OpenHermes, Alpaca, AcademicMCQA, general chat datasets
+
+### **Math-Only Datasets** (`dataset_type: "math_only"`)
+- **Available Formats**: Generic formats + LaTeX math expressions
+- **Additional Formats**: `$\boxed{}$`, `\begin{align}`, `$\text{}$`, etc.
+- **Use Case**: Mathematical problem solving
+- **Examples**: GSM8K, MATH, mathematical reasoning datasets
+
+### **Code-Only Datasets** (`dataset_type: "code_only"`)
+- **Available Formats**: Generic formats + programming-specific formats
+- **Additional Formats**: `print()`, `console.log()`, comments, return statements
+- **Use Case**: Code generation, programming problems
+- **Examples**: HumanEval, MBPP, coding datasets
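+
+A sketch of how this dataset-type filtering might map onto format pools (pool contents are abbreviated and illustrative, not the environment's full lists):
+
+```python
+GENERIC_FORMATS = {"json", "xml", "natural_language_answer"}
+MATH_FORMATS = {"latex_boxed_math", "latex_align", "latex_text"}
+CODE_FORMATS = {"python_print", "console_log", "return_statement"}
+
+
+def formats_for(dataset_type: str) -> set:
+    # Math and code datasets extend the generic pool; everything else uses it as-is
+    if dataset_type == "math_only":
+        return GENERIC_FORMATS | MATH_FORMATS
+    if dataset_type == "code_only":
+        return GENERIC_FORMATS | CODE_FORMATS
+    return GENERIC_FORMATS
+```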
+
+## 🎲 Dynamic Compositor System
+
+The dynamic compositor creates complex structured responses by randomly combining components:
+
+### **Component Categories**
+- **Analysis**: `problem_analysis`, `requirements_analysis`, `context_analysis`
+- **Reasoning**: `logical_reasoning`, `step_by_step`, `causal_reasoning`
+- **Planning**: `approach`, `methodology`, `strategy`
+- **Technical**: `implementation`, `code_structure`, `algorithm`
+- **Evaluation**: `validation`, `testing`, `verification`
+- **Output**: `final_answer`, `conclusion`, `summary`
+
+### **Output Formats**
+- **XML**: `<component_name>content</component_name>`
+- **JSON**: `{"component_name": "content"}`
+- **YAML**: `component_name: content`
+- **TOML**: `component_name = "content"`
+
+### **Example Dynamic Format**
+```xml
+<requirements_analysis>Understanding the requirements...</requirements_analysis>
+<step_by_step>Step by step analysis...</step_by_step>
+<methodology>My methodology will be...</methodology>
+<implementation>Here's the solution...</implementation>
+<final_answer>42</final_answer>
+```
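+
+An example like the one above could be produced by a compositor along these lines. The component pools are abbreviated, and the assumption that an output component always closes the response is inferred from the example, not confirmed by the source:
+
+```python
+import random
+
+COMPONENT_POOLS = {
+    "analysis": ["problem_analysis", "requirements_analysis"],
+    "reasoning": ["logical_reasoning", "step_by_step"],
+    "planning": ["approach", "methodology"],
+    "technical": ["implementation", "code_structure"],
+    "output": ["final_answer", "conclusion"],
+}
+
+
+def compose_xml_format() -> list:
+    """Pick 3-6 components; assume an output component always comes last."""
+    n = random.randint(3, 6)
+    middle_pool = [c for key in ("analysis", "reasoning", "planning", "technical")
+                   for c in COMPONENT_POOLS[key]]
+    chosen = random.sample(middle_pool, n - 1) + [random.choice(COMPONENT_POOLS["output"])]
+    return [f"<{c}>...</{c}>" for c in chosen]
+```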
+
+## 📈 Monitoring & Analytics
+
+### **WandB Metrics**
+- `train/percent_correct`: Overall format compliance rate
+- `train/format_success_rate_{format}`: Per-format success rates
+- `train/format_usage_count_{format}`: Usage frequency per format
+- `train/equivalent_ratio_paused_formats`: Number of paused formats
+- `train/group_success_rate`: Percentage of successful groups
+- `train/failed_groups_count`: Number of completely failed groups
+
+### **Group-Level Statistics**
+```
+Format: json_confidence | Group average score: 0.7500 | 12/16 correct (75.0%)
+Format: latex_boxed_math | Group average score: 0.0000 | 0/16 correct (0.0%) (All failures!)
+Format: natural_language_answer | Group average score: 1.0000 | 16/16 correct (100.0%) (Perfect group!)
+```
+
+### **Data Dumps**
+- **Regular Rollouts**: `answer_format_rollouts_{uuid}_{batch}.jsonl`
+- **Failed Rollouts**: `answer_format_failed_rollouts_{uuid}_{batch}.jsonl`
+- **Metadata**: Format type, scores, conversation history, timestamps
+- **Batch Size**: 100 groups per file
+
+## ⚖️ Equivalent Ratio Enforcement
+
+Optional system to ensure balanced training across all formats:
+
+### **How It Works**
+1. Tracks successful groups per format
+2. Pauses formats that reach threshold (default: 50 successful groups)
+3. Continues training other formats until they catch up
+4. Resumes paused formats when balance is restored
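+
+A sketch of the pause/resume rule above (the real environment tracks this internally; the class and method names here are illustrative):
+
+```python
+from collections import defaultdict
+
+
+class RatioTracker:
+    def __init__(self, threshold: int = 50):
+        self.threshold = threshold
+        self.successes = defaultdict(int)  # successful groups per format
+
+    def record_success(self, fmt: str) -> None:
+        self.successes[fmt] += 1
+
+    def is_paused(self, fmt: str) -> bool:
+        # Pause once a format hits the threshold while others lag;
+        # it resumes automatically when the slowest format catches up
+        slowest = min(self.successes.values(), default=0)
+        return self.successes[fmt] >= self.threshold and self.successes[fmt] > slowest
+```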
+
+### **Configuration**
+```python
+ensure_equivalent_ratios=True, # Enable the system
+format_group_threshold=50, # Success threshold per format
+```
+
+### **Monitoring**
+```python
+status = env.get_equivalent_ratio_status()
+print(f"Paused formats: {status['paused_formats']}")
+print(f"Active formats: {status['active_formats']}")
+print(f"Progress: {status['format_progress']}")
+```
+
+## 🔧 Advanced Usage
+
+### **Custom Format Filtering**
+```python
+from atropos.environments.answer_format_environment import AnswerFormat
+
+config = AnswerFormatEnvConfig(
+ supported_formats=[
+ AnswerFormat.JSON,
+ AnswerFormat.XML,
+ AnswerFormat.LATEX_BOXED,
+ AnswerFormat.NATURAL_LANGUAGE_ANSWER
+ ]
+)
+```
+
+### **Evaluation Mode**
+```python
+# Run evaluation on held-out set
+eval_score = await env.evaluate()
+print(f"Evaluation format compliance: {eval_score}")
+```
+
+### **Supported Dataset Formats**
+
+#### **OpenHermes Conversations**
+```python
+{
+ "name": "teknium/OpenHermes-2.5",
+ "prompt_field": "conversations",
+ "answer_field": "conversations"
+}
+```
+
+#### **GSM8K Math Problems**
+```python
+{
+ "name": "gsm8k",
+ "prompt_field": "question",
+ "answer_field": "answer"
+}
+```
+
+#### **AcademicMCQA Multiple Choice**
+```python
+{
+ "name": "NousResearch/AcademicMCQA",
+ "prompt_field": "prompt",
+ "answer_field": "ground_truth",
+ "metadata_fields": ["answer", "options"]
+}
+```
+
+## 📝 Response Format Requirements
+
+### **Thinking Tags**
+- **Required**: Exactly one `<think>` opening and one `</think>` closing tag
+- **Content**: All reasoning must be inside thinking tags
+- **Placement**: Answer must appear after `</think>` in the specified format
+- **Validation**: No additional thinking tags allowed after the first `</think>`
+
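+A minimal sketch of these rules as a validator might enforce them (illustrative, not the environment's actual parser):
+
+```python
+import re
+
+THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
+
+
+def split_think_and_answer(response: str):
+    """Return (thinking, answer) for a compliant response, else None."""
+    blocks = THINK_BLOCK.findall(response)
+    if len(blocks) != 1:
+        return None  # missing or duplicated <think> section
+    answer = response.split("</think>", 1)[1]
+    if "<think>" in answer:
+        return None  # no thinking tags allowed after the first </think>
+    return blocks[0].strip(), answer.strip()
+```
+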
+### **Example Valid Response**
+```
+<think>
+Let me analyze this problem step by step.
+First, I need to understand what's being asked...
+The solution involves calculating...
+</think>
+
+{"answer": "42"}
+```
+
+### **Example Invalid Response**
+```
+<think>Some reasoning</think>
+{"answer": "42"}
+<think>More reasoning</think>  // ❌ Additional thinking tags not allowed
+```
+
+## 🚨 Common Issues & Solutions
+
+### **Format Validation Failures**
+- **Issue**: Response doesn't match expected format
+- **Solution**: Check regex patterns and ensure exact format compliance
+- **Debug**: Enable `debug_logging=True` for detailed validation info
+
+### **Thinking Tag Violations**
+- **Issue**: Multiple thinking sections or missing tags
+- **Solution**: Ensure exactly one `<think>...</think>` section per response
+- **Debug**: Check thinking tag validation logs
+
+### **Dataset Loading Errors**
+- **Issue**: Dataset not found or field missing
+- **Solution**: Verify dataset name, split, and field names
+- **Debug**: Check dataset configuration and field mappings
+
+### **Memory Issues with Large Datasets**
+- **Issue**: Out of memory with large sample sizes
+- **Solution**: Reduce `sample_size` or use streaming datasets
+- **Debug**: Monitor memory usage and adjust batch sizes
+
+## 🔍 Debugging & Development
+
+### **Enable Debug Logging**
+```python
+config = AnswerFormatEnvConfig(debug_logging=True)
+```
+
+### **Check Format Status**
+```python
+# Get equivalent ratio status
+status = env.get_equivalent_ratio_status()
+
+# Check format success rates
+for format_name, success_rate in env.format_success_rates.items():
+ print(f"{format_name}: {success_rate:.2%}")
+```
+
+### **Analyze Failed Rollouts**
+```python
+# Failed rollouts are automatically saved when dump_failed_rollouts=True
+# Check the datadumps directory for analysis files
+```
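+
+A sketch for loading those dumps, assuming the JSONL naming described in the Data Dumps section (the directory and the keys inside each record are assumptions; inspect a file to confirm):
+
+```python
+import glob
+import json
+
+for path in glob.glob("datadumps/answer_format_failed_rollouts_*.jsonl"):
+    with open(path) as f:
+        for line in f:
+            group = json.loads(line)
+            # Inspect format type, scores, and conversations per failed group
+            print(path, sorted(group.keys()))
+```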
+
+## 📚 Technical Details
+
+### **Scoring System**
+- **Format Compliance**: 1.0 for correct format, 0.0 for incorrect
+- **No Answer Accuracy**: Content correctness is not evaluated
+- **Group Scoring**: Average of all rollouts in a group
+- **Success Threshold**: Groups with >0.0 average are considered successful
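+
+A sketch of the group scoring arithmetic implied by the rules above:
+
+```python
+def score_group(format_ok: list) -> float:
+    """Binary per-rollout scores averaged over the group."""
+    return sum(1.0 if ok else 0.0 for ok in format_ok) / len(format_ok)
+
+
+# 12 of 16 rollouts format-compliant -> group average 0.75 (a successful group)
+assert score_group([True] * 12 + [False] * 4) == 0.75
+```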
+
+### **Format Validation**
+- **Regex Patterns**: Each format has specific validation patterns
+- **Exact Matching**: Strict compliance required for scoring
+- **Content Extraction**: Validated content is extracted for consistency
+
+### **Dynamic Format Generation**
+- **Component Selection**: Random selection of 3-6 components
+- **Format Templates**: XML, JSON, YAML, TOML output formats
+- **Validation Storage**: Components stored for precise validation
+
+### **Special Dataset Handling**
+- **GSM8K**: Extracts numerical answers from the `####`-separated answer field (see the sketch after this list)
+- **AcademicMCQA**: Uses ground truth letters (A, B, C, D) as answers
+- **OpenHermes**: Extracts from conversation format with role-based parsing
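+
+For instance, GSM8K answer extraction can be sketched as follows (illustrative, not the environment's exact code):
+
+```python
+def extract_gsm8k_answer(answer_field: str) -> str:
+    """GSM8K answers end with '#### <number>'; keep only the final number."""
+    return answer_field.rsplit("####", 1)[-1].strip()
+
+
+# "Natalia sold 48 clips... #### 72" -> "72"
+```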
+
+## 🤝 Contributing
+
+### **Adding New Formats**
+1. Add enum value to `AnswerFormat`
+2. Add system prompt instruction
+3. Add validation regex pattern
+4. Add content extraction logic
+5. Test with sample responses
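+
+Steps 1-3 might look like this; the enum value, instruction string, and regex below are hypothetical examples, and the dictionary names are placeholders for wherever the environment registers them:
+
+```python
+import re
+from enum import Enum
+
+
+class AnswerFormat(Enum):
+    # ... existing formats ...
+    DOUBLE_BRACKETS = "double_brackets"  # hypothetical new format
+
+
+FORMAT_INSTRUCTIONS = {
+    AnswerFormat.DOUBLE_BRACKETS: "Wrap your final answer in [[...]].",
+}
+FORMAT_PATTERNS = {
+    AnswerFormat.DOUBLE_BRACKETS: re.compile(r"\[\[(.+?)\]\]", re.DOTALL),
+}
+```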
+
+### **Adding New Dataset Types**
+1. Define dataset type in configuration
+2. Add format filtering logic
+3. Update format selection methods
+4. Test with representative datasets
+
+### **Adding New Datasets**
+1. Add special handling in `setup()` method
+2. Define field mappings in configuration
+3. Add metadata extraction logic
+4. Test dataset loading and processing
+
+## 📄 License
+
+This environment is part of the Atropos training framework. See the main repository for license information.
+
+---
+
+**Need Help?** Check the debug logs, enable verbose logging, or review the comprehensive monitoring metrics for troubleshooting guidance.
\ No newline at end of file
diff --git a/environments/answer_format_environment/answer_format_environment.py b/environments/answer_format_environment/answer_format_environment.py
new file mode 100644
index 00000000..e78503c3
--- /dev/null
+++ b/environments/answer_format_environment/answer_format_environment.py
@@ -0,0 +1,4258 @@
+"""
+Answer Format Environment
+
+This environment trains models to generate responses in specific formats.
+It focuses on format adherence rather than answer correctness, using randomized
+format requirements and corresponding parsers.
+
+Key Features:
+- Randomized answer format selection from 150+ supported formats
+- Strict thinking tag validation (exactly one section)
+- Format-specific parsers for validation
+- Support for multiple input datasets that get shuffled together
+- Dataset type-aware format selection (generic, math_only, code_only)
+- Dynamic compositor system for complex structured responses
+- Comprehensive data dumping and logging following environment conventions
+- Format compliance scoring (1.0 for correct format, 0.0 for incorrect)
+- Format success rate tracking and monitoring
+- Weighted format selection for balanced training
+- Optional equivalent ratio enforcement (stops generating formats after N successful groups)
+
+Supported Answer Formats:
+- Basic structured data: JSON, YAML, TOML (with confidence scores)
+- XML/HTML tags: <answer>,