# Answer Format Environment

A comprehensive training environment for teaching language models to generate responses in specific structured formats. This environment focuses on **format adherence** rather than answer correctness, using randomized format requirements and corresponding parsers to train models on structured response generation.

## 🎯 Overview

The Answer Format Environment trains models to:

- Generate responses in 150+ different structured formats
- Follow strict thinking tag discipline (`<think></think>`)
- Parse and validate format compliance
- Handle multiple dataset types with appropriate format selection
- Maintain equivalent training ratios across formats (optional)

**Key Philosophy**: This environment scores based on format compliance (1.0 for correct format, 0.0 for incorrect), not answer accuracy. It teaches models to follow formatting instructions precisely.

## ✨ Key Features

### 🔄 **Randomized Format Selection**
- 150+ supported answer formats across multiple categories
- Weighted format selection (70% simple, 30% complex)
- Dataset type-aware format filtering (generic, math_only, code_only)
- Dynamic compositor system for complex structured responses

### 🧠 **Thinking Tag Validation**
- Enforces exactly one `<think>...</think>` section per response
- All reasoning must be contained within thinking tags
- Strict validation prevents multiple thinking sections
- Answer must appear after `</think>` in the specified format

### 📊 **Comprehensive Data Management**
- Multi-dataset support with automatic shuffling
- Configurable train/eval splits
- Extensive data dumping with group-level statistics
- Failed rollout tracking and analysis
- WandB integration with detailed metrics

### ⚖️ **Equivalent Ratio Enforcement**
- Optional system to ensure balanced training across formats
- Pauses formats after reaching a success threshold
- Prevents format bias in training data
- Comprehensive monitoring and status reporting

### 🔍 **Advanced Monitoring**
- Format success rate tracking
- Group-level performance statistics
- Failure case analysis and logging
- Real-time format balance monitoring

## 📋 Supported Format Categories

### **Basic Structured Data**
```json
{"answer": "content"}   // JSON
answer: content         // YAML
answer = "content"      // TOML
```

### **XML/HTML Tags**
```xml
<answer>content</answer>                // XML
Final Answer: <answer>content</answer>  // XML with prefix
<output>content</output>                // Output tags
<result>content</result>                // Result tags
```

### **LaTeX Formats**
```latex
\boxed{content}                       // Text-friendly boxed
$\boxed{expression}$                  // Math-only boxed
\begin{align} expression \end{align}  // Math alignment
$\text{answer}$                       // Text in math mode
```

### **Natural Language**
```
The answer is: content
Final answer: content
In conclusion: content
Therefore: content
```

### **Programming Formats**
```python
print("answer")        // Python print
console.log("answer")  // JavaScript console
# answer               // Python comment
return "answer"        // Return statement
```

### **Complex Multi-Tag Formats**
```xml
...
...
...
...
```

### **Dynamic Compositor Formats**
Randomly combines 3-6 components in XML, JSON, YAML, or TOML:

- Analysis components (problem_analysis, requirements_analysis, etc.)
- Reasoning components (logical_reasoning, step_by_step, etc.)
- Planning components (approach, methodology, etc.)
- Technical components (implementation, code_structure, etc.)
- Evaluation components (validation, testing, etc.)
- Output components (final_answer, conclusion, etc.)
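The scoring rule behind all of these formats (1.0 for exact compliance, 0.0 otherwise, with exactly one thinking section) can be sketched as a small standalone validator. This is an illustrative sketch, not the environment's actual implementation, and it covers only the JSON answer format; the function name is hypothetical.

```python
import json
import re


def score_response(response: str) -> float:
    """Toy format-compliance scorer: 1.0 for a valid response, else 0.0.

    A valid response has exactly one <think>...</think> section, and the
    text after </think> must be a JSON object of the form {"answer": ...}.
    """
    # Exactly one opening and one closing thinking tag.
    if response.count("<think>") != 1 or response.count("</think>") != 1:
        return 0.0

    match = re.search(r"<think>(.*?)</think>(.*)", response, re.DOTALL)
    if match is None:
        return 0.0

    answer_part = match.group(2).strip()

    # The answer must parse as JSON with an "answer" key and nothing else.
    try:
        parsed = json.loads(answer_part)
    except json.JSONDecodeError:
        return 0.0
    if not (isinstance(parsed, dict) and set(parsed) == {"answer"}):
        return 0.0
    return 1.0
```

Scoring is all-or-nothing by design: a response that reasons correctly but emits `The answer is 42` when JSON was requested still scores 0.0.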
## 🚀 Quick Start

### Basic Configuration
```python
from atropos.environments.answer_format_environment import AnswerFormatEnv, AnswerFormatEnvConfig

# Simple configuration
config = AnswerFormatEnvConfig(
    dataset_configs=[
        {
            "name": "your_dataset",
            "split": "train",
            "sample_size": 1000,
            "prompt_field": "question",
            "answer_field": "answer",
            "dataset_type": "generic"
        }
    ],
    debug_logging=True,
    dump_rollouts=True,
    eval_set_percentage=0.1
)

# Initialize environment
env = AnswerFormatEnv(config, server_configs)
```

### Multi-Dataset Configuration
```python
config = AnswerFormatEnvConfig(
    dataset_configs=[
        {
            "name": "teknium/OpenHermes-2.5",
            "split": "train",
            "sample_size": 5000,
            "prompt_field": "conversations",
            "answer_field": "conversations",
            "metadata_fields": ["source"],
            "dataset_type": "generic"
        },
        {
            "name": "gsm8k",
            "split": "train",
            "sample_size": 2000,
            "prompt_field": "question",
            "answer_field": "answer",
            "dataset_type": "math_only"
        },
        {
            "name": "NousResearch/AcademicMCQA",
            "split": "train",
            "sample_size": 5000,
            "prompt_field": "prompt",
            "answer_field": "ground_truth",
            "metadata_fields": ["answer", "options"],
            "dataset_type": "generic"
        }
    ],
    ensure_equivalent_ratios=True,
    format_group_threshold=50,
    dump_failed_rollouts=True
)
```

## ⚙️ Configuration Options

### **Dataset Configuration**
```python
dataset_configs: List[Dict[str, Any]] = [
    {
        "name": str,                   # Dataset name or HuggingFace path
        "split": str,                  # Dataset split ("train", "test", etc.)
        "sample_size": int,            # Number of samples to use
        "prompt_field": str,           # Field containing prompts/questions
        "answer_field": str,           # Field containing answers
        "metadata_fields": List[str],  # Additional fields to preserve
        "dataset_type": str            # "generic", "math_only", or "code_only"
    }
]
```

### **Core Settings**
```python
debug_logging: bool = True                 # Enable detailed logging
dump_rollouts: bool = True                 # Save rollouts to JSONL
dump_failed_rollouts: bool = True          # Save failed rollouts separately
rollout_save_score_threshold: float = 0.0  # Minimum score to save rollouts
eval_set_percentage: float = 0.1           # Evaluation set percentage
```

### **Format Control**
```python
supported_formats: Optional[List[AnswerFormat]] = None  # Filter to specific formats
ensure_equivalent_ratios: bool = False                  # Enable ratio enforcement
format_group_threshold: int = 50                        # Success threshold per format
```

## 📊 Dataset Types & Format Selection

### **Generic Datasets** (`dataset_type: "generic"`)
- **Available Formats**: All basic formats (JSON, XML, natural language, brackets, etc.)
- **Use Case**: General conversation, QA, instruction following, MCQA
- **Examples**: OpenHermes, Alpaca, AcademicMCQA, general chat datasets

### **Math-Only Datasets** (`dataset_type: "math_only"`)
- **Available Formats**: Generic formats + LaTeX math expressions
- **Additional Formats**: `$\boxed{}$`, `\begin{align}`, `$\text{}$`, etc.
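Dataset-type-aware filtering amounts to a simple lookup from dataset type to candidate formats. The sketch below is illustrative only; the format names are hypothetical, not the environment's actual `AnswerFormat` enum values.

```python
# Hypothetical sketch of dataset-type-aware format filtering.
GENERIC_FORMATS = ["json", "xml", "natural_language", "brackets"]
MATH_FORMATS = ["latex_boxed_math", "latex_align", "latex_text"]
CODE_FORMATS = ["python_print", "console_log", "python_comment", "return_statement"]


def formats_for(dataset_type: str) -> list[str]:
    """Return the candidate answer formats for a dataset type."""
    if dataset_type == "math_only":
        # Math datasets get the generic formats plus LaTeX math expressions.
        return GENERIC_FORMATS + MATH_FORMATS
    if dataset_type == "code_only":
        # Code datasets get the generic formats plus programming formats.
        return GENERIC_FORMATS + CODE_FORMATS
    return GENERIC_FORMATS
```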
- **Use Case**: Mathematical problem solving
- **Examples**: GSM8K, MATH, mathematical reasoning datasets

### **Code-Only Datasets** (`dataset_type: "code_only"`)
- **Available Formats**: Generic formats + programming-specific formats
- **Additional Formats**: `print()`, `console.log()`, comments, return statements
- **Use Case**: Code generation, programming problems
- **Examples**: HumanEval, MBPP, coding datasets

## 🎲 Dynamic Compositor System

The dynamic compositor creates complex structured responses by randomly combining components:

### **Component Categories**
- **Analysis**: `problem_analysis`, `requirements_analysis`, `context_analysis`
- **Reasoning**: `logical_reasoning`, `step_by_step`, `causal_reasoning`
- **Planning**: `approach`, `methodology`, `strategy`
- **Technical**: `implementation`, `code_structure`, `algorithm`
- **Evaluation**: `validation`, `testing`, `verification`
- **Output**: `final_answer`, `conclusion`, `summary`

### **Output Formats**
- **XML**: `<component_name>content</component_name>`
- **JSON**: `{"component_name": "content"}`
- **YAML**: `component_name: content`
- **TOML**: `component_name = "content"`

### **Example Dynamic Format**
```xml
<problem_analysis>Understanding the requirements...</problem_analysis>
<step_by_step>Step by step analysis...</step_by_step>
<methodology>My methodology will be...</methodology>
<implementation>Here's the solution...</implementation>
<final_answer>42</final_answer>
```

## 📈 Monitoring & Analytics

### **WandB Metrics**
- `train/percent_correct`: Overall format compliance rate
- `train/format_success_rate_{format}`: Per-format success rates
- `train/format_usage_count_{format}`: Usage frequency per format
- `train/equivalent_ratio_paused_formats`: Number of paused formats
- `train/group_success_rate`: Percentage of successful groups
- `train/failed_groups_count`: Number of completely failed groups

### **Group-Level Statistics**
```
Format: json_confidence | Group average score: 0.7500 | 12/16 correct (75.0%)
Format: latex_boxed_math | Group average score: 0.0000 | 0/16 correct (0.0%) (All failures!)
Format: natural_language_answer | Group average score: 1.0000 | 16/16 correct (100.0%) (Perfect group!)
```

### **Data Dumps**
- **Regular Rollouts**: `answer_format_rollouts_{uuid}_{batch}.jsonl`
- **Failed Rollouts**: `answer_format_failed_rollouts_{uuid}_{batch}.jsonl`
- **Metadata**: Format type, scores, conversation history, timestamps
- **Batch Size**: 100 groups per file

## ⚖️ Equivalent Ratio Enforcement

Optional system to ensure balanced training across all formats:

### **How It Works**
1. Tracks successful groups per format
2. Pauses formats that reach the threshold (default: 50 successful groups)
3. Continues training other formats until they catch up
4. Resumes paused formats when balance is restored

### **Configuration**
```python
ensure_equivalent_ratios=True,  # Enable the system
format_group_threshold=50,      # Success threshold per format
```

### **Monitoring**
```python
status = env.get_equivalent_ratio_status()
print(f"Paused formats: {status['paused_formats']}")
print(f"Active formats: {status['active_formats']}")
print(f"Progress: {status['format_progress']}")
```

## 🔧 Advanced Usage

### **Custom Format Filtering**
```python
from atropos.environments.answer_format_environment import AnswerFormat

config = AnswerFormatEnvConfig(
    supported_formats=[
        AnswerFormat.JSON,
        AnswerFormat.XML,
        AnswerFormat.LATEX_BOXED,
        AnswerFormat.NATURAL_LANGUAGE_ANSWER
    ]
)
```

### **Evaluation Mode**
```python
# Run evaluation on held-out set
eval_score = await env.evaluate()
print(f"Evaluation format compliance: {eval_score}")
```

### **Supported Dataset Formats**

#### **OpenHermes Conversations**
```python
{
    "name": "teknium/OpenHermes-2.5",
    "prompt_field": "conversations",
    "answer_field": "conversations"
}
```

#### **GSM8K Math Problems**
```python
{
    "name": "gsm8k",
    "prompt_field": "question",
    "answer_field": "answer"
}
```

#### **AcademicMCQA Multiple Choice**
```python
{
    "name": "NousResearch/AcademicMCQA",
    "prompt_field": "prompt",
    "answer_field": "ground_truth",
    "metadata_fields": ["answer", "options"]
}
```

## 📝 Response Format Requirements

### **Thinking Tags**
- **Required**: Exactly one `<think>` opening and one `</think>` closing tag
- **Content**: All reasoning must be inside thinking tags
- **Placement**: Answer must appear after `</think>` in the specified format
- **Validation**: No additional thinking tags allowed after the first `</think>`

### **Example Valid Response**
```
<think>
Let me analyze this problem step by step.
First, I need to understand what's being asked...
The solution involves calculating...
</think>

{"answer": "42"}
```

### **Example Invalid Response**
```
<think>
Some reasoning
</think>

{"answer": "42"}

<think>  // ❌ Additional thinking tags not allowed
More reasoning
</think>
```

## 🚨 Common Issues & Solutions

### **Format Validation Failures**
- **Issue**: Response doesn't match the expected format
- **Solution**: Check regex patterns and ensure exact format compliance
- **Debug**: Enable `debug_logging=True` for detailed validation info

### **Thinking Tag Violations**
- **Issue**: Multiple thinking sections or missing tags
- **Solution**: Ensure exactly one `<think>...</think>` section per response
- **Debug**: Check the thinking tag validation logs

### **Dataset Loading Errors**
- **Issue**: Dataset not found or field missing
- **Solution**: Verify dataset name, split, and field names
- **Debug**: Check dataset configuration and field mappings

### **Memory Issues with Large Datasets**
- **Issue**: Out of memory with large sample sizes
- **Solution**: Reduce `sample_size` or use streaming datasets
- **Debug**: Monitor memory usage and adjust batch sizes

## 🔍 Debugging & Development

### **Enable Debug Logging**
```python
config = AnswerFormatEnvConfig(debug_logging=True)
```

### **Check Format Status**
```python
# Get equivalent ratio status
status = env.get_equivalent_ratio_status()

# Check format success rates
for format_name, success_rate in env.format_success_rates.items():
    print(f"{format_name}: {success_rate:.2%}")
```

### **Analyze Failed Rollouts**
```python
# Failed rollouts are automatically saved when dump_failed_rollouts=True
# Check the datadumps directory for analysis files
```

## 📚 Technical Details

### **Scoring System**
- **Format Compliance**: 1.0 for correct format, 0.0 for incorrect
- **No Answer Accuracy**: Content correctness is not evaluated
- **Group Scoring**: Average of all rollouts in a group
- **Success Threshold**: Groups with a >0.0 average are considered successful

### **Format Validation**
- **Regex Patterns**: Each format has specific validation patterns
- **Exact Matching**: Strict compliance required for scoring
- **Content Extraction**: Validated content is extracted for consistency

### **Dynamic Format Generation**
- **Component Selection**: Random selection of 3-6 components
- **Format Templates**: XML, JSON, YAML, TOML output formats
- **Validation Storage**: Components stored for precise validation

### **Special Dataset Handling**
- **GSM8K**: Extracts numerical answers from the `####`-separated format
- **AcademicMCQA**: Uses ground-truth letters (A, B, C, D) as answers
- **OpenHermes**: Extracts from conversation format with role-based parsing

## 🤝 Contributing

### **Adding New Formats**
1. Add an enum value to `AnswerFormat`
2. Add a system prompt instruction
3. Add a validation regex pattern
4. Add content extraction logic
5. Test with sample responses

### **Adding New Dataset Types**
1. Define the dataset type in configuration
2. Add format filtering logic
3. Update format selection methods
4. Test with representative datasets

### **Adding New Datasets**
1. Add special handling in the `setup()` method
2. Define field mappings in configuration
3. Add metadata extraction logic
4. Test dataset loading and processing

## 📄 License

This environment is part of the Atropos training framework. See the main repository for license information.

---

**Need Help?** Check the debug logs, enable verbose logging, or review the comprehensive monitoring metrics for troubleshooting guidance.