Answer Format Environment

An environment for teaching language models to generate responses in specific structured formats. It scores format adherence rather than answer correctness, pairing randomized format requirements with matching parsers to evaluate structured response generation.

⚠️ Important: Rejection Sampling Focused

This environment is primarily designed for rejection sampling, not traditional RL training. Since we only validate format compliance and do not verify answer correctness, the binary scoring (1.0 for correct format, 0.0 for incorrect) makes it less suitable for gradient-based RL methods. Instead, it excels at:

  • Rejection Sampling: Filter model outputs based on format compliance
  • Format Evaluation: Assess model capabilities across different structured formats
  • Data Curation: Generate format-compliant training data
  • Format Benchmarking: Compare model performance on format adherence tasks
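
To make the rejection-sampling use case concrete, here is a minimal sketch that keeps only completions scoring 1.0 on format. The `is_format_compliant` checker is a hypothetical stand-in for the environment's validators (it only accepts a JSON object with an `answer` key) and is not part of the environment's API.

```python
import json
from typing import Callable, List


def is_format_compliant(completion: str) -> bool:
    """Hypothetical checker: accept a JSON object that has an "answer" key."""
    try:
        parsed = json.loads(completion)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and "answer" in parsed


def rejection_sample(completions: List[str], checker: Callable[[str], bool]) -> List[str]:
    """Keep only the completions that pass the format check (score 1.0)."""
    return [c for c in completions if checker(c)]


candidates = ['{"answer": "42"}', "The answer is 42", '{"result": "42"}']
kept = rejection_sample(candidates, is_format_compliant)
```

Only the first candidate survives. Note that the filter says nothing about whether "42" is the right answer, which matches the environment's scoring.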

🎯 Overview

The Answer Format Environment evaluates models on:

  • Generating responses in 150+ different structured formats
  • Following strict thinking tag discipline (<think></think>)
  • Format compliance validation and parsing
  • Handling multiple dataset types with appropriate format selection
  • Maintaining equivalent evaluation ratios across formats (optional)

Key Philosophy: This environment scores based on format compliance (1.0 for correct format, 0.0 for incorrect), not answer accuracy. It teaches models to follow formatting instructions precisely without validating the correctness of the actual answers.

Key Features

🔄 Randomized Format Selection

  • 150+ supported answer formats across multiple categories
  • Weighted format selection (70% simple, 30% complex)
  • Dataset type-aware format filtering (generic, math_only, code_only)
  • Dynamic compositor system for complex structured responses
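
The 70/30 weighting above could be implemented along these lines. This is a sketch under assumptions: the two pools and the `pick_format` helper are hypothetical, and the environment may group and weight its formats differently.

```python
import random

# Hypothetical format pools; the real environment defines 150+ formats.
SIMPLE_FORMATS = ["json", "xml", "natural_language_answer"]
COMPLEX_FORMATS = ["multi_tag", "dynamic_compositor"]


def pick_format(rng: random.Random, simple_weight: float = 0.7) -> str:
    """Pick a simple format with probability simple_weight, else a complex one."""
    pool = SIMPLE_FORMATS if rng.random() < simple_weight else COMPLEX_FORMATS
    return rng.choice(pool)


rng = random.Random(0)  # seeded so the run is repeatable
picks = [pick_format(rng) for _ in range(10_000)]
simple_share = sum(p in SIMPLE_FORMATS for p in picks) / len(picks)
```

Over many draws `simple_share` settles near 0.7, while formats within each pool are chosen uniformly.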

🧠 Thinking Tag Validation

  • Enforces exactly one <think></think> section per response
  • All reasoning must be contained within thinking tags
  • Strict validation prevents multiple thinking sections
  • Answer must appear after </think> in specified format

📊 Comprehensive Data Management

  • Multi-dataset support with automatic shuffling
  • Configurable train/eval splits
  • Extensive data dumping with group-level statistics
  • Failed rollout tracking and analysis
  • WandB integration with detailed metrics

⚖️ Equivalent Ratio Enforcement

  • Optional system to ensure balanced evaluation across formats
  • Pauses formats after reaching success threshold
  • Prevents format bias in evaluation data
  • Comprehensive monitoring and status reporting

🔍 Advanced Monitoring

  • Format success rate tracking
  • Group-level performance statistics
  • Failure case analysis and logging
  • Real-time format balance monitoring

📋 Supported Format Categories

Basic Structured Data

{"answer": "content"}                    // JSON
answer: content                          // YAML
answer = "content"                       // TOML

XML/HTML Tags

<answer>content</answer>                 // XML
<answer>Final Answer: content</answer>   // XML with prefix
<output>content</output>                 // Output tags
<result>content</result>                 // Result tags

LaTeX Formats

\boxed{content}                          // Text-friendly boxed
$\boxed{expression}$                     // Math-only boxed
\begin{align} expression \end{align}     // Math alignment
$\text{answer}$                          // Text in math mode

Natural Language

The answer is: content
Final answer: content
In conclusion: content
Therefore: content

Programming Formats

print("answer")                          // Python print
console.log("answer")                    // JavaScript console
# answer                                 // Python comment
return "answer"                          // Return statement

Complex Multi-Tag Formats

<restatement>...</restatement>
<reasoning>...</reasoning>
<solution>...</solution>
<explanation>...</explanation>

Dynamic Compositor Formats

Randomly combines 3-6 components in XML, JSON, YAML, or TOML:

  • Analysis components (problem_analysis, requirements_analysis, etc.)
  • Reasoning components (logical_reasoning, step_by_step, etc.)
  • Planning components (approach, methodology, etc.)
  • Technical components (implementation, code_structure, etc.)
  • Evaluation components (validation, testing, etc.)
  • Output components (final_answer, conclusion, etc.)

🚀 Quick Start

Basic Configuration

from atropos.environments.answer_format_environment import AnswerFormatEnv, AnswerFormatEnvConfig

# Simple configuration
config = AnswerFormatEnvConfig(
    dataset_configs=[
        {
            "name": "your_dataset",
            "split": "train",
            "sample_size": 1000,
            "prompt_field": "question",
            "answer_field": "answer",
            "dataset_type": "generic"
        }
    ],
    debug_logging=True,
    dump_rollouts=True,
    eval_set_percentage=0.1
)

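# Initialize environment
# NOTE: server_configs is not defined in this snippet; pass your inference server configuration here.
env = AnswerFormatEnv(config, server_configs)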

Multi-Dataset Configuration

config = AnswerFormatEnvConfig(
    dataset_configs=[
        {
            "name": "teknium/OpenHermes-2.5",
            "split": "train",
            "sample_size": 5000,
            "prompt_field": "conversations",
            "answer_field": "conversations",
            "metadata_fields": ["source"],
            "dataset_type": "generic"
        },
        {
            "name": "gsm8k",
            "split": "train",
            "sample_size": 2000,
            "prompt_field": "question",
            "answer_field": "answer",
            "dataset_type": "math_only"
        },
        {
            "name": "NousResearch/AcademicMCQA",
            "split": "train",
            "sample_size": 5000,
            "prompt_field": "prompt",
            "answer_field": "ground_truth",
            "metadata_fields": ["answer", "options"],
            "dataset_type": "generic"
        }
    ],
    ensure_equivalent_ratios=True,
    format_group_threshold=50,
    dump_failed_rollouts=True
)

⚙️ Configuration Options

Dataset Configuration

dataset_configs: List[Dict[str, Any]] = [
    {
        "name": str,                    # Dataset name or HuggingFace path
        "split": str,                   # Dataset split ("train", "test", etc.)
        "sample_size": int,             # Number of samples to use
        "prompt_field": str,            # Field containing prompts/questions
        "answer_field": str,            # Field containing answers
        "metadata_fields": List[str],   # Additional fields to preserve
        "dataset_type": str             # "generic", "math_only", or "code_only"
    }
]

Core Settings

debug_logging: bool = True                    # Enable detailed logging
dump_rollouts: bool = True                    # Save rollouts to JSONL
dump_failed_rollouts: bool = True             # Save failed rollouts separately
rollout_save_score_threshold: float = 0.0     # Minimum score to save rollouts
eval_set_percentage: float = 0.1              # Evaluation set percentage

Format Control

supported_formats: Optional[List[AnswerFormat]] = None  # Filter to specific formats
ensure_equivalent_ratios: bool = False                  # Enable ratio enforcement
format_group_threshold: int = 50                        # Success threshold per format

📊 Dataset Types & Format Selection

Generic Datasets (dataset_type: "generic")

  • Available Formats: All basic formats (JSON, XML, natural language, brackets, etc.)
  • Use Case: General conversation, QA, instruction following, MCQA
  • Examples: OpenHermes, Alpaca, AcademicMCQA, general chat datasets

Math-Only Datasets (dataset_type: "math_only")

  • Available Formats: Generic formats + LaTeX math expressions
  • Additional Formats: $\boxed{}$, \begin{align}, $\text{}$, etc.
  • Use Case: Mathematical problem solving
  • Examples: GSM8K, MATH, mathematical reasoning datasets

Code-Only Datasets (dataset_type: "code_only")

  • Available Formats: Generic formats + programming-specific formats
  • Additional Formats: print(), console.log(), comments, return statements
  • Use Case: Code generation, programming problems
  • Examples: HumanEval, MBPP, coding datasets
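
A rough sketch of the dataset-type filtering described above: every type gets the generic formats, and `math_only` and `code_only` each unlock extras. The format names and the `formats_for` helper are illustrative assumptions, not the environment's identifiers.

```python
from typing import Dict, List

# Hypothetical format names, grouped by the dataset type that unlocks them.
GENERIC_FORMATS = ["json", "xml", "natural_language_answer"]
EXTRA_FORMATS: Dict[str, List[str]] = {
    "generic": [],
    "math_only": ["latex_boxed_math", "latex_align"],
    "code_only": ["python_print", "js_console_log"],
}


def formats_for(dataset_type: str) -> List[str]:
    """Return the formats allowed for a dataset type."""
    if dataset_type not in EXTRA_FORMATS:
        raise ValueError(f"unknown dataset_type: {dataset_type!r}")
    return GENERIC_FORMATS + EXTRA_FORMATS[dataset_type]


math_formats = formats_for("math_only")
```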

🎲 Dynamic Compositor System

The dynamic compositor creates complex structured responses by randomly combining components:

Component Categories

  • Analysis: problem_analysis, requirements_analysis, context_analysis
  • Reasoning: logical_reasoning, step_by_step, causal_reasoning
  • Planning: approach, methodology, strategy
  • Technical: implementation, code_structure, algorithm
  • Evaluation: validation, testing, verification
  • Output: final_answer, conclusion, summary

Output Formats

  • XML: <component_name>content</component_name>
  • JSON: {"component_name": "content"}
  • YAML: component_name: content
  • TOML: component_name = "content"

Example Dynamic Format

<problem_analysis>Understanding the requirements...</problem_analysis>
<logical_reasoning>Step by step analysis...</logical_reasoning>
<approach>My methodology will be...</approach>
<implementation>Here's the solution...</implementation>
<final_answer>42</final_answer>
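
The compositor idea can be sketched as follows for the XML case: pick 3-6 components, render them as tag pairs, and validate a response against the chosen components in order. The function names are hypothetical, and requiring the original order is an assumption of this sketch.

```python
import random
import re
from typing import Dict, List

# Component names taken from the categories listed above.
COMPONENTS = [
    "problem_analysis", "requirements_analysis", "context_analysis",
    "logical_reasoning", "step_by_step", "causal_reasoning",
    "approach", "methodology", "strategy",
    "implementation", "code_structure", "algorithm",
    "validation", "testing", "verification",
    "final_answer", "conclusion", "summary",
]


def compose_components(rng: random.Random) -> List[str]:
    """Randomly choose 3-6 distinct components."""
    return rng.sample(COMPONENTS, k=rng.randint(3, 6))


def render_xml(contents: Dict[str, str]) -> str:
    """Render each component as <name>text</name>, one per line."""
    return "\n".join(f"<{name}>{text}</{name}>" for name, text in contents.items())


def matches_xml(response: str, components: List[str]) -> bool:
    """Check that the components appear in the given order as XML tag pairs."""
    pattern = r"\s*".join(
        rf"<{re.escape(name)}>.*?</{re.escape(name)}>" for name in components
    )
    return re.fullmatch(rf"\s*{pattern}\s*", response, flags=re.DOTALL) is not None


rng = random.Random(1)
chosen = compose_components(rng)
document = render_xml({name: "text" for name in chosen})
```

Keeping the chosen component list next to the prompt is what lets the validator check the response against exactly the format that was requested.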

📈 Monitoring & Analytics

WandB Metrics

  • train/percent_correct: Overall format compliance rate
  • train/format_success_rate_{format}: Per-format success rates
  • train/format_usage_count_{format}: Usage frequency per format
  • train/equivalent_ratio_paused_formats: Number of paused formats
  • train/group_success_rate: Percentage of successful groups
  • train/failed_groups_count: Number of completely failed groups

Group-Level Statistics

Format: json_confidence | Group average score: 0.7500 | 12/16 correct (75.0%)
Format: latex_boxed_math | Group average score: 0.0000 | 0/16 correct (0.0%) (All failures!)
Format: natural_language_answer | Group average score: 1.0000 | 16/16 correct (100.0%) (Perfect group!)

Data Dumps

  • Regular Rollouts: answer_format_rollouts_{uuid}_{batch}.jsonl
  • Failed Rollouts: answer_format_failed_rollouts_{uuid}_{batch}.jsonl
  • Metadata: Format type, scores, conversation history, timestamps
  • Batch Size: 100 groups per file
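
For orientation, a dump writer following the naming pattern and batch size above might look like this. It is a sketch only: `dump_groups`, the record fields, and the zero-based batch index are assumptions rather than the environment's actual implementation.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List

GROUPS_PER_FILE = 100  # batch size stated above


def dump_groups(groups: List[Dict[str, Any]], out_dir: Path, run_id: str) -> List[Path]:
    """Write groups as JSONL, one group per line, 100 groups per file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for batch, start in enumerate(range(0, len(groups), GROUPS_PER_FILE)):
        path = out_dir / f"answer_format_rollouts_{run_id}_{batch}.jsonl"
        with path.open("w", encoding="utf-8") as handle:
            for group in groups[start:start + GROUPS_PER_FILE]:
                handle.write(json.dumps(group) + "\n")
        paths.append(path)
    return paths


# Example: 250 hypothetical groups are split across three files.
groups = [{"format": "json", "scores": [1.0, 0.0]} for _ in range(250)]
with tempfile.TemporaryDirectory() as tmp:
    written = dump_groups(groups, Path(tmp), run_id="demo")
    line_counts = [len(p.read_text(encoding="utf-8").splitlines()) for p in written]
```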

⚖️ Equivalent Ratio Enforcement

Optional system to ensure balanced evaluation across all formats:

How It Works

  1. Tracks successful groups per format
  2. Pauses formats that reach threshold (default: 50 successful groups)
  3. Continues evaluating other formats until they catch up
  4. Resumes paused formats when balance is restored
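
One way to read the four steps above is the tracker below. It is a hypothetical sketch: the class name is invented, and the rule that "balance is restored" once every format has met the current target is an assumption of this example.

```python
from typing import Dict, List


class RatioTracker:
    """Hypothetical tracker: pause formats that are ahead until the rest catch up."""

    def __init__(self, formats: List[str], threshold: int = 50) -> None:
        self.threshold = threshold
        self.target = threshold  # current success target per format
        self.successes: Dict[str, int] = {name: 0 for name in formats}

    def record_success(self, name: str) -> None:
        """Count one successful group for a format."""
        self.successes[name] += 1
        # Once every format has met the target, raise it so all formats resume.
        if all(count >= self.target for count in self.successes.values()):
            self.target += self.threshold

    def paused(self) -> List[str]:
        return [n for n, c in self.successes.items() if c >= self.target]

    def active(self) -> List[str]:
        return [n for n, c in self.successes.items() if c < self.target]


tracker = RatioTracker(["json", "xml"], threshold=2)
tracker.record_success("json")
tracker.record_success("json")
paused_midway = tracker.paused()   # json is ahead, so it is paused
tracker.record_success("xml")
tracker.record_success("xml")
paused_after = tracker.paused()    # xml caught up, so nothing is paused
```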

Configuration

config = AnswerFormatEnvConfig(
    ensure_equivalent_ratios=True,    # Enable the system
    format_group_threshold=50,        # Success threshold per format
)

Monitoring

status = env.get_equivalent_ratio_status()
print(f"Paused formats: {status['paused_formats']}")
print(f"Active formats: {status['active_formats']}")
print(f"Progress: {status['format_progress']}")

🔧 Advanced Usage

Custom Format Filtering

from atropos.environments.answer_format_environment import AnswerFormat, AnswerFormatEnvConfig

config = AnswerFormatEnvConfig(
    supported_formats=[
        AnswerFormat.JSON,
        AnswerFormat.XML,
        AnswerFormat.LATEX_BOXED,
        AnswerFormat.NATURAL_LANGUAGE_ANSWER
    ]
)

Evaluation Mode

# Run evaluation on the held-out set.
# evaluate() is awaited, so call it from inside an async function.
eval_score = await env.evaluate()
print(f"Evaluation format compliance: {eval_score}")

Supported Dataset Formats

OpenHermes Conversations

{
    "name": "teknium/OpenHermes-2.5",
    "prompt_field": "conversations",
    "answer_field": "conversations"
}

GSM8K Math Problems

{
    "name": "gsm8k",
    "prompt_field": "question",
    "answer_field": "answer"
}

AcademicMCQA Multiple Choice

{
    "name": "NousResearch/AcademicMCQA",
    "prompt_field": "prompt",
    "answer_field": "ground_truth",
    "metadata_fields": ["answer", "options"]
}

📝 Response Format Requirements

Thinking Tags

  • Required: Exactly one <think> opening and one </think> closing tag
  • Content: All reasoning must be inside thinking tags
  • Placement: Answer must appear after </think> in specified format
  • Validation: No additional thinking tags allowed after first </think>

Example Valid Response

<think>
Let me analyze this problem step by step.
First, I need to understand what's being asked...
The solution involves calculating...
</think>

{"answer": "42"}

Example Invalid Response

<think>Some reasoning</think>
{"answer": "42"}
<think>More reasoning</think>  // ❌ Additional thinking tags not allowed
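
The thinking-tag rules can be checked with a few lines of Python. This is a sketch, not the environment's validator: `split_thinking` is a hypothetical helper, and requiring the response to start with `<think>` is an assumption drawn from the rule that all reasoning lives inside the tags.

```python
import re
from typing import Optional


def split_thinking(response: str) -> Optional[str]:
    """Return the text after </think> if the tag rules hold, else None."""
    # Exactly one opening and one closing tag in the whole response.
    if response.count("<think>") != 1 or response.count("</think>") != 1:
        return None
    # The response must start with the thinking section (leading whitespace allowed).
    match = re.fullmatch(r"\s*<think>(.*?)</think>(.*)", response, flags=re.DOTALL)
    if match is None:
        return None
    answer = match.group(2).strip()
    return answer or None  # an empty answer section is also rejected


valid = '<think>\nLet me analyze this.\n</think>\n\n{"answer": "42"}'
invalid = '<think>Some reasoning</think>\n{"answer": "42"}\n<think>More reasoning</think>'
valid_answer = split_thinking(valid)
invalid_answer = split_thinking(invalid)
```

The returned text is what a format-specific validator would then check.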

🚨 Common Issues & Solutions

Format Validation Failures

  • Issue: Response doesn't match expected format
  • Solution: Check regex patterns and ensure exact format compliance
  • Debug: Enable debug_logging=True for detailed validation info

Thinking Tag Violations

  • Issue: Multiple thinking sections or missing tags
  • Solution: Ensure exactly one <think></think> section per response
  • Debug: Check thinking tag validation logs

Dataset Loading Errors

  • Issue: Dataset not found or field missing
  • Solution: Verify dataset name, split, and field names
  • Debug: Check dataset configuration and field mappings

Memory Issues with Large Datasets

  • Issue: Out of memory with large sample sizes
  • Solution: Reduce sample_size or use streaming datasets
  • Debug: Monitor memory usage and adjust batch sizes

🔍 Debugging & Development

Enable Debug Logging

config = AnswerFormatEnvConfig(debug_logging=True)

Check Format Status

# Get equivalent ratio status
status = env.get_equivalent_ratio_status()

# Check format success rates
for format_name, success_rate in env.format_success_rates.items():
    print(f"{format_name}: {success_rate:.2%}")

Analyze Failed Rollouts

# Failed rollouts are automatically saved when dump_failed_rollouts=True
# Check the datadumps directory for analysis files

📚 Technical Details

Scoring System

  • Format Compliance: 1.0 for correct format, 0.0 for incorrect
  • No Answer Accuracy: Content correctness is not evaluated
  • Group Scoring: Average of all rollouts in a group
  • Success Threshold: Groups with an average score above 0.0 are considered successful
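
The scoring rules above reduce to a few lines of arithmetic. The sketch below also reproduces the style of the group-level log lines shown under Monitoring; the function names are hypothetical.

```python
from typing import List


def is_successful(scores: List[float]) -> bool:
    """A group counts as successful when its average score is above 0.0."""
    return sum(scores) / len(scores) > 0.0


def summarize_group(format_name: str, scores: List[float]) -> str:
    """Build a one-line group summary from binary format scores."""
    if not scores:
        raise ValueError("a group needs at least one rollout")
    average = sum(scores) / len(scores)
    correct = sum(1 for s in scores if s == 1.0)
    line = (
        f"Format: {format_name} | Group average score: {average:.4f} | "
        f"{correct}/{len(scores)} correct ({100 * correct / len(scores):.1f}%)"
    )
    if correct == 0:
        line += " (All failures!)"
    elif correct == len(scores):
        line += " (Perfect group!)"
    return line


line = summarize_group("json_confidence", [1.0] * 12 + [0.0] * 4)
```

Because scores are binary, the group average is simply the fraction of compliant rollouts, so a single compliant rollout is enough to make a group successful.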

Format Validation

  • Regex Patterns: Each format has specific validation patterns
  • Exact Matching: Strict compliance required for scoring
  • Content Extraction: Validated content is extracted for consistency
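
As an illustration of pattern-based validation with content extraction, the sketch below scores an answer section against one regex per format. The patterns and format names are illustrative assumptions; the environment's real patterns may differ.

```python
import re
from typing import Dict, Optional, Tuple

# Illustrative patterns only; each captures the answer content in group 1.
PATTERNS: Dict[str, re.Pattern] = {
    "xml": re.compile(r"<answer>(.*?)</answer>", re.DOTALL),
    "latex_boxed": re.compile(r"\\boxed\{(.*)\}", re.DOTALL),
    "natural_language_answer": re.compile(r"The answer is: (.+)", re.DOTALL),
}


def score_format(format_name: str, answer_text: str) -> Tuple[float, Optional[str]]:
    """Return (1.0, extracted content) on an exact match, else (0.0, None)."""
    match = PATTERNS[format_name].fullmatch(answer_text.strip())
    if match is None:
        return 0.0, None
    return 1.0, match.group(1).strip()


xml_result = score_format("xml", "<answer>42</answer>")
prefixed_result = score_format("xml", "Answer: <answer>42</answer>")
```

Using `fullmatch` rather than `search` is what makes the check strict: extra text around an otherwise valid answer scores 0.0.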

Dynamic Format Generation

  • Component Selection: Random selection of 3-6 components
  • Format Templates: XML, JSON, YAML, TOML output formats
  • Validation Storage: Components stored for precise validation

Special Dataset Handling

  • GSM8K: Extracts numerical answers from #### separated format
  • AcademicMCQA: Uses ground truth letters (A, B, C, D) as answers
  • OpenHermes: Extracts from conversation format with role-based parsing
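
For the GSM8K case, the extraction step can be sketched as below. GSM8K solutions end with a line of the form `#### <number>`; the helper name and the removal of thousands separators are assumptions of this example.

```python
from typing import Optional


def extract_gsm8k_answer(solution: str) -> Optional[str]:
    """Return the text after the final '####' marker, or None if it is missing."""
    if "####" not in solution:
        return None
    final = solution.rsplit("####", 1)[1].strip()
    # Drop thousands separators so "1,000" and "1000" compare equal.
    return final.replace(",", "") or None


solution = "Natalia sold 48/2 = 24 clips in May.\n48 + 24 = 72 clips in total.\n#### 72"
answer = extract_gsm8k_answer(solution)
```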

🤝 Contributing

Adding New Formats

  1. Add enum value to AnswerFormat
  2. Add system prompt instruction
  3. Add validation regex pattern
  4. Add content extraction logic
  5. Test with sample responses

Adding New Dataset Types

  1. Define dataset type in configuration
  2. Add format filtering logic
  3. Update format selection methods
  4. Test with representative datasets

Adding New Datasets

  1. Add special handling in setup() method
  2. Define field mappings in configuration
  3. Add metadata extraction logic
  4. Test dataset loading and processing

📄 License

This environment is part of the Atropos training framework. See the main repository for license information.


Need help? Enable debug_logging=True, check the debug logs, or review the monitoring metrics described above.