6.6 KiB
ReasoningGym Environment
A reinforcement learning environment for training language models on diverse reasoning tasks using the reasoning-gym library.
Overview
The ReasoningGym environment provides access to 100+ reasoning tasks spanning mathematics, logic, programming, and more. It supports:
- Diverse Task Types: Arithmetic, algebra, logic puzzles, programming challenges, and more
- Strict Answer Format Enforcement: Models must use
<answer>tags or receive 0 score - Dual-Format Scoring: Tries both raw answers and tagged answers, using the higher score
- Data Collection: Optional rollout dumping for successful and failed attempts
- Comprehensive Logging: Detailed progress tracking and debugging information
Features
Task Diversity
- 100+ tasks from reasoning-gym including GSM Symbolic, ARC, Sudoku, and more
- Automatic task discovery from the reasoning-gym registry
- Fallback to comprehensive task list if discovery fails
Scoring System
- Binary Tasks: 0.0 or 1.0 (most tasks)
- Partial Credit: Some tasks like GSM Symbolic give 0.01 for wrong but valid numbers
- Continuous Scoring: Word Ladder, Sentence Reordering use percentage-based scoring
- Length Penalty: Applied to overly long responses when all are correct
Data Collection
- Successful Rollouts: Save groups with scores above configurable threshold
- Failed Rollouts: Save completely failed groups (all 0 scores) for debugging
- Progress Tracking: Shows buffer progress toward save thresholds
- JSONL Format: Easy to process saved data
Configuration
Key Parameters
class ReasoningGymEnvConfig(BaseEnvConfig):
dump_rollouts: bool = False # Save successful rollouts
dump_failed_rollouts: bool = False # Save failed rollouts for debugging
rollout_save_score_threshold: float = 0.7 # Minimum score to save group
debug_logging: bool = False # Enable verbose logging
suppress_base_env_logs: bool = True # Hide base environment logs
seed: int = 42 # Random seed for reproducibility
Example Configuration
env_config = ReasoningGymEnvConfig(
tokenizer_name="NousResearch/DeepHermes-3-Llama-3-8B-Preview",
group_size=16,
max_token_length=1024 * 16,
dump_rollouts=True,
dump_failed_rollouts=True,
rollout_save_score_threshold=0.7,
debug_logging=True,
)
Setup
Prerequisites
-
reasoning-gym submodule: Clone the reasoning-gym repository as a submodule:
cd atropos/environments/reasoning_gym_environment/ git submodule add https://github.com/reasoning-gym/reasoning-gym.git reasoning-gym -
Dependencies: Install requirements:
pip install -r requirements.txt
Directory Structure
reasoning_gym_environment/
├── reasoning_gym_environment.py # Main environment code
├── reasoning-gym/ # Git submodule
├── data_dumps/ # Generated rollout data (created automatically)
├── requirements.txt # Dependencies
└── README.md # This file
Usage
Basic Training
from atropos.environments.reasoning_gym_environment import ReasoningGymEnv
# Initialize environment
env_config, server_configs = ReasoningGymEnv.config_init()
env = ReasoningGymEnv(env_config, server_configs)
# Setup and run
await env.setup()
# Training loop handled by atropos framework
Command Line
python reasoning_gym_environment.py
System Prompt
The environment uses a structured reasoning prompt that encourages models to:
- Use
<think>tags for internal reasoning - Provide final answers in
<answer>tags - Follow strict format requirements
Example model response:
<think>
This is a math problem. Let me work through it step by step.
2 + 3 = 5
</think>
Looking at this problem, I need to add 2 and 3.
<answer>5</answer>
Data Output
Successful Rollouts
Saved to data_dumps/reasoning_gym_environment_rollouts_{uuid}_{batch}.jsonl:
{
"item_id": "gsm_symbolic",
"rollouts": [
{
"conversation": [
{"role": "system", "content": "..."},
{"role": "user", "content": "What is 2 + 3?"},
{"role": "assistant", "content": "<think>2 + 3 = 5</think>\n<answer>5</answer>"}
],
"score": 1.0
}
]
}
Failed Rollouts
Saved to data_dumps/reasoning_gym_environment_FAILED_rollouts_{uuid}_{batch}.jsonl with same format but all scores are 0.0.
Logging
The environment provides comprehensive logging:
- Setup: Task discovery and initialization
- Training: Group scores, task selection, progress tracking
- Data Dumping: Save progress and file creation
- Format Violations: When models don't follow answer tag requirements
- Debug Mode: Detailed scoring and extraction information
Task Examples
Mathematics
- GSM Symbolic: Grade school math with symbolic reasoning
- Basic Arithmetic: Addition, subtraction, multiplication, division
- Algebra: Linear equations and polynomial manipulation
Logic
- Sudoku: Classic number placement puzzles
- Propositional Logic: Boolean reasoning tasks
- Knights and Knaves: Logic puzzles with truth-tellers and liars
Programming
- ARC: Abstract reasoning corpus visual patterns
- Code Generation: Simple programming challenges
- Algorithm Design: Sorting, searching, and optimization
Troubleshooting
Common Issues
- No tasks discovered: Ensure reasoning-gym submodule is properly initialized
- Import errors: Check that requirements.txt dependencies are installed
- No rollouts saved: Verify
dump_rollouts=Trueand scores exceed threshold - Format violations: Models not using
<answer>tags receive 0 scores
Debug Mode
Enable debug logging for detailed information:
env_config.debug_logging = True
This shows:
- Answer extraction attempts
- Scoring method comparisons
- Format violation details
- Task selection patterns
Performance Notes
- Task Selection: Random selection ensures diverse training
- Evaluation: Fixed test set with deterministic seed for reproducible results
- Memory Usage: Buffers are cleared after saving to prevent memory leaks
- Scoring Efficiency: Dual-format scoring tries both methods and uses higher score
Contributing
When adding new features:
- Maintain backward compatibility with existing configs
- Add appropriate logging for debugging
- Update this README with new configuration options
- Test with both successful and failed rollout scenarios