ReasoningGym Environment
A reinforcement learning environment for training language models on diverse reasoning tasks using the reasoning-gym library.
Overview
The ReasoningGym environment provides access to 100+ reasoning tasks spanning mathematics, logic, programming, and more. It supports:
- Diverse Task Types: Arithmetic, algebra, logic puzzles, programming challenges, and more
- Advanced Complexity Control: Three modes for managing task difficulty (None, Random, Curriculum)
- Adaptive Curriculum Learning: Automatic difficulty adjustment based on model performance
- Strict Answer Format Enforcement: Models must use `<answer>` tags or receive a 0 score
- Dual-Format Scoring: Tries both raw answers and tagged answers, using the higher score
- Data Collection: Optional rollout dumping for successful and failed attempts
- Comprehensive Logging: Detailed progress tracking and debugging information
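The answer-format enforcement and dual-format scoring described above might look roughly like the sketch below. `extract_answer` and `dual_format_score` are illustrative names, not the environment's actual API:

```python
import re
from typing import Callable, Optional

def extract_answer(response: str) -> Optional[str]:
    """Return the contents of the last <answer>...</answer> block, or None."""
    matches = re.findall(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return matches[-1].strip() if matches else None

def dual_format_score(response: str, verify: Callable[[str], float]) -> float:
    """Score both the raw response and the tagged answer; keep the higher score.
    `verify` is a task-specific callable returning a float in [0, 1]."""
    tagged = extract_answer(response)
    if tagged is None:
        return 0.0  # strict format enforcement: no <answer> tag, no credit
    return max(verify(response), verify(tagged))
```

Trying both formats protects well-formatted responses from verifiers that only accept the bare answer string.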
Features
Task Diversity
- 102 tasks from reasoning-gym with full complexity control coverage
- Automatic task discovery from the reasoning-gym registry
- Fallback to comprehensive task list if discovery fails
- Categories include: Arithmetic, Games, Logic, Algorithmic, Cognition, Algebra, Geometry, Code, Graph, ARC, GSM Symbolic, and more
Complexity Control System
Three Complexity Modes
- None (Default): Uses reasoning-gym's default parameters for all tasks
- Random: Randomizes complexity for each problem (0.0-1.0 scale)
- Curriculum: Adaptive difficulty that adjusts based on model performance
Curriculum Learning Features
- Per-task tracking: Each task has independent complexity management
- Target accuracy: Maintains configurable target accuracy (default 70%)
- Immediate adjustment: Complexity updates after each group is scored
- Stability detection: Considers performance variance for robust adjustments
- Fast-track adjustments: Special handling for very high/low accuracy
- Comprehensive monitoring: Detailed curriculum statistics for wandb logging
Task Coverage
All 102 reasoning-gym tasks have complexity mappings with realistic parameter ranges:
Arithmetic Tasks (15+ tasks):
`basic_arithmetic`, `leg_counting`, `decimal_arithmetic`, `complex_arithmetic`, `fraction_simplification`, `bitwise_arithmetic`, `chain_sum`, `count_bits`, `gcd`, `lcm`, `prime_factorization`, `power_function`, `products`, `time_intervals`, `calendar_arithmetic`, `dice`, `number_format`
Games (15+ tasks):
`n_queens`, `sudoku`, `mini_sudoku`, `futoshiki`, `tower_of_hanoi`, `maze`, `sokoban`, `rush_hour`, `puzzle24`, `countdown`, `tsumego`, `knight_swap`, `emoji_mystery`, `mahjong_puzzle`, `boxnet`
Logic (8+ tasks):
`self_reference`, `propositional_logic`, `knights_knaves`, `syllogism`, `circuit_logic`, `zebra_puzzles`, `aiw`
Algorithmic (30+ tasks):
`graph_color`, `shortest_path`, `largest_island`, `course_schedule`, `string_manipulation`, `palindrome_generation`, `word_ladder`, `binary_matrix`, `spiral_matrix`, `number_sorting`, and many more
And all other categories: Cognition, Algebra, Geometry, Code, Graph, ARC, GSM Symbolic, Induction
Scoring System
- Binary Tasks: 0.0 or 1.0 (most tasks)
- Partial Credit: Some tasks, like GSM Symbolic, award 0.01 for answers that are valid numbers but incorrect
- Continuous Scoring: Word Ladder and Sentence Reordering use percentage-based scoring
- Length Penalty: Applied to overly long responses when every response in the group is correct
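The three scoring shapes could look roughly like this (illustrative functions, not the environment's exact implementations):

```python
def binary_score(answer: str, expected: str) -> float:
    """Most tasks: exact match gives 1.0, anything else gives 0.0."""
    return 1.0 if answer.strip() == expected.strip() else 0.0

def gsm_partial_score(answer: str, expected: str) -> float:
    """GSM-Symbolic-style: small partial credit for a wrong-but-valid number."""
    try:
        return 1.0 if float(answer) == float(expected) else 0.01
    except ValueError:
        return 0.0  # not even a number

def overlap_score(answer: str, expected: str) -> float:
    """Percentage-style credit, e.g. fraction of positions that match."""
    words_a, words_e = answer.split(), expected.split()
    if not words_e:
        return 0.0
    hits = sum(a == e for a, e in zip(words_a, words_e))
    return hits / len(words_e)
```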
Data Collection
- Successful Rollouts: Save groups with scores above configurable threshold
- Failed Rollouts: Save completely failed groups (all 0 scores) for debugging
- Progress Tracking: Shows buffer progress toward save thresholds
- JSONL Format: Easy to process saved data
Configuration
Key Parameters
class ReasoningGymEnvConfig(BaseEnvConfig):
    # Data collection
    dump_rollouts: bool = False  # Save successful rollouts
    dump_failed_rollouts: bool = False  # Save failed rollouts for debugging
    rollout_save_score_threshold: float = 0.7  # Minimum score to save group

    # Complexity control
    complexity_mode: Optional[Literal["curriculum", "random"]] = None
    curriculum_target_accuracy: float = 0.7  # Target accuracy for curriculum mode

    # Evaluation
    num_eval_samples_per_task: int = 5  # Samples per task for evaluation
    eval_seed: int = 123  # Fixed seed for reproducible evaluation

    # Logging and debugging
    debug_logging: bool = False  # Enable verbose logging
    suppress_base_env_logs: bool = True  # Hide base environment logs
    seed: int = 42  # Random seed for reproducibility
Example Configurations
Basic Training (Default Complexity)
env_config = ReasoningGymEnvConfig(
    tokenizer_name="NousResearch/DeepHermes-3-Llama-3-8B-Preview",
    group_size=16,
    max_token_length=1024 * 16,
    complexity_mode=None,  # Use default parameters
    dump_rollouts=True,
)
Random Complexity Training
env_config = ReasoningGymEnvConfig(
    tokenizer_name="NousResearch/DeepHermes-3-Llama-3-8B-Preview",
    group_size=16,
    max_token_length=1024 * 16,
    complexity_mode="random",  # Randomize difficulty
    dump_rollouts=True,
    debug_logging=True,
)
Curriculum Learning
env_config = ReasoningGymEnvConfig(
    tokenizer_name="NousResearch/DeepHermes-3-Llama-3-8B-Preview",
    group_size=16,
    max_token_length=1024 * 16,
    complexity_mode="curriculum",  # Adaptive difficulty
    curriculum_target_accuracy=0.7,  # Maintain 70% accuracy
    dump_rollouts=True,
    debug_logging=True,
)
Setup
Prerequisites
- reasoning-gym submodule: Clone the reasoning-gym repository as a submodule:
  cd atropos/environments/reasoning_gym_environment/
  git submodule add https://github.com/reasoning-gym/reasoning-gym.git reasoning-gym
- Dependencies: Install requirements:
  pip install -r requirements.txt
Directory Structure
reasoning_gym_environment/
├── reasoning_gym_environment.py # Main environment code
├── reasoning-gym/ # Git submodule
├── data_dumps/ # Generated rollout data (created automatically)
├── requirements.txt # Dependencies
└── README.md # This file
Usage
Basic Training
from atropos.environments.reasoning_gym_environment import ReasoningGymEnv
# Initialize environment
env_config, server_configs = ReasoningGymEnv.config_init()
env = ReasoningGymEnv(env_config, server_configs)
# Setup and run (await requires an async context)
await env.setup()
# Training loop handled by the atropos framework
Command Line
python reasoning_gym_environment.py
Monitoring Curriculum Learning
When using curriculum mode, the environment logs detailed statistics:
# Get curriculum statistics
stats = env.get_curriculum_stats()
print(f"Total tasks tracked: {stats['total_tasks_tracked']}")
print(f"Tasks with adjustments: {stats['tasks_with_adjustments']}")
print(f"Average complexity: {stats['avg_complexity']:.2f}")
Curriculum metrics are automatically logged to wandb:
- curriculum/total_tasks_tracked
- curriculum/tasks_with_adjustments
- curriculum/avg_complexity
- curriculum/avg_recent_accuracy
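Assembling those metrics might look like the sketch below; the shape of the `stats` dict is an assumption mirroring `get_curriculum_stats()`, and the resulting flat dict would be passed to `wandb.log(...)`:

```python
def build_curriculum_metrics(stats: dict) -> dict:
    """Flatten curriculum statistics into wandb-style metric names.
    The keys of `stats` are assumed from the get_curriculum_stats() examples."""
    return {
        "curriculum/total_tasks_tracked": stats["total_tasks_tracked"],
        "curriculum/tasks_with_adjustments": stats["tasks_with_adjustments"],
        "curriculum/avg_complexity": stats["avg_complexity"],
        "curriculum/avg_recent_accuracy": stats["avg_recent_accuracy"],
    }

# wandb.log(build_curriculum_metrics(env.get_curriculum_stats()))
```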
Complexity Control Details
Parameter Mappings
Each task has carefully crafted complexity parameter mappings based on examination of reasoning-gym source code:
Example: Basic Arithmetic
"basic_arithmetic": {
    "min_terms": int(2 + complexity_level * 4),  # 2-6 terms
    "max_terms": int(2 + complexity_level * 4),
    "min_digits": int(1 + complexity_level * 3),  # 1-4 digits
    "max_digits": int(1 + complexity_level * 3),
    "allow_parentheses": complexity_level > 0.3,
    "allow_negation": complexity_level > 0.5,
}
Example: N-Queens
"n_queens": {
    "n": int(4 + complexity_level * 8),  # 4-12 board size
    "min_remove": int(1 + complexity_level * 6),  # 1-7 pieces removed
    "max_remove": int(1 + complexity_level * 6),
}
Curriculum Algorithm
The curriculum system uses the following logic:
- Initialization: All tasks start at 30% complexity
- Tracking: Each task maintains independent performance history (last 10 groups)
- Adjustment Trigger: Requires ≥3 groups before making adjustments
- Target Accuracy: Default 70%, configurable
- Adjustment Logic:
- If accuracy > target + 5%: Increase complexity by 5%
- If accuracy < target - 5%: Decrease complexity by 5%
- Special fast-track for very high (>90%) or very low (<30%) accuracy
- Stability: Considers performance variance to avoid erratic adjustments
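The adjustment rules above can be condensed into a single update function. This is a minimal sketch: the step sizes and fast-track behavior are illustrative approximations of the described logic, not the exact source values:

```python
from statistics import mean

def adjust_complexity(complexity: float, accuracies: list,
                      target: float = 0.7, band: float = 0.05,
                      step: float = 0.05, fast_step: float = 0.15) -> float:
    """One curriculum update step for a single task."""
    if len(accuracies) < 3:           # require >=3 scored groups before adjusting
        return complexity
    acc = mean(accuracies[-10:])      # only the last 10 groups are tracked
    if acc > 0.9:                     # fast-track: far too easy
        complexity += fast_step
    elif acc < 0.3:                   # fast-track: far too hard
        complexity -= fast_step
    elif acc > target + band:         # gently raise difficulty
        complexity += step
    elif acc < target - band:         # gently lower difficulty
        complexity -= step
    return min(1.0, max(0.0, complexity))  # clamp to the 0.0-1.0 scale
```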
Complexity Ranges
All parameter ranges are based on actual reasoning-gym defaults with reasonable variations:
- Integer parameters: Properly converted with `int()`
- Float parameters: Only used where appropriate (e.g., edge probabilities)
- Boolean parameters: Threshold-based activation
- Reasonable bounds: No extreme values that would break tasks
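These conventions can be illustrated with a small hypothetical mapping helper (`map_param` is not part of the environment's API, just a sketch of the pattern the mappings follow):

```python
def map_param(kind: str, complexity: float, lo: float, hi: float,
              threshold: float = 0.5):
    """Map a 0.0-1.0 complexity level onto one task parameter."""
    complexity = min(1.0, max(0.0, complexity))   # keep within sane bounds
    value = lo + complexity * (hi - lo)           # linear interpolation
    if kind == "int":
        return int(value)                         # integer params converted with int()
    if kind == "bool":
        return complexity > threshold             # threshold-based activation
    return value                                  # float params used as-is
```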
System Prompt
The environment uses a structured reasoning prompt that encourages models to:
- Use `<think>` tags for internal reasoning
- Provide final answers in `<answer>` tags
- Follow strict format requirements
Example model response:
<think>
This is a math problem. Let me work through it step by step.
2 + 3 = 5
</think>
Looking at this problem, I need to add 2 and 3.
<answer>5</answer>
Data Output
Successful Rollouts
Saved to data_dumps/reasoning_gym_environment_rollouts_{uuid}_{batch}.jsonl:
{
  "item_id": "gsm_symbolic",
  "rollouts": [
    {
      "conversation": [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "What is 2 + 3?"},
        {"role": "assistant", "content": "<think>2 + 3 = 5</think>\n<answer>5</answer>"}
      ],
      "score": 1.0
    }
  ]
}
Failed Rollouts
Saved to data_dumps/reasoning_gym_environment_FAILED_rollouts_{uuid}_{batch}.jsonl with the same format, but all scores are 0.0.
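Since each line of a dump file is one JSON object, the saved data is easy to consume, for example when filtering high-scoring conversations for fine-tuning. `load_rollouts` below is an illustrative helper, not part of the environment:

```python
import json

def load_rollouts(path: str):
    """Yield saved rollout groups from a dumped JSONL file (one JSON object per line)."""
    with open(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# e.g. keep only perfect-score rollouts:
# good = [r for group in load_rollouts("data_dumps/some_dump.jsonl")
#         for r in group["rollouts"] if r["score"] == 1.0]
```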
Logging
The environment provides comprehensive logging:
Standard Logging
- Setup: Task discovery and initialization
- Training: Group scores, task selection, progress tracking
- Data Dumping: Save progress and file creation
- Format Violations: When models don't follow answer tag requirements
Curriculum Logging
- Complexity Adjustments: Real-time difficulty changes per task
- Performance Tracking: Accuracy trends and stability metrics
- Target Achievement: When tasks reach optimal difficulty zones
Debug Mode
Enable with `debug_logging=True` for detailed information:
- Answer extraction attempts
- Scoring method comparisons
- Format violation details
- Task selection patterns
- Complexity parameter usage
Task Examples
Mathematics
- GSM Symbolic: Grade school math with symbolic reasoning
- Basic Arithmetic: Addition, subtraction, multiplication, division with configurable complexity
- Algebra: Linear equations and polynomial manipulation
Logic
- Sudoku: Classic number placement puzzles with variable difficulty
- Propositional Logic: Boolean reasoning tasks with adjustable clause counts
- Knights and Knaves: Logic puzzles with configurable people and statements
Programming
- ARC: Abstract reasoning corpus visual patterns
- Code Generation: Simple programming challenges
- Algorithm Design: Sorting, searching, and optimization with scalable complexity
Games
- N-Queens: Chess queen placement with variable board sizes
- Tower of Hanoi: Disk movement puzzles with adjustable disk counts
- Rush Hour: Traffic jam puzzles with configurable car counts
Troubleshooting
Common Issues
- No tasks discovered: Ensure reasoning-gym submodule is properly initialized
- Import errors: Check that requirements.txt dependencies are installed
- No rollouts saved: Verify `dump_rollouts=True` and that scores exceed the threshold
- Format violations: Models not using `<answer>` tags receive 0 scores
- Curriculum not adjusting: Ensure tasks get enough groups (≥3) for adjustments
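The first two issues can be checked with a quick diagnostic sketch (`check_setup` is a hypothetical helper, not part of the environment):

```python
import importlib.util
from pathlib import Path

def check_setup(env_dir: str = ".") -> list:
    """Report common setup problems for this environment directory."""
    problems = []
    submodule = Path(env_dir) / "reasoning-gym"
    if not any(submodule.glob("*")):  # missing or empty submodule checkout
        problems.append("reasoning-gym submodule missing or empty: "
                        "run `git submodule update --init --recursive`")
    if importlib.util.find_spec("reasoning_gym") is None:
        problems.append("reasoning_gym package not importable: "
                        "pip install -r requirements.txt")
    return problems
```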
Debug Mode
Enable debug logging for detailed information:
env_config.debug_logging = True
This shows:
- Answer extraction attempts
- Scoring method comparisons
- Format violation details
- Task selection patterns
- Complexity parameter mappings
- Curriculum adjustment decisions
Curriculum Monitoring
Monitor curriculum effectiveness:
# Check curriculum statistics
stats = env.get_curriculum_stats()
for task, details in stats['task_details'].items():
    if details['adjustable']:
        print(f"{task}: complexity={details['complexity']:.2f}, "
              f"accuracy={details['recent_accuracy']:.2f}")
Performance Considerations
Complexity Modes
- None: Fastest, no overhead
- Random: Minimal overhead, good for exploration
- Curriculum: Slight overhead for tracking, optimal for learning
Memory Usage
- Curriculum mode stores performance history (last 10 groups per task)
- Typical memory overhead: <1MB for all 102 tasks
Convergence
- Curriculum typically converges to target accuracy within 50-100 groups per task
- Fast-track adjustments help with extreme performance cases
- Stability detection prevents oscillation around target
Advanced Usage
Custom Complexity Mappings
To add complexity control for new tasks:
def _get_complexity_params_for_task(self, task_name: str, complexity_level: float):
    # Add your custom task mapping
    if task_name == "my_custom_task":
        return {
            "difficulty": int(1 + complexity_level * 9),  # 1-10
            "size": int(5 + complexity_level * 15),  # 5-20
        }
    # ... existing mappings
Curriculum Customization
Adjust curriculum parameters:
# More aggressive curriculum
env_config.curriculum_target_accuracy = 0.8 # Higher target
# In _adjust_task_complexity, modify:
adjustment_threshold = 0.03 # Smaller threshold for more frequent adjustments
complexity_step = 0.1 # Larger steps for faster adaptation
Integration with External Systems
The environment supports integration with external curriculum systems:
# Override complexity for specific tasks
env.task_complexity_levels["basic_arithmetic"] = 0.8 # Set to 80% complexity
env.task_complexity_levels["n_queens"] = 0.3 # Set to 30% complexity