# ReasoningGym Environment

A reinforcement learning environment for training language models on diverse reasoning tasks using the [reasoning-gym](https://github.com/reasoning-gym/reasoning-gym) library.

## Overview

The ReasoningGym environment provides access to 100+ reasoning tasks spanning mathematics, logic, programming, and more. It supports:

- **Diverse Task Types**: Arithmetic, algebra, logic puzzles, programming challenges, and more
- **Strict Answer Format Enforcement**: Models must use `<answer>` tags or receive a score of 0
- **Dual-Format Scoring**: Tries both raw answers and tagged answers, using the higher score
- **Data Collection**: Optional rollout dumping for successful and failed attempts
- **Comprehensive Logging**: Detailed progress tracking and debugging information

## Features

### Task Diversity

- 100+ tasks from reasoning-gym including GSM Symbolic, ARC, Sudoku, and more
- Automatic task discovery from the reasoning-gym registry
- Fallback to a comprehensive task list if discovery fails

### Scoring System

- **Binary Tasks**: 0.0 or 1.0 (most tasks)
- **Partial Credit**: Some tasks, such as GSM Symbolic, give 0.01 for a wrong answer that is still a valid number
- **Continuous Scoring**: Word Ladder and Sentence Reordering use percentage-based scoring
- **Length Penalty**: Applied to overly long responses when all responses are correct

### Data Collection

- **Successful Rollouts**: Save groups with scores above a configurable threshold
- **Failed Rollouts**: Save completely failed groups (all scores 0) for debugging
- **Progress Tracking**: Shows buffer progress toward save thresholds
- **JSONL Format**: Saved data is easy to process

## Configuration

### Key Parameters

```python
class ReasoningGymEnvConfig(BaseEnvConfig):
    dump_rollouts: bool = False                # Save successful rollouts
    dump_failed_rollouts: bool = False         # Save failed rollouts for debugging
    rollout_save_score_threshold: float = 0.7  # Minimum score to save group
    debug_logging: bool = False                # Enable verbose logging
    suppress_base_env_logs: bool = True        # Hide base environment logs
    seed: int = 42                             # Random seed for reproducibility
```

### Example Configuration

```python
env_config = ReasoningGymEnvConfig(
    tokenizer_name="NousResearch/DeepHermes-3-Llama-3-8B-Preview",
    group_size=16,
    max_token_length=1024 * 16,
    dump_rollouts=True,
    dump_failed_rollouts=True,
    rollout_save_score_threshold=0.7,
    debug_logging=True,
)
```

## Setup

### Prerequisites

1. **reasoning-gym submodule**: Clone the reasoning-gym repository as a submodule:

   ```bash
   cd atropos/environments/reasoning_gym_environment/
   git submodule add https://github.com/reasoning-gym/reasoning-gym.git reasoning-gym
   ```

2. **Dependencies**: Install requirements:

   ```bash
   pip install -r requirements.txt
   ```

### Directory Structure

```
reasoning_gym_environment/
├── reasoning_gym_environment.py  # Main environment code
├── reasoning-gym/                # Git submodule
├── data_dumps/                   # Generated rollout data (created automatically)
├── requirements.txt              # Dependencies
└── README.md                     # This file
```

## Usage

### Basic Training

```python
import asyncio

from atropos.environments.reasoning_gym_environment import ReasoningGymEnv


async def main():
    # Initialize environment
    env_config, server_configs = ReasoningGymEnv.config_init()
    env = ReasoningGymEnv(env_config, server_configs)

    # Setup and run
    await env.setup()
    # Training loop handled by atropos framework


asyncio.run(main())
```

### Command Line

```bash
python reasoning_gym_environment.py
```

## System Prompt

The environment uses a structured reasoning prompt that encourages models to:

1. Use `<think>` tags for internal reasoning
2. Provide final answers in `<answer>` tags
3. Follow strict format requirements

Example model response:

```
<think>
This is a math problem. Let me work through it step by step.
2 + 3 = 5
</think>

Looking at this problem, I need to add 2 and 3.

<answer>5</answer>
```

## Data Output

### Successful Rollouts

Saved to `data_dumps/reasoning_gym_environment_rollouts_{uuid}_{batch}.jsonl`:

```json
{
  "item_id": "gsm_symbolic",
  "rollouts": [
    {
      "conversation": [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "What is 2 + 3?"},
        {"role": "assistant", "content": "<think>2 + 3 = 5</think>\n<answer>5</answer>"}
      ],
      "score": 1.0
    }
  ]
}
```

### Failed Rollouts

Saved to `data_dumps/reasoning_gym_environment_FAILED_rollouts_{uuid}_{batch}.jsonl` in the same format, but with all scores equal to 0.0.

## Logging

The environment provides comprehensive logging:

- **Setup**: Task discovery and initialization
- **Training**: Group scores, task selection, progress tracking
- **Data Dumping**: Save progress and file creation
- **Format Violations**: When models don't follow answer tag requirements
- **Debug Mode**: Detailed scoring and extraction information

## Task Examples

### Mathematics

- **GSM Symbolic**: Grade school math with symbolic reasoning
- **Basic Arithmetic**: Addition, subtraction, multiplication, division
- **Algebra**: Linear equations and polynomial manipulation

### Logic

- **Sudoku**: Classic number placement puzzles
- **Propositional Logic**: Boolean reasoning tasks
- **Knights and Knaves**: Logic puzzles with truth-tellers and liars

### Programming

- **ARC**: Abstract reasoning corpus visual patterns
- **Code Generation**: Simple programming challenges
- **Algorithm Design**: Sorting, searching, and optimization

## Troubleshooting

### Common Issues

1. **No tasks discovered**: Ensure the reasoning-gym submodule is properly initialized
2. **Import errors**: Check that the dependencies in requirements.txt are installed
3. **No rollouts saved**: Verify `dump_rollouts=True` and that scores exceed the threshold
4. **Format violations**: Models that do not use `<answer>` tags receive a score of 0

### Debug Mode

Enable debug logging for detailed information:

```python
env_config.debug_logging = True
```

This shows:

- Answer extraction attempts
- Scoring method comparisons
- Format violation details
- Task selection patterns

## Performance Notes

- **Task Selection**: Random selection ensures diverse training
- **Evaluation**: Fixed test set with deterministic seed for reproducible results
- **Memory Usage**: Buffers are cleared after saving to prevent memory leaks
- **Scoring Efficiency**: Dual-format scoring tries both methods and uses the higher score

## Contributing

When adding new features:

1. Maintain backward compatibility with existing configs
2. Add appropriate logging for debugging
3. Update this README with new configuration options
4. Test with both successful and failed rollout scenarios
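
## Scoring Sketch

The format enforcement and dual-format scoring described under Overview and Scoring System can be illustrated with a small standalone sketch. It is not the environment's actual implementation: `extract_answer`, `dual_format_score`, and `exact_match` are hypothetical names, the `<think>` and `<answer>` tag handling is an assumption about the expected format, and the toy exact-match scorer stands in for a real reasoning-gym task scorer.

```python
import re
from typing import Callable, Optional

# Assumed tag format: reasoning inside <think>...</think>,
# final answer inside <answer>...</answer>.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(response: str) -> Optional[str]:
    """Return the text inside the first <answer> block, or None if absent.

    Assumption: answer tags that appear inside the <think> block are ignored.
    """
    visible = THINK_RE.sub("", response)
    match = ANSWER_RE.search(visible)
    return match.group(1).strip() if match else None


def dual_format_score(response: str, score_fn: Callable[[str], float]) -> float:
    """Score the raw and the tagged form of the answer and keep the higher score."""
    answer = extract_answer(response)
    if answer is None:
        return 0.0  # format violation: no <answer> tags
    raw_score = score_fn(answer)
    tagged_score = score_fn(f"<answer>{answer}</answer>")
    return max(raw_score, tagged_score)


def exact_match(expected: str) -> Callable[[str], float]:
    """Hypothetical binary scorer: 1.0 on an exact string match, else 0.0."""
    return lambda candidate: 1.0 if candidate.strip() == expected else 0.0


if __name__ == "__main__":
    scorer = exact_match("5")
    good = "<think>2 + 3 = 5</think>\n<answer>5</answer>"
    print(dual_format_score(good, scorer))  # tagged, correct answer
    print(dual_format_score("5", scorer))   # correct value, but no tags
```

With this toy scorer, the tagged response scores 1.0 because the raw form matches, while the untagged response scores 0.0 even though its value is correct, which mirrors the strict format rule above. Taking the maximum of the two forms means a task scorer that expects either the bare answer or the tagged answer can still award credit.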