mirror of
https://github.com/NousResearch/atropos.git
synced 2026-04-19 12:57:58 +00:00
Add reasoning gym env
This commit is contained in:
parent
26c0a39555
commit
79188d8d6a
5 changed files with 1537 additions and 0 deletions
216
environments/reasoning_gym_environment/README.md
Normal file
216
environments/reasoning_gym_environment/README.md
Normal file
|
|
@ -0,0 +1,216 @@
|
|||
# ReasoningGym Environment
|
||||
|
||||
A reinforcement learning environment for training language models on diverse reasoning tasks using the [reasoning-gym](https://github.com/reasoning-gym/reasoning-gym) library.
|
||||
|
||||
## Overview
|
||||
|
||||
The ReasoningGym environment provides access to 100+ reasoning tasks spanning mathematics, logic, programming, and more. It supports:
|
||||
|
||||
- **Diverse Task Types**: Arithmetic, algebra, logic puzzles, programming challenges, and more
|
||||
- **Strict Answer Format Enforcement**: Models must use `<answer>` tags or receive 0 score
|
||||
- **Dual-Format Scoring**: Tries both raw answers and tagged answers, using the higher score
|
||||
- **Data Collection**: Optional rollout dumping for successful and failed attempts
|
||||
- **Comprehensive Logging**: Detailed progress tracking and debugging information
|
||||
|
||||
## Features
|
||||
|
||||
### Task Diversity
|
||||
- 100+ tasks from reasoning-gym including GSM Symbolic, ARC, Sudoku, and more
|
||||
- Automatic task discovery from the reasoning-gym registry
|
||||
- Fallback to comprehensive task list if discovery fails
|
||||
|
||||
### Scoring System
|
||||
- **Binary Tasks**: 0.0 or 1.0 (most tasks)
|
||||
- **Partial Credit**: Some tasks like GSM Symbolic give 0.01 for wrong but valid numbers
|
||||
- **Continuous Scoring**: Word Ladder, Sentence Reordering use percentage-based scoring
|
||||
- **Length Penalty**: Applied to overly long responses when all are correct
|
||||
|
||||
### Data Collection
|
||||
- **Successful Rollouts**: Save groups with scores above configurable threshold
|
||||
- **Failed Rollouts**: Save completely failed groups (all 0 scores) for debugging
|
||||
- **Progress Tracking**: Shows buffer progress toward save thresholds
|
||||
- **JSONL Format**: Easy to process saved data
|
||||
|
||||
## Configuration
|
||||
|
||||
### Key Parameters
|
||||
|
||||
```python
|
||||
class ReasoningGymEnvConfig(BaseEnvConfig):
|
||||
dump_rollouts: bool = False # Save successful rollouts
|
||||
dump_failed_rollouts: bool = False # Save failed rollouts for debugging
|
||||
rollout_save_score_threshold: float = 0.7 # Minimum score to save group
|
||||
debug_logging: bool = False # Enable verbose logging
|
||||
suppress_base_env_logs: bool = True # Hide base environment logs
|
||||
seed: int = 42 # Random seed for reproducibility
|
||||
```
|
||||
|
||||
### Example Configuration
|
||||
|
||||
```python
|
||||
env_config = ReasoningGymEnvConfig(
|
||||
tokenizer_name="NousResearch/DeepHermes-3-Llama-3-8B-Preview",
|
||||
group_size=16,
|
||||
max_token_length=1024 * 16,
|
||||
dump_rollouts=True,
|
||||
dump_failed_rollouts=True,
|
||||
rollout_save_score_threshold=0.7,
|
||||
debug_logging=True,
|
||||
)
|
||||
```
|
||||
|
||||
## Setup
|
||||
|
||||
### Prerequisites
|
||||
|
||||
1. **reasoning-gym submodule**: Clone the reasoning-gym repository as a submodule:
|
||||
```bash
|
||||
cd atropos/environments/reasoning_gym_environment/
|
||||
git submodule add https://github.com/reasoning-gym/reasoning-gym.git reasoning-gym
|
||||
```
|
||||
|
||||
2. **Dependencies**: Install requirements:
|
||||
```bash
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
### Directory Structure
|
||||
```
|
||||
reasoning_gym_environment/
|
||||
├── reasoning_gym_environment.py # Main environment code
|
||||
├── reasoning-gym/ # Git submodule
|
||||
├── data_dumps/ # Generated rollout data (created automatically)
|
||||
├── requirements.txt # Dependencies
|
||||
└── README.md # This file
|
||||
```
|
||||
|
||||
## Usage
|
||||
|
||||
### Basic Training
|
||||
|
||||
```python
|
||||
from atropos.environments.reasoning_gym_environment import ReasoningGymEnv
|
||||
|
||||
# Initialize environment
|
||||
env_config, server_configs = ReasoningGymEnv.config_init()
|
||||
env = ReasoningGymEnv(env_config, server_configs)
|
||||
|
||||
# Setup and run
|
||||
await env.setup()
|
||||
# Training loop handled by atropos framework
|
||||
```
|
||||
|
||||
### Command Line
|
||||
|
||||
```bash
|
||||
python reasoning_gym_environment.py
|
||||
```
|
||||
|
||||
## System Prompt
|
||||
|
||||
The environment uses a structured reasoning prompt that encourages models to:
|
||||
|
||||
1. Use `<think>` tags for internal reasoning
|
||||
2. Provide final answers in `<answer>` tags
|
||||
3. Follow strict format requirements
|
||||
|
||||
Example model response:
|
||||
```
|
||||
<think>
|
||||
This is a math problem. Let me work through it step by step.
|
||||
2 + 3 = 5
|
||||
</think>
|
||||
|
||||
Looking at this problem, I need to add 2 and 3.
|
||||
|
||||
<answer>5</answer>
|
||||
```
|
||||
|
||||
## Data Output
|
||||
|
||||
### Successful Rollouts
|
||||
Saved to `data_dumps/reasoning_gym_environment_rollouts_{uuid}_{batch}.jsonl`:
|
||||
|
||||
```json
|
||||
{
|
||||
"item_id": "gsm_symbolic",
|
||||
"rollouts": [
|
||||
{
|
||||
"conversation": [
|
||||
{"role": "system", "content": "..."},
|
||||
{"role": "user", "content": "What is 2 + 3?"},
|
||||
{"role": "assistant", "content": "<think>2 + 3 = 5</think>\n<answer>5</answer>"}
|
||||
],
|
||||
"score": 1.0
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### Failed Rollouts
|
||||
Saved to `data_dumps/reasoning_gym_environment_FAILED_rollouts_{uuid}_{batch}.jsonl` with same format but all scores are 0.0.
|
||||
|
||||
## Logging
|
||||
|
||||
The environment provides comprehensive logging:
|
||||
|
||||
- **Setup**: Task discovery and initialization
|
||||
- **Training**: Group scores, task selection, progress tracking
|
||||
- **Data Dumping**: Save progress and file creation
|
||||
- **Format Violations**: When models don't follow answer tag requirements
|
||||
- **Debug Mode**: Detailed scoring and extraction information
|
||||
|
||||
## Task Examples
|
||||
|
||||
### Mathematics
|
||||
- **GSM Symbolic**: Grade school math with symbolic reasoning
|
||||
- **Basic Arithmetic**: Addition, subtraction, multiplication, division
|
||||
- **Algebra**: Linear equations and polynomial manipulation
|
||||
|
||||
### Logic
|
||||
- **Sudoku**: Classic number placement puzzles
|
||||
- **Propositional Logic**: Boolean reasoning tasks
|
||||
- **Knights and Knaves**: Logic puzzles with truth-tellers and liars
|
||||
|
||||
### Programming
|
||||
- **ARC**: Abstract reasoning corpus visual patterns
|
||||
- **Code Generation**: Simple programming challenges
|
||||
- **Algorithm Design**: Sorting, searching, and optimization
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Common Issues
|
||||
|
||||
1. **No tasks discovered**: Ensure reasoning-gym submodule is properly initialized
|
||||
2. **Import errors**: Check that requirements.txt dependencies are installed
|
||||
3. **No rollouts saved**: Verify `dump_rollouts=True` and scores exceed threshold
|
||||
4. **Format violations**: Models not using `<answer>` tags receive 0 scores
|
||||
|
||||
### Debug Mode
|
||||
|
||||
Enable debug logging for detailed information:
|
||||
```python
|
||||
env_config.debug_logging = True
|
||||
```
|
||||
|
||||
This shows:
|
||||
- Answer extraction attempts
|
||||
- Scoring method comparisons
|
||||
- Format violation details
|
||||
- Task selection patterns
|
||||
|
||||
## Performance Notes
|
||||
|
||||
- **Task Selection**: Random selection ensures diverse training
|
||||
- **Evaluation**: Fixed test set with deterministic seed for reproducible results
|
||||
- **Memory Usage**: Buffers are cleared after saving to prevent memory leaks
|
||||
- **Scoring Efficiency**: Dual-format scoring tries both methods and uses higher score
|
||||
|
||||
## Contributing
|
||||
|
||||
When adding new features:
|
||||
|
||||
1. Maintain backward compatibility with existing configs
|
||||
2. Add appropriate logging for debugging
|
||||
3. Update this README with new configuration options
|
||||
4. Test with both successful and failed rollout scenarios
|
||||
Loading…
Add table
Add a link
Reference in a new issue