# Model Evaluation Framework
A simple asynchronous framework for evaluating language models on reasoning tasks using the OpenRouter API.
## Evaluation Results Repository
To keep the main repo clean and free of evaluation traces from different models, we store all evaluation results in a separate repository: reasoning-gym-eval.
If you run evaluations and want to contribute your results, please create a pull request in the reasoning-gym-eval repository, not in the main reasoning-gym repo.
## Overview
This framework provides tools to evaluate language models on the `reasoning_gym` datasets. It supports:

- Concurrent evaluation of multiple questions and datasets
- Customizable dataset configurations
- Automatic result aggregation and summary generation
- Rate limiting for API calls (a minimal sketch of this follows below)
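The control flow behind these features can be pictured with a short sketch. The snippet below is illustrative only, not the actual `eval.py` implementation: it assumes the standard OpenRouter chat completions endpoint, uses `aiohttp` for the HTTP calls, and caps in-flight requests with an `asyncio.Semaphore`, analogous to the `max_concurrent` config option.

```python
# Minimal sketch of semaphore-bounded concurrent evaluation.
# Illustrative only -- not the actual eval.py implementation.
import asyncio
import os

import aiohttp

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
MAX_CONCURRENT = 10  # mirrors the max_concurrent config option


async def ask_model(session: aiohttp.ClientSession, sem: asyncio.Semaphore,
                    model: str, question: str) -> str:
    async with sem:  # at most MAX_CONCURRENT requests in flight at once
        payload = {"model": model,
                   "messages": [{"role": "user", "content": question}]}
        headers = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}
        async with session.post(OPENROUTER_URL, json=payload, headers=headers) as resp:
            data = await resp.json()
            return data["choices"][0]["message"]["content"]


async def main() -> None:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    questions = ["What is (3 + 2i) * (1 - 4i)?", "Compute 17 + 25 + 9."]
    async with aiohttp.ClientSession() as session:
        answers = await asyncio.gather(
            *(ask_model(session, sem, "meta-llama/llama-3.3-70b-instruct", q)
              for q in questions))
    for question, answer in zip(questions, answers):
        print(question, "->", answer)


if __name__ == "__main__":
    asyncio.run(main())
```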
## Setup
- Install `reasoning-gym` in development mode:

  ```bash
  pip install -e ..
  ```

- Install the additional dependencies required for evaluation:

  ```bash
  pip install -r requirements-eval.txt
  ```

- Set your OpenRouter API key as an environment variable:

  ```bash
  export OPENROUTER_API_KEY=your-api-key
  ```

- Prepare your evaluation configuration in YAML or JSON format (see the example in `example_config.yaml`):
```yaml
# Example configuration
model: "meta-llama/llama-3.3-70b-instruct"
provider: "Hyperbolic"  # Optional, can be omitted
output_dir: "results"
max_concurrent: 10
default_size: 20  # Default size for all datasets
default_seed: 42  # Default seed for all datasets
categories:
  - category: "algebra"
    datasets:
      - dataset: "complex_arithmetic"
        params:
          min_real: -10
          max_real: 10
          min_imag: -10
          max_imag: 10
  - category: "arithmetic"
    datasets:
      - dataset: "chain_sum"
        size: 12
        seed: 43
        params:
          min_digits: 2
          allow_negation: true
      - dataset: "products"
        size: 10
        seed: 43
        params:
          min_digits: 2
          allow_negation: true
```
For example, to evaluate Claude 3.5 Sonnet on algorithmic datasets:
model: "anthropic/claude-3.5-sonnet"
provider: "Anthropic"
output_dir: "results"
max_concurrent: 5
default_size: 50
default_seed: 45
categories:
- category: "algorithmic"
datasets:
- dataset: "count_primes"
- dataset: "game_of_life"
- dataset: "graph_color"
- dataset: "isomorphic_strings"
- dataset: "letter_jumble"
- dataset: "rotate_matrix"
- dataset: "sentence_reordering"
- dataset: "string_manipulation"
- dataset: "word_ladder"
- dataset: "word_sorting"
## Generating Configurations
You can generate a configuration file containing all registered datasets using the `generate_config.py` script:

```bash
python generate_config.py --output my_config.yaml --model "anthropic/claude-3.5-sonnet" --provider "Anthropic" --size 50 --seed 42
```
Options:

- `--output`: Output YAML file path (default: `all_datasets.yaml`)
- `--model`: Model name (default: `openai/gpt-4`)
- `--provider`: Provider name (default: `None`)
- `--size`: Default dataset size (default: 100)
- `--seed`: Default dataset seed (default: 42)
- `--include-params`: Include all configuration parameters (default: False)
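Conceptually, the script enumerates the registered datasets and serializes them into the structure shown earlier. A stripped-down sketch of that idea looks like this; the hardcoded category map stands in for reasoning-gym's real dataset registry:

```python
# Stripped-down sketch of what generate_config.py produces. The hardcoded
# category map below stands in for reasoning-gym's real dataset registry.
import yaml

DATASETS_BY_CATEGORY = {
    "algebra": ["complex_arithmetic"],
    "arithmetic": ["chain_sum", "products"],
}


def build_config(model: str, provider: str | None, size: int, seed: int) -> dict:
    config = {
        "model": model,
        "output_dir": "results",
        "max_concurrent": 10,
        "default_size": size,
        "default_seed": seed,
        "categories": [
            {"category": cat, "datasets": [{"dataset": name} for name in names]}
            for cat, names in DATASETS_BY_CATEGORY.items()
        ],
    }
    if provider is not None:
        config["provider"] = provider  # provider is optional, see above
    return config


with open("my_config.yaml", "w") as f:
    yaml.safe_dump(
        build_config("anthropic/claude-3.5-sonnet", "Anthropic", size=50, seed=42),
        f, sort_keys=False)
```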
## Running Evaluations
To run evaluations:

```bash
python eval.py --config configs/your_config.yaml
```

For example:

```bash
python eval.py --config example_config.yaml --full-results
```
The results will be stored in a directory named after the model and timestamp, containing:
- `summary.json` - Summary of all results
- `results.json` - Full results (if `--full-results` is specified)
- Individual dataset results in category subdirectories
For example:

```
results/
└── meta-llama_llama-3.3-70b-instruct_20250227_162030/
    ├── summary.json
    ├── results.json
    ├── algebra/
    │   └── complex_arithmetic.json
    └── arithmetic/
        ├── chain_sum.json
        └── products.json
```
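Since all output files are plain JSON, post-processing a finished run is straightforward. The snippet below is a sketch only: the `dataset_scores` field name is a hypothetical placeholder, as the actual schema is defined by `eval.py` (inspect your own `summary.json` for the real field names).

```python
# Sketch for inspecting a finished run. "dataset_scores" is a hypothetical
# placeholder field name; check summary.json for the actual schema.
import json
from pathlib import Path

run_dir = Path("results/meta-llama_llama-3.3-70b-instruct_20250227_162030")
summary = json.loads((run_dir / "summary.json").read_text())

for dataset, score in summary.get("dataset_scores", {}).items():
    print(f"{dataset}: {score:.2%}")
```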
Please upload your results to reasoning-gym-eval.