Model Evaluation Framework

A simple asynchronous framework for evaluating language models on reasoning tasks using the OpenRouter API.

Evaluation Results Repository

To keep the main repository free of evaluation traces from different models, all evaluation results are stored in a separate repository: reasoning-gym-eval.

If you run evaluations and want to contribute your results, please create a pull request in the reasoning-gym-eval repository, not in the main reasoning-gym repo.

Overview

This framework provides tools to evaluate language models on the reasoning_gym datasets. It supports:

  • Concurrent evaluation of multiple questions and datasets
  • Customizable dataset configurations
  • Automatic result aggregation and summary generation
  • Rate limiting for API calls
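
The concurrency and rate limiting follow the usual asyncio pattern: a semaphore caps the number of in-flight API calls (compare the max_concurrent config option below). The following is a minimal illustrative sketch of that pattern, not eval.py's actual implementation; the request and scoring bodies are placeholders:

import asyncio

MAX_CONCURRENT = 10  # mirrors the max_concurrent config option

async def evaluate_question(question: str, semaphore: asyncio.Semaphore) -> float:
    """Placeholder: send one question to the model and score the reply."""
    async with semaphore:         # at most MAX_CONCURRENT calls in flight
        await asyncio.sleep(0.1)  # stands in for the real API request
        return 1.0                # stands in for the real scoring logic

async def evaluate_dataset(questions: list[str]) -> float:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    scores = await asyncio.gather(
        *(evaluate_question(q, semaphore) for q in questions)
    )
    return sum(scores) / len(scores)

print(asyncio.run(evaluate_dataset([f"q{i}" for i in range(100)])))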

Setup

  1. Install reasoning-gym in development mode:
pip install -e ..
  2. Install the additional dependencies required for evaluation:
pip install -r requirements-eval.txt
  3. Set your API key (if required by the API):

    For OpenRouter, you can set it as an environment variable:

    export OPENROUTER_API_KEY=your-api-key
    

    Or provide it directly when running the script:

    python eval.py --config your_config.yaml --api-key your-api-key
    

    Note: API key is optional for some APIs (e.g., local deployments).
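
    Internally, a key given on the command line typically takes precedence over the environment variable. A minimal sketch of that resolution pattern (the function name here is illustrative, not eval.py's actual internals):

    import os

    def resolve_api_key(cli_api_key: str | None) -> str | None:
        # An explicit --api-key wins; otherwise fall back to the environment.
        # None is acceptable for APIs that need no key (e.g. local deployments).
        return cli_api_key or os.environ.get("OPENROUTER_API_KEY")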

  4. Prepare your evaluation configuration in YAML or JSON format (see example in example_config.yaml):

# Example configuration
model: "meta-llama/llama-3.3-70b-instruct"
provider: "Hyperbolic"  # Optional, can be omitted
output_dir: "results"
max_concurrent: 10
default_size: 20  # Default size for all datasets
default_seed: 42  # Default seed for all datasets
max_tokens: 32768  # Maximum generation length (optional)
temperature: 0.6   # Generation temperature (optional)
top_p: 0.95        # Top-p sampling parameter (optional)
system_prompt_id: "default"  # Use a predefined system prompt by ID (optional)
# system_prompt: "Your custom system prompt here"  # Or specify a custom system prompt directly

categories:
  - category: "algebra"
    datasets:
      - dataset: "complex_arithmetic"
        params:
          min_real: -10
          max_real: 10
          min_imag: -10
          max_imag: 10

  - category: "arithmetic"
    datasets:
      - dataset: "chain_sum"
        size: 12
        seed: 43
        params:
          min_digits: 2
          allow_negation: true

      - dataset: "products"
        size: 10
        seed: 43
        params:
          min_digits: 2
          allow_negation: true

For example, to evaluate Claude 3.5 Sonnet on algorithmic datasets:

model: "anthropic/claude-3.5-sonnet"
provider: "Anthropic"
output_dir: "results"
max_concurrent: 5
default_size: 50
default_seed: 45

categories:
  - category: "algorithmic"
    datasets:
      - dataset: "count_primes"
      - dataset: "game_of_life"
      - dataset: "graph_color"
      - dataset: "isomorphic_strings"
      - dataset: "letter_jumble"
      - dataset: "rotate_matrix"
      - dataset: "sentence_reordering"
      - dataset: "string_manipulation"
      - dataset: "word_ladder"
      - dataset: "word_sorting"

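The parsing of these files is handled by eval_config.py; as a rough illustration of how per-dataset size/seed fall back to the top-level defaults, here is a minimal hand-rolled loader assuming PyYAML is installed (the real schema lives in eval_config.py):

import yaml

def load_config(path: str) -> dict:
    """Illustrative loader: resolve per-dataset size/seed against the defaults."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    # Datasets without an explicit size/seed inherit the top-level defaults.
    for category in cfg.get("categories", []):
        for entry in category.get("datasets", []):
            entry.setdefault("size", cfg.get("default_size"))
            entry.setdefault("seed", cfg.get("default_seed"))
    return cfg

cfg = load_config("example_config.yaml")
print(cfg["model"], sum(len(c["datasets"]) for c in cfg["categories"]))
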
Generating Configurations

You can generate a configuration file with all registered datasets using the generate_config.py script:

python generate_config.py --output my_config.yaml --model "anthropic/claude-3.5-sonnet" --provider "Anthropic" --size 50 --seed 42

Options:

  • --output: Output YAML file path (default: all_datasets.yaml)
  • --model: Model name (default: openai/gpt-4)
  • --provider: Provider name (default: None)
  • --size: Default dataset size (default: 100)
  • --seed: Default dataset seed (default: 42)
  • --include-params: Include all configuration parameters (default: False)
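
The generated file is just the YAML format documented above. For illustration, a hand-rolled equivalent with PyYAML (only two dataset names shown; this is not tied to generate_config.py's actual implementation, which enumerates all registered datasets):

import yaml

config = {
    "model": "anthropic/claude-3.5-sonnet",
    "provider": "Anthropic",
    "output_dir": "results",
    "max_concurrent": 5,
    "default_size": 50,
    "default_seed": 42,
    "categories": [
        {
            "category": "algorithmic",
            # Two datasets shown; generate_config.py lists all registered ones.
            "datasets": [{"dataset": name} for name in ("count_primes", "game_of_life")],
        }
    ],
}

with open("my_config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)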

Running Evaluations

To run evaluations:

python eval.py --config configs/your_config.yaml

For example:

python eval.py --config example_config.yaml --full-results

You can specify a different API base URL if needed:

python eval.py --config example_config.yaml --base-url "https://api.together.xyz/v1" --api-key "your-together-api-key"
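
OpenRouter exposes an OpenAI-compatible API, so --base-url and --api-key correspond to the standard client parameters. A minimal sketch of what such a call looks like with the openai Python package (illustrative only; eval.py's internals may differ):

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://openrouter.ai/api/v1",  # replaced by --base-url when given
    api_key="your-api-key",                   # replaced by --api-key / the env var
)

async def ask(question: str) -> str:
    response = await client.chat.completions.create(
        model="meta-llama/llama-3.3-70b-instruct",
        messages=[{"role": "user", "content": question}],
        temperature=0.6,  # sampling settings as in the config example above
        top_p=0.95,
    )
    return response.choices[0].message.content

print(asyncio.run(ask("Compute (3 + 2i) * (1 - 4i).")))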

The results will be stored in a directory named after the model and timestamp, containing:

  • summary.json - Summary of all results
  • results.json - Full results (if --full-results is specified)
  • Individual dataset results in category subdirectories

For example:

results/
└── meta-llama_llama-3.3-70b-instruct_20250227_162030/
    ├── summary.json
    ├── results.json
    ├── algebra/
    │   └── complex_arithmetic.json
    └── arithmetic/
        ├── chain_sum.json
        └── products.json

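To compare runs, the per-run summary.json files can be collected programmatically. A small sketch that walks the results directory, assuming the layout shown above (inspect the top-level keys before relying on a specific schema, which is defined by eval.py):

import json
from pathlib import Path

# Walk every run directory under results/ and load its summary.
for summary_path in sorted(Path("results").glob("*/summary.json")):
    run_name = summary_path.parent.name
    with summary_path.open() as f:
        summary = json.load(f)
    # Print the top-level keys first; the exact schema comes from eval.py.
    print(run_name, "->", sorted(summary))
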
Please upload your results to reasoning-gym-eval.