# Model Evaluation Framework
A simple asynchronous framework for evaluating language models on reasoning tasks using the OpenRouter API.
## Evaluation Results Repository
To keep the main repo clean and avoid cluttering it with evaluation traces from different models, we store all evaluation results in a separate repository: reasoning-gym-eval.
If you run evaluations and want to contribute your results, please create a pull request in the reasoning-gym-eval repository, not in the main reasoning-gym repo.
## Overview

This framework provides tools to evaluate language models on the `reasoning_gym` datasets. It supports:

- Concurrent evaluation of multiple questions and datasets (see the sketch after this list)
- Customizable dataset configurations
- Automatic result aggregation and summary generation
- Rate limiting for API calls
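For orientation, the core pattern is sketched below. This is a hand-written illustration, not the actual code in `eval.py`: it assumes the `openai` Python package (OpenRouter exposes an OpenAI-compatible endpoint at `https://openrouter.ai/api/v1`) and uses an `asyncio.Semaphore` as a stand-in for the framework's rate limiter. Names like `MAX_CONCURRENT_REQUESTS` are made up for the example.

```python
# Sketch of concurrent evaluation with simple rate limiting.
import asyncio
import os

from openai import AsyncOpenAI

# OpenRouter is OpenAI-compatible, so the standard client works.
client = AsyncOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Cap the number of in-flight API calls (illustrative value).
MAX_CONCURRENT_REQUESTS = 8
semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

async def ask(model: str, question: str) -> str:
    async with semaphore:  # at most N requests run concurrently
        response = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        )
        return response.choices[0].message.content

async def main() -> None:
    questions = ["Sort these words alphabetically: pear, apple, fig"]
    answers = await asyncio.gather(
        *(ask("deepseek/deepseek-r1", q) for q in questions)
    )
    print(answers)

if __name__ == "__main__":
    asyncio.run(main())
```

Capping in-flight requests with a semaphore keeps the evaluation concurrent while staying under provider rate limits.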
## Setup

- Install reasoning-gym in development mode:

  ```bash
  pip install -e ..
  ```

- Install the additional dependencies required for evaluation:

  ```bash
  pip install -r requirements-eval.txt
  ```

- Set your OpenRouter API key as an environment variable:

  ```bash
  export OPENROUTER_API_KEY=your-api-key
  ```

- Prepare your dataset configuration in YAML format (see the examples in `yaml/<model_name>/algorithmic.yaml`, e.g. `yaml/r1/algorithmic.yaml`):

  ```yaml
  model: model-name
  provider: provider-name
  category: category-name
  datasets:
    - dataset1
    - dataset2
  eval_dir: results/model-name
  dataset_size: 50
  dataset_seed: 42
  developer_role: system
  ```
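The `dataset_size` and `dataset_seed` fields correspond to the `size` and `seed` arguments of `reasoning_gym.create_dataset`. The following hand-written sketch (assuming PyYAML; this is not the actual loading logic in `eval_config.py`) shows how a config translates into datasets:

```python
# Illustrative sketch: build the configured datasets from a YAML file.
import yaml  # PyYAML

import reasoning_gym

with open("yaml/r1/algorithmic.yaml") as f:
    config = yaml.safe_load(f)

for name in config["datasets"]:
    # dataset_size controls how many entries are generated per dataset;
    # dataset_seed makes the generated questions reproducible.
    dataset = reasoning_gym.create_dataset(
        name, size=config["dataset_size"], seed=config["dataset_seed"]
    )
    entry = next(iter(dataset))
    print(f"{name}: {entry['question'][:60]}")
```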
For example, the following file runs an evaluation of DeepSeek R1 on the algorithmic datasets:
```yaml
model: deepseek/deepseek-r1
provider: Nebius
category: algorithmic
datasets:
  - ab
  - base_conversion
  - binary_matrix
  - caesar_cipher
  - count_primes
  - game_of_life
  - graph_color
  - group_anagrams
  - isomorphic_strings
  - letter_counting
  - letter_jumble
  - manipulate_matrix
  - number_filtering
  - number_sorting
  - palindrome
  - pool_matrix
  - ransom_note
  - rotate_matrix
  - sentence_reordering
  - spell_backward
  - spiral_matrix
  - string_insertion
  - string_manipulation
  - string_synthesis
  - word_ladder
  - word_sequence_reversal
  - word_sorting
eval_dir: results/deepseek-r1
dataset_size: 50
dataset_seed: 45
developer_role: system
```
The following would run Claude 3.5 Sonnet on the algorithmic datasets:
```yaml
model: anthropic/claude-3.5-sonnet
category: algorithmic
provider: Anthropic
datasets:
  - count_primes
  - game_of_life
  - graph_color
  - group_anagrams
  - isomorphic_strings
  - letter_counting
  - letter_jumble
  - manipulate_matrix
  - number_filtering
  - number_sorting
  - palindrome
  - pool_matrix
  - ransom_note
  - rotate_matrix
  - sentence_reordering
  - spell_backward
  - spiral_matrix
  - string_insertion
  - string_manipulation
  - string_synthesis
  - word_ladder
  - word_sequence_reversal
  - word_sorting
eval_dir: results/claude-3.5-sonnet
dataset_size: 50
dataset_seed: 45
developer_role: system
```
Each configuration file specifies an individual model and provider.
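For context on what a single evaluation step involves: every `reasoning_gym` dataset ships its own scorer, so free-form model answers can be graded without task-specific code. A minimal sketch, with a hard-coded string standing in for a model response:

```python
import reasoning_gym

# Generate one of the datasets listed in the configs above.
dataset = reasoning_gym.create_dataset("word_sorting", size=50, seed=45)
entry = next(iter(dataset))
print(entry["question"])

# score_answer grades a free-form answer against the entry,
# returning a score between 0.0 and 1.0.
score = dataset.score_answer(answer="apple, fig, pear", entry=entry)
print(score)
```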
## Running Evaluations

To run an evaluation:

```bash
python eval.py --yaml <path-to-yaml-file>
```

For example, to run the R1 evaluations defined in `algorithmic.yaml`:

```bash
python eval.py --yaml yaml/r1/algorithmic.yaml
```

The results for each model and dataset are stored as JSON files in a new folder under the output directory, e.g. `r1/algorithmic/proposition_logic.json`.
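If you want to inspect results programmatically, the per-dataset files are plain JSON. The exact schema is whatever `eval.py` writes, so the sketch below only discovers and loads the files without assuming any particular fields:

```python
# Illustrative: walk an eval_dir and load each per-dataset result file.
import json
from pathlib import Path

eval_dir = Path("results/deepseek-r1")  # matches eval_dir in the config
for path in sorted(eval_dir.rglob("*.json")):
    with path.open() as f:
        results = json.load(f)
    print(path.relative_to(eval_dir), "loaded")
```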
Please upload records of your results to reasoning-gym-eval.