
Model Evaluation Framework

A simple asynchronous framework for evaluating language models on reasoning tasks using the OpenRouter API.

Evaluation Results Repository

In order to keep the main repo clean and not clutter it with evaluation traces from different models, we store all evaluation results in a separate repository: reasoning-gym-eval.

If you run evaluations and want to contribute your results, please create a pull request in the reasoning-gym-eval repository, not in the main reasoning-gym repo.

Overview

This framework provides tools to evaluate language models on the reasoning_gym datasets. It supports:

  • Concurrent evaluation of multiple questions and datasets
  • Customizable dataset configurations
  • Automatic result aggregation and summary generation
  • Rate limiting for API calls
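The core pattern is to fan out completion requests concurrently while capping the number of in-flight API calls. The snippet below is a minimal, hypothetical sketch of that pattern using the openai client pointed at the OpenRouter endpoint plus an asyncio semaphore; the actual eval.py may structure this differently.

import asyncio
import os

from openai import AsyncOpenAI

# Hypothetical sketch of rate-limited, concurrent completions.
# The real eval.py may use different names, limits, and settings.
client = AsyncOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
semaphore = asyncio.Semaphore(10)  # at most 10 concurrent API calls

async def complete(question: str, model: str) -> str:
    async with semaphore:
        response = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        )
        return response.choices[0].message.content

async def evaluate(questions: list[str], model: str) -> list[str]:
    # Launch all questions at once; the semaphore enforces the rate limit.
    return await asyncio.gather(*(complete(q, model) for q in questions))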

Setup

  1. Install reasoning-gym in development mode:
pip install -e ..
  2. Install the additional dependencies required for evaluation:
pip install -r requirements-eval.txt
  3. Set your OpenRouter API key as an environment variable:
export OPENROUTER_API_KEY=your-api-key
  4. Prepare your dataset configuration in YAML format (see the examples in yaml/<model_name>/algorithmic.yaml, e.g. yaml/r1/algorithmic.yaml):
model: model-name
provider: provider-name
category: category-name
datasets:
  - dataset1
  - dataset2
eval_dir: results/model-name
dataset_size: 50
dataset_seed: 42
developer_role: system
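
A configuration like this maps naturally onto a small dataclass. The snippet below is a sketch of how the fields above might be loaded with PyYAML; it mirrors the schema shown here and is not necessarily the actual eval_config.py.

from dataclasses import dataclass, field

import yaml

# Hypothetical sketch mirroring the YAML schema above; the real
# eval_config.py may define additional fields or defaults.
@dataclass
class EvalConfig:
    model: str
    provider: str
    category: str
    datasets: list[str] = field(default_factory=list)
    eval_dir: str = "results"
    dataset_size: int = 50
    dataset_seed: int = 42
    developer_role: str = "system"

def load_config(path: str) -> EvalConfig:
    with open(path) as f:
        return EvalConfig(**yaml.safe_load(f))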

For example, the following configuration runs an evaluation of DeepSeek R1 on the algorithmic datasets.

model: deepseek/deepseek-r1
provider: Nebius
category: algorithmic
datasets:
  - ab
  - base_conversion
  - binary_matrix
  - caesar_cipher
  - count_primes
  - game_of_life
  - graph_color
  - group_anagrams
  - isomorphic_strings
  - letter_counting
  - letter_jumble
  - manipulate_matrix
  - number_filtering
  - number_sorting
  - palindrome
  - pool_matrix
  - ransom_note
  - rotate_matrix
  - sentence_reordering
  - spell_backward
  - spiral_matrix
  - string_insertion
  - string_manipulation
  - string_synthesis
  - word_ladder
  - word_sequence_reversal
  - word_sorting
eval_dir: results/deepseek-r1
dataset_size: 50
dataset_seed: 45
developer_role: system

The following configuration would run Claude 3.5 Sonnet on the algorithmic datasets.

model: anthropic/claude-3.5-sonnet
category: algorithmic
provider: Anthropic
datasets:
  - count_primes
  - game_of_life
  - graph_color
  - group_anagrams
  - isomorphic_strings
  - letter_counting
  - letter_jumble
  - manipulate_matrix
  - number_filtering
  - number_sorting
  - palindrome
  - pool_matrix
  - ransom_note
  - rotate_matrix
  - sentence_reordering
  - spell_backward
  - spiral_matrix
  - string_insertion
  - string_manipulation
  - string_synthesis
  - word_ladder
  - word_sequence_reversal
  - word_sorting
eval_dir: results/claude-3.5-sonnet
dataset_size: 50
dataset_seed: 45
developer_role: system

Each configuration file specifies an individual model and provider.

Running Evaluations

To run an evaluation:

python eval.py --yaml <path-to-yaml-file>

For example, to run the R1 evaluation on the algorithmic datasets:

python eval.py --yaml yaml/r1/algorithmic.yaml
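
Internally, the evaluation builds each configured dataset with reasoning_gym and scores the model's answers against it. The following is a rough, hypothetical sketch of that core loop; the actual eval.py adds concurrency, prompting, retries, and result serialization, and ask_model here is a stand-in for the API call.

import reasoning_gym

# Hypothetical sketch of the core scoring loop used per dataset.
dataset = reasoning_gym.create_dataset("word_sorting", size=50, seed=45)

total = 0.0
for entry in dataset:
    answer = ask_model(entry["question"])  # hypothetical helper calling the API
    total += dataset.score_answer(answer=answer, entry=entry)

print(f"mean score: {total / len(dataset):.3f}")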

The results for each model are stored per dataset as JSON files in a new folder under the configured eval_dir, e.g. r1/algorithmic/proposition_logic.json. Please upload records of your results to reasoning-gym-eval.
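
If you want to summarize a finished run locally, the per-dataset JSON files can be aggregated with a few lines of Python. This is only a hypothetical sketch: it assumes each file contains a list of records with a "score" field, which may not match the exact schema written by eval.py.

import json
from pathlib import Path

# Hypothetical sketch: average per-question scores in each result file.
# Assumes a list of records with a "score" field per JSON file.
results_dir = Path("results/deepseek-r1/algorithmic")
for path in sorted(results_dir.glob("*.json")):
    records = json.loads(path.read_text())
    scores = [r["score"] for r in records]
    print(f"{path.stem}: {sum(scores) / len(scores):.3f} ({len(scores)} questions)")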