mirror of
https://github.com/open-thought/reasoning-gym.git
synced 2026-04-19 12:58:07 +00:00
141 lines
3.5 KiB
Markdown
141 lines
3.5 KiB
Markdown
# Model Evaluation Framework
|
|
|
|
A simple asynchronous framework for evaluating language models on reasoning tasks using the OpenRouter API.
|
|
|
|
## Evaluation Results Repository
|
|
|
|
In order to keep the main repo clean and not clutter it with evaluation traces from different models, we store all evaluation results in a separate repository: [reasoning-gym-eval](https://github.com/open-thought/reasoning-gym-eval).
|
|
|
|
If you run evaluations and want to contribute your results, please create a pull request in the [reasoning-gym-eval](https://github.com/open-thought/reasoning-gym-eval) repository, not in the main reasoning-gym repo.
|
|
|
|
## Overview
|
|
|
|
This framework provides tools to evaluate language models on the reasoning_gym datasets. It supports:
|
|
- Concurrent evaluation of multiple questions and datasets
|
|
- Customizable dataset configurations
|
|
- Automatic result aggregation and summary generation
|
|
- Rate limiting for API calls
|
|
|
|
## Setup
|
|
|
|
1. Install reasoning-gym in development mode:
|
|
```bash
|
|
pip install -e ..
|
|
```
|
|
|
|
2. Install the additional dependencies required for evaluation:
|
|
```bash
|
|
pip install -r requirements-eval.txt
|
|
```
|
|
|
|
3. Set your OpenRouter API key as an environment variable:
|
|
```bash
|
|
export OPENROUTER_API_KEY=your-api-key
|
|
```
|
|
|
|
|
|
4. Prepare your dataset configuration in YAML format (see examples in `yaml/<model_name>/algorithmic.yaml` e.g `yaml/r1/algorithmic.yaml`):
|
|
```yaml
|
|
model: model-name
|
|
provider: provider-name
|
|
category: category-name
|
|
datasets:
|
|
- dataset1
|
|
- dataset2
|
|
eval_dir: results/model-name
|
|
dataset_size: 50
|
|
dataset_seed: 42
|
|
developer_role: system
|
|
|
|
```
|
|
For example the following file will run an evaluation for deepseek r1 for algorithmic datasets.
|
|
``` yaml
|
|
model: deepseek/deepseek-r1
|
|
provider: Nebius
|
|
category: algorithmic
|
|
datasets:
|
|
- ab
|
|
- base_conversion
|
|
- binary_matrix
|
|
- caesar_cipher
|
|
- count_primes
|
|
- game_of_life
|
|
- graph_color
|
|
- group_anagrams
|
|
- isomorphic_strings
|
|
- letter_counting
|
|
- letter_jumble
|
|
- manipulate_matrix
|
|
- number_filtering
|
|
- number_sorting
|
|
- palindrome
|
|
- pool_matrix
|
|
- ransom_note
|
|
- rotate_matrix
|
|
- sentence_reordering
|
|
- spell_backward
|
|
- spiral_matrix
|
|
- string_insertion
|
|
- string_manipulation
|
|
- string_synthesis
|
|
- word_ladder
|
|
- word_sequence_reversal
|
|
- word_sorting
|
|
eval_dir: results/deepseek-r1
|
|
dataset_size: 50
|
|
dataset_seed: 45
|
|
developer_role: system
|
|
|
|
```
|
|
|
|
The following would run Claude 3.5 on the algorithmic dataset.
|
|
```yaml
|
|
model: anthropic/claude-3.5-sonnet
|
|
category: algorithmic
|
|
provider: Anthropic
|
|
datasets:
|
|
- count_primes
|
|
- game_of_life
|
|
- graph_color
|
|
- group_anagrams
|
|
- isomorphic_strings
|
|
- letter_counting
|
|
- letter_jumble
|
|
- manipulate_matrix
|
|
- number_filtering
|
|
- number_sorting
|
|
- palindrome
|
|
- pool_matrix
|
|
- ransom_note
|
|
- rotate_matrix
|
|
- sentence_reordering
|
|
- spell_backward
|
|
- spiral_matrix
|
|
- string_insertion
|
|
- string_manipulation
|
|
- string_synthesis
|
|
- word_ladder
|
|
- word_sequence_reversal
|
|
- word_sorting
|
|
eval_dir: results/claude-3.5-sonnet
|
|
dataset_size: 50
|
|
dataset_seed: 45
|
|
developer_role: system
|
|
```
|
|
Here you specify individual model and provider
|
|
|
|
### Running Evaluations
|
|
|
|
To run evaluations
|
|
```
|
|
python eval.py --yaml <path-to yaml file>
|
|
```
|
|
e.g
|
|
```
|
|
python eval.py --yaml yaml/r1/algorithmic.yaml
|
|
```
|
|
To run r1 evaluations on algorithmic.yaml
|
|
|
|
|
|
The results of individual model on a dataset will be stored in a new folder in the directory E.g `r1/algorithmic/proposition_logic.json`.
|
|
Please upload records of your results to [reasoning-gym-eval](https://github.com/open-thought/reasoning-gym-eval).
|