mirror of
https://github.com/open-thought/reasoning-gym.git
synced 2026-04-22 16:49:06 +00:00
Eval script consolidation (#238)
The script now supports:
- YAML and JSON configurations
- Dataset-specific parameters
- Overriding configuration via command line
- Detailed logging and error handling
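As a rough illustration of how dataset-specific parameters might combine with global defaults, the sketch below parses a JSON configuration and fills in `size` and `seed` for each dataset. The key names (`default_size`, `default_seed`, `categories`, `datasets`, `size`, `seed`, `params`) follow the example configuration in the README diff of this commit. The `resolve_datasets` helper and its fallback rule are assumptions for illustration, not code taken from `eval.py`.

```python
import json

# A JSON rendering of part of the README's example configuration.
CONFIG_JSON = """
{
  "model": "meta-llama/llama-3.3-70b-instruct",
  "default_size": 20,
  "default_seed": 42,
  "categories": [
    {
      "category": "algebra",
      "datasets": [
        {"dataset": "complex_arithmetic", "params": {"min_real": -10, "max_real": 10}}
      ]
    },
    {
      "category": "arithmetic",
      "datasets": [
        {"dataset": "chain_sum", "size": 12, "seed": 43, "params": {"min_digits": 2}},
        {"dataset": "products", "size": 10, "seed": 43, "params": {"min_digits": 2}}
      ]
    }
  ]
}
"""


def resolve_datasets(config):
    """Hypothetical helper: flatten categories into one list of dataset settings.

    Assumption: a dataset entry without its own `size` or `seed` falls back to
    the top-level `default_size` / `default_seed`.
    """
    resolved = []
    for category in config.get("categories", []):
        for entry in category.get("datasets", []):
            resolved.append(
                {
                    "category": category["category"],
                    "dataset": entry["dataset"],
                    "size": entry.get("size", config.get("default_size")),
                    "seed": entry.get("seed", config.get("default_seed")),
                    "params": entry.get("params", {}),
                }
            )
    return resolved


if __name__ == "__main__":
    for item in resolve_datasets(json.loads(CONFIG_JSON)):
        print(item)
```

The same function would work on a YAML configuration once it has been parsed into a dictionary, since only the parsed structure is inspected.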
parent 8a66d2a216
commit 850c1cf6f4
40 changed files with 1111 additions and 670 deletions
174
eval/README.md
@@ -34,108 +34,100 @@ export OPENROUTER_API_KEY=your-api-key
 ```
 
 
-4. Prepare your dataset configuration in YAML format (see examples in `yaml/<model_name>/algorithmic.yaml` e.g `yaml/r1/algorithmic.yaml`):
+4. Prepare your evaluation configuration in YAML or JSON format (see example in `example_config.yaml`):
 
 ```yaml
-model: model-name
-provider: provider-name
-category: category-name
-datasets:
-- dataset1
-- dataset2
-eval_dir: results/model-name
-dataset_size: 50
-dataset_seed: 42
-developer_role: system
+# Example configuration
+model: "meta-llama/llama-3.3-70b-instruct"
+provider: "Hyperbolic" # Optional, can be omitted
+output_dir: "results"
+max_concurrent: 10
+default_size: 20 # Default size for all datasets
+default_seed: 42 # Default seed for all datasets
+
-```
-For example the following file will run an evaluation for deepseek r1 for algorithmic datasets.
-``` yaml
-model: deepseek/deepseek-r1
-provider: Nebius
-category: algorithmic
-datasets:
-- ab
-- base_conversion
-- binary_matrix
-- caesar_cipher
-- count_primes
-- game_of_life
-- graph_color
-- group_anagrams
-- isomorphic_strings
-- letter_counting
-- letter_jumble
-- manipulate_matrix
-- number_filtering
-- number_sorting
-- palindrome
-- pool_matrix
-- ransom_note
-- rotate_matrix
-- sentence_reordering
-- spell_backward
-- spiral_matrix
-- string_insertion
-- string_manipulation
-- string_synthesis
-- word_ladder
-- word_sequence_reversal
-- word_sorting
-eval_dir: results/deepseek-r1
-dataset_size: 50
-dataset_seed: 45
-developer_role: system
+categories:
+  - category: "algebra"
+    datasets:
+      - dataset: "complex_arithmetic"
+        params:
+          min_real: -10
+          max_real: 10
+          min_imag: -10
+          max_imag: 10
+
+  - category: "arithmetic"
+    datasets:
+      - dataset: "chain_sum"
+        size: 12
+        seed: 43
+        params:
+          min_digits: 2
+          allow_negation: true
+
+      - dataset: "products"
+        size: 10
+        seed: 43
+        params:
+          min_digits: 2
+          allow_negation: true
 ```
 
-The following would run Claude 3.5 on the algorithmic dataset.
+For example, to evaluate Claude 3.5 Sonnet on algorithmic datasets:
 
 ```yaml
-model: anthropic/claude-3.5-sonnet
-category: algorithmic
-provider: Anthropic
-datasets:
-- count_primes
-- game_of_life
-- graph_color
-- group_anagrams
-- isomorphic_strings
-- letter_counting
-- letter_jumble
-- manipulate_matrix
-- number_filtering
-- number_sorting
-- palindrome
-- pool_matrix
-- ransom_note
-- rotate_matrix
-- sentence_reordering
-- spell_backward
-- spiral_matrix
-- string_insertion
-- string_manipulation
-- string_synthesis
-- word_ladder
-- word_sequence_reversal
-- word_sorting
-eval_dir: results/claude-3.5-sonnet
-dataset_size: 50
-dataset_seed: 45
-developer_role: system
+model: "anthropic/claude-3.5-sonnet"
+provider: "Anthropic"
+output_dir: "results"
+max_concurrent: 5
+default_size: 50
+default_seed: 45
+
+categories:
+  - category: "algorithmic"
+    datasets:
+      - dataset: "count_primes"
+      - dataset: "game_of_life"
+      - dataset: "graph_color"
+      - dataset: "isomorphic_strings"
+      - dataset: "letter_jumble"
+      - dataset: "rotate_matrix"
+      - dataset: "sentence_reordering"
+      - dataset: "string_manipulation"
+      - dataset: "word_ladder"
+      - dataset: "word_sorting"
 ```
-Here you specify individual model and provider
 
 ### Running Evaluations
 
-To run evaluations
+To run evaluations:
+
+```bash
+python eval.py --config configs/your_config.yaml
 ```
-python eval.py --yaml <path-to yaml file>
+
+For example:
+
+```bash
+python eval.py --config example_config.yaml --full-results
 ```
-e.g
-```
-python eval.py --yaml yaml/r1/algorithmic.yaml
-```
-To run r1 evaluations on algorithmic.yaml
 
 
-The results of individual model on a dataset will be stored in a new folder in the directory E.g `r1/algorithmic/proposition_logic.json`.
-Please upload records of your results to [reasoning-gym-eval](https://github.com/open-thought/reasoning-gym-eval).
+The results will be stored in a directory named after the model and timestamp, containing:
+- `summary.json` - Summary of all results
+- `results.json` - Full results (if `--full-results` is specified)
+- Individual dataset results in category subdirectories
+
+For example:
+```
+results/
+└── meta-llama_llama-3.3-70b-instruct_20250227_162030/
+    ├── summary.json
+    ├── results.json
+    ├── algebra/
+    │   └── complex_arithmetic.json
+    └── arithmetic/
+        ├── chain_sum.json
+        └── products.json
+```
+
+Please upload your results to [reasoning-gym-eval](https://github.com/open-thought/reasoning-gym-eval).
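
The results layout described in the new README text can be sketched as below. The helper names (`run_directory`, `dataset_result_path`) are hypothetical. The naming rule (slashes in the model name replaced by underscores, followed by a `YYYYMMDD_HHMMSS` timestamp) is inferred from the example directory name in the diff, not taken from `eval.py`.

```python
from datetime import datetime
from pathlib import Path


def run_directory(output_dir, model, when):
    """Hypothetical helper: directory for one evaluation run.

    Assumption: "/" in the model name becomes "_" and the timestamp is
    formatted as YYYYMMDD_HHMMSS, matching the README's example.
    """
    safe_model = model.replace("/", "_")
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"{safe_model}_{stamp}"


def dataset_result_path(run_dir, category, dataset):
    """Hypothetical helper: per-dataset result file inside its category subdirectory."""
    return Path(run_dir) / category / f"{dataset}.json"


if __name__ == "__main__":
    run_dir = run_directory(
        "results",
        "meta-llama/llama-3.3-70b-instruct",
        datetime(2025, 2, 27, 16, 20, 30),
    )
    print(run_dir.as_posix())
    print((run_dir / "summary.json").as_posix())
    print(dataset_result_path(run_dir, "arithmetic", "chain_sum").as_posix())
```

With the timestamp used here, the first printed path matches the directory name shown in the README example.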