Eval script consolidation (#238)

The script now supports:
   - YAML and JSON configurations
   - Dataset-specific parameters
   - Overriding configuration via command line
   - Detailed logging and error handling
Andreas Köpf, 2025-02-27 17:39:14 +01:00 · commit 850c1cf6f4 (parent 8a66d2a216)
40 changed files with 1111 additions and 670 deletions


```bash
export OPENROUTER_API_KEY=your-api-key
```
4. Prepare your evaluation configuration in YAML or JSON format (see example in `example_config.yaml`):
```yaml
# Example configuration
model: "meta-llama/llama-3.3-70b-instruct"
provider: "Hyperbolic" # Optional, can be omitted
output_dir: "results"
max_concurrent: 10
default_size: 20 # Default size for all datasets
default_seed: 42 # Default seed for all datasets
categories:
  - category: "algebra"
    datasets:
      - dataset: "complex_arithmetic"
        params:
          min_real: -10
          max_real: 10
          min_imag: -10
          max_imag: 10
  - category: "arithmetic"
    datasets:
      - dataset: "chain_sum"
        size: 12
        seed: 43
        params:
          min_digits: 2
          allow_negation: true
      - dataset: "products"
        size: 10
        seed: 43
        params:
          min_digits: 2
          allow_negation: true
```
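Each dataset entry falls back to the top-level `default_size`/`default_seed` when it does not set its own `size`/`seed` (e.g. `chain_sum` above overrides both). That resolution logic can be sketched roughly as follows (an illustration based on the YAML keys shown above, not the script's actual code):

```python
def resolve_datasets(config: dict):
    """Yield (category, dataset, size, seed, params), with per-dataset
    size/seed falling back to the top-level defaults."""
    for cat in config.get("categories", []):
        for entry in cat.get("datasets", []):
            yield (
                cat["category"],
                entry["dataset"],
                entry.get("size", config.get("default_size")),
                entry.get("seed", config.get("default_seed")),
                entry.get("params", {}),
            )
```

For the example above, `complex_arithmetic` would resolve to size 20 / seed 42 (the defaults), while `chain_sum` keeps its explicit size 12 / seed 43.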
For example, to evaluate Claude 3.5 Sonnet on algorithmic datasets:
```yaml
model: "anthropic/claude-3.5-sonnet"
provider: "Anthropic"
output_dir: "results"
max_concurrent: 5
default_size: 50
default_seed: 45
categories:
  - category: "algorithmic"
    datasets:
      - dataset: "count_primes"
      - dataset: "game_of_life"
      - dataset: "graph_color"
      - dataset: "isomorphic_strings"
      - dataset: "letter_jumble"
      - dataset: "rotate_matrix"
      - dataset: "sentence_reordering"
      - dataset: "string_manipulation"
      - dataset: "word_ladder"
      - dataset: "word_sorting"
```
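Since the script also accepts JSON, the same configuration can be written equivalently as, for instance (abridged to two datasets):

```json
{
  "model": "anthropic/claude-3.5-sonnet",
  "provider": "Anthropic",
  "output_dir": "results",
  "max_concurrent": 5,
  "default_size": 50,
  "default_seed": 45,
  "categories": [
    {
      "category": "algorithmic",
      "datasets": [
        { "dataset": "count_primes" },
        { "dataset": "game_of_life" }
      ]
    }
  ]
}
```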
### Running Evaluations
To run evaluations:
```bash
python eval.py --config configs/your_config.yaml
```
For example:
```bash
python eval.py --config example_config.yaml --full-results
```
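`--config` accepts either YAML or JSON, and configuration values can be overridden from the command line. A rough sketch of how such loading and overriding could work (the function names and `key=value` override syntax here are illustrative assumptions, not the script's actual interface):

```python
import json


def load_config(path: str) -> dict:
    """Load a YAML or JSON configuration file, chosen by file extension."""
    with open(path) as f:
        if path.endswith(".json"):
            return json.load(f)
        import yaml  # third-party: pip install pyyaml
        return yaml.safe_load(f)


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Apply key=value overrides, e.g. ["default_size=50", "provider=Nebius"]."""
    for item in overrides:
        key, _, raw = item.partition("=")
        try:
            value = json.loads(raw)  # parse numbers/booleans/null
        except json.JSONDecodeError:
            value = raw  # leave anything else as a plain string
        config[key] = value
    return config
```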
The results will be stored in a directory named after the model and timestamp, containing:
- `summary.json` - Summary of all results
- `results.json` - Full results (if `--full-results` is specified)
- Individual dataset results in category subdirectories
For example:
```
results/
└── meta-llama_llama-3.3-70b-instruct_20250227_162030/
├── summary.json
├── results.json
├── algebra/
│ └── complex_arithmetic.json
└── arithmetic/
├── chain_sum.json
└── products.json
```
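Given that layout, the per-category result files can be collected with a short directory walk (a sketch assuming the structure shown above):

```python
from pathlib import Path


def collect_results(run_dir: str) -> dict[str, list[str]]:
    """Map each category subdirectory to its dataset result file stems."""
    results: dict[str, list[str]] = {}
    for path in sorted(Path(run_dir).glob("*/*.json")):
        results.setdefault(path.parent.name, []).append(path.stem)
    return results
```

Note that the top-level `summary.json` and `results.json` are not matched by the `*/*.json` pattern, so only per-dataset files are returned.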
Please upload your results to [reasoning-gym-eval](https://github.com/open-thought/reasoning-gym-eval).