Eval script consolidation (#238)

The script now supports:
   - YAML and JSON configurations
   - Dataset-specific parameters
   - Overriding configuration via command line
   - Detailed logging and error handling
Andreas Köpf, 2025-02-27 17:39:14 +01:00 · commit 850c1cf6f4 (parent 8a66d2a216)
40 changed files with 1111 additions and 670 deletions


```bash
export OPENROUTER_API_KEY=your-api-key
```
4. Prepare your evaluation configuration in YAML or JSON format (see example in `example_config.yaml`):
```yaml
# Example configuration
model: "meta-llama/llama-3.3-70b-instruct"
provider: "Hyperbolic" # Optional, can be omitted
output_dir: "results"
max_concurrent: 10
default_size: 20 # Default size for all datasets
default_seed: 42 # Default seed for all datasets
categories:
  - category: "algebra"
    datasets:
      - dataset: "complex_arithmetic"
        params:
          min_real: -10
          max_real: 10
          min_imag: -10
          max_imag: 10
  - category: "arithmetic"
    datasets:
      - dataset: "chain_sum"
        size: 12
        seed: 43
        params:
          min_digits: 2
          allow_negation: true
      - dataset: "products"
        size: 10
        seed: 43
        params:
          min_digits: 2
          allow_negation: true
```
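Each dataset entry falls back to the top-level `default_size`/`default_seed` when it does not set its own `size`/`seed` (e.g. `chain_sum` above overrides both). That resolution logic can be sketched roughly as follows (an illustration based on the YAML keys shown above, not the script's actual code):

```python
def resolve_datasets(config: dict):
    """Yield (category, dataset, size, seed, params), with per-dataset
    size/seed falling back to the top-level defaults."""
    for cat in config.get("categories", []):
        for entry in cat.get("datasets", []):
            yield (
                cat["category"],
                entry["dataset"],
                entry.get("size", config.get("default_size")),
                entry.get("seed", config.get("default_seed")),
                entry.get("params", {}),
            )
```

For the example above, `complex_arithmetic` would resolve to size 20 / seed 42 (the defaults), while `chain_sum` keeps its explicit size 12 / seed 43.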
For example, to evaluate Claude 3.5 Sonnet on algorithmic datasets:
```yaml
model: "anthropic/claude-3.5-sonnet"
provider: "Anthropic"
output_dir: "results"
max_concurrent: 5
default_size: 50
default_seed: 45
categories:
  - category: "algorithmic"
    datasets:
      - dataset: "count_primes"
      - dataset: "game_of_life"
      - dataset: "graph_color"
      - dataset: "isomorphic_strings"
      - dataset: "letter_jumble"
      - dataset: "rotate_matrix"
      - dataset: "sentence_reordering"
      - dataset: "string_manipulation"
      - dataset: "word_ladder"
      - dataset: "word_sorting"
```
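Since the script also accepts JSON, the same configuration can be written equivalently as, for instance (abridged to two datasets):

```json
{
  "model": "anthropic/claude-3.5-sonnet",
  "provider": "Anthropic",
  "output_dir": "results",
  "max_concurrent": 5,
  "default_size": 50,
  "default_seed": 45,
  "categories": [
    {
      "category": "algorithmic",
      "datasets": [
        { "dataset": "count_primes" },
        { "dataset": "game_of_life" }
      ]
    }
  ]
}
```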
### Running Evaluations
To run evaluations:
```bash
python eval.py --config configs/your_config.yaml
```
For example:
```bash
python eval.py --config example_config.yaml --full-results
```
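`--config` accepts either YAML or JSON, and configuration values can be overridden from the command line. A rough sketch of how such loading and overriding could work (the function names and `key=value` override syntax here are illustrative assumptions, not the script's actual interface):

```python
import json


def load_config(path: str) -> dict:
    """Load a YAML or JSON configuration file, chosen by file extension."""
    with open(path) as f:
        if path.endswith(".json"):
            return json.load(f)
        import yaml  # third-party: pip install pyyaml
        return yaml.safe_load(f)


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Apply key=value overrides, e.g. ["default_size=50", "provider=Nebius"]."""
    for item in overrides:
        key, _, raw = item.partition("=")
        try:
            value = json.loads(raw)  # parse numbers/booleans/null
        except json.JSONDecodeError:
            value = raw  # leave anything else as a plain string
        config[key] = value
    return config
```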
The results will be stored in a directory named after the model and timestamp, containing:
- `summary.json` - Summary of all results
- `results.json` - Full results (if `--full-results` is specified)
- Individual dataset results in category subdirectories
For example:
```
results/
└── meta-llama_llama-3.3-70b-instruct_20250227_162030/
├── summary.json
├── results.json
├── algebra/
│ └── complex_arithmetic.json
└── arithmetic/
├── chain_sum.json
└── products.json
```
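Given that layout, the per-category result files can be collected with a short directory walk (a sketch assuming the structure shown above):

```python
from pathlib import Path


def collect_results(run_dir: str) -> dict[str, list[str]]:
    """Map each category subdirectory to its dataset result file stems."""
    results: dict[str, list[str]] = {}
    for path in sorted(Path(run_dir).glob("*/*.json")):
        results.setdefault(path.parent.name, []).append(path.stem)
    return results
```

Note that the top-level `summary.json` and `results.json` are not matched by the `*/*.json` pattern, so only per-dataset files are returned.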
Please upload your results to [reasoning-gym-eval](https://github.com/open-thought/reasoning-gym-eval).