updated read me

This commit is contained in:
joesharratt1229 2025-02-25 15:46:43 +00:00
parent 7bb791b338
commit 93b95d748b
14 changed files with 193 additions and 438 deletions


```bash
export OPENROUTER_API_KEY=your-api-key
```
4. Prepare your dataset configuration in YAML format (see examples in `yaml/algorithmic.yaml` or `yaml/logic.yaml`):
```yaml
model: model-name
category: category-name
datasets:
- dataset1
- dataset2
eval_dir: eval/r1
dataset_size: 50
dataset_seed: 42
developer_role: system
```
## Usage
For example, the following configuration runs an evaluation of DeepSeek R1 on the algorithmic datasets:
```yaml
model: deepseek/deepseek-r1
category: algorithmic
datasets:
- ab
- base_conversion
- binary_matrix
- caesar_cipher
- count_primes
- game_of_life
- graph_color
- group_anagrams
- isomorphic_strings
- letter_counting
- letter_jumble
- manipulate_matrix
- number_filtering
- number_sorting
- palindrome
- pool_matrix
- ransom_note
- rotate_matrix
- sentence_reordering
- spell_backward
- spiral_matrix
- string_insertion
- string_manipulation
- string_synthesis
- word_ladder
- word_sequence_reversal
- word_sorting
eval_dir: eval/r1
dataset_size: 50
dataset_seed: 45
developer_role: system
```
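The configuration files above are flat apart from the `datasets` list, so it is easy to see how `eval.py` consumes them. As a sketch, a config of this shape can be read with nothing but the standard library (the real script presumably uses a proper YAML parser; this minimal loader is an illustration that only handles the flat key/value-plus-list layout shown above, and leaves numeric values like `dataset_size` as strings):

```python
def load_flat_config(text: str) -> dict:
    """Parse a flat YAML-style config: `key: value` pairs plus `- item` lists."""
    config = {}
    current_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("- "):
            # List item belonging to the most recent key (e.g. `datasets:`)
            config[current_key].append(line[2:].strip())
        else:
            key, _, value = line.partition(":")
            current_key = key.strip()
            value = value.strip()
            # An empty value introduces a list in this config shape
            config[current_key] = [] if value == "" else value
    return config

example = """\
model: deepseek/deepseek-r1
category: algorithmic
datasets:
- ab
- word_sorting
eval_dir: eval/r1
dataset_size: 50
dataset_seed: 45
developer_role: system
"""

cfg = load_flat_config(example)
```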
### Running Evaluations
You can run evaluations in two ways:
1. Using the provided bash script:
```bash
./eval.sh
```
Before running, you may want to edit the `eval.sh` script to configure which models to evaluate by modifying the `MODELS` array.
2. Running the Python script directly:
```bash
python eval.py --yaml <path-to-yaml-file>
```
### Command Line Arguments
- `--yaml`: Path to the YAML configuration file (required); the model, datasets, and output settings are read from this file
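If you need to script around the entry point, a `--yaml` flag like this is typically handled with `argparse`. A hedged sketch of what the argument handling might look like (the actual `eval.py` may define additional options):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI mirroring `python eval.py --yaml <path-to-yaml-file>`."""
    parser = argparse.ArgumentParser(description="Run reasoning-gym evaluations")
    parser.add_argument(
        "--yaml",
        required=True,
        help="Path to the YAML configuration file",
    )
    return parser

# Example invocation matching the usage shown above
args = build_parser().parse_args(["--yaml", "yaml/algorithmic.yaml"])
```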
## Output
The framework generates two types of output files:
1. Detailed results: `evaluation_{model}_{timestamp}.json`
- Contains full response data and scoring for each question
2. Summary: `summary_{model}_{timestamp}.json`
- Contains aggregated metrics for each dataset
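The file-naming pattern above can be reproduced when post-processing results. A sketch of building the two output paths; note that the exact timestamp format is an assumption here, and model identifiers containing slashes (e.g. `deepseek/deepseek-r1`) are sanitized so they do not create nested directories:

```python
from datetime import datetime, timezone

def output_paths(model: str, when: datetime) -> tuple[str, str]:
    """Build the detailed-results and summary filenames for one run."""
    # Slashes in model identifiers would break file paths, so replace them
    safe_model = model.replace("/", "_")
    timestamp = when.strftime("%Y%m%d_%H%M%S")  # assumed timestamp format
    detailed = f"evaluation_{safe_model}_{timestamp}.json"
    summary = f"summary_{safe_model}_{timestamp}.json"
    return detailed, summary

detailed, summary = output_paths(
    "deepseek/deepseek-r1",
    datetime(2025, 2, 25, 15, 46, 43, tzinfo=timezone.utc),
)
```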
## Structure
```
.
├── eval.py       # Main evaluation script
├── eval.sh       # Bash script for running evaluations
├── yaml/         # Dataset configuration files
└── results/      # Output directory (for temporary results)
```
For example, to run the algorithmic evaluation:
```bash
python eval.py --yaml yaml/algorithmic.yaml
```
## Contributing Evaluation Results
After running evaluations:
1. Fork the [reasoning-gym-eval](https://github.com/open-thought/reasoning-gym-eval) repository
2. Add your evaluation results to the appropriate directory
3. Create a pull request with your results
This helps us maintain a clean separation between code and evaluation data while collecting comprehensive benchmarks across different models.