updated read me

This commit is contained in:
joesharratt1229 2025-02-25 15:46:43 +00:00
parent 7bb791b338
commit 93b95d748b
14 changed files with 193 additions and 438 deletions


```bash
export OPENROUTER_API_KEY=your-api-key
```
4. Prepare your dataset configuration in YAML format (see examples in `yaml/algorithmic.yaml` or `yaml/logic.yaml`):
```yaml
model: model-name
category: category-name
datasets:
- dataset1
- dataset2
eval_dir: eval/r1
dataset_size: 50
dataset_seed: 42
developer_role: system
```
## Usage
For example, the following configuration runs an evaluation of DeepSeek R1 on the algorithmic datasets:
```yaml
model: deepseek/deepseek-r1
category: algorithmic
datasets:
- ab
- base_conversion
- binary_matrix
- caesar_cipher
- count_primes
- game_of_life
- graph_color
- group_anagrams
- isomorphic_strings
- letter_counting
- letter_jumble
- manipulate_matrix
- number_filtering
- number_sorting
- palindrome
- pool_matrix
- ransom_note
- rotate_matrix
- sentence_reordering
- spell_backward
- spiral_matrix
- string_insertion
- string_manipulation
- string_synthesis
- word_ladder
- word_sequence_reversal
- word_sorting
eval_dir: eval/r1
dataset_size: 50
dataset_seed: 45
developer_role: system
```
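The configuration files above are flat apart from the `datasets` list, so it is easy to see how `eval.py` consumes them. As a sketch, a config of this shape can be read with nothing but the standard library (the real script presumably uses a proper YAML parser; this minimal loader is an illustration that only handles the flat key/value-plus-list layout shown above, and leaves numeric values like `dataset_size` as strings):

```python
def load_flat_config(text: str) -> dict:
    """Parse a flat YAML-style config: `key: value` pairs plus `- item` lists."""
    config = {}
    current_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("- "):
            # List item belonging to the most recent key (e.g. `datasets:`)
            config[current_key].append(line[2:].strip())
        else:
            key, _, value = line.partition(":")
            current_key = key.strip()
            value = value.strip()
            # An empty value introduces a list in this config shape
            config[current_key] = [] if value == "" else value
    return config

example = """\
model: deepseek/deepseek-r1
category: algorithmic
datasets:
- ab
- word_sorting
eval_dir: eval/r1
dataset_size: 50
dataset_seed: 45
developer_role: system
"""

cfg = load_flat_config(example)
```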
### Running Evaluations
You can run evaluations in two ways:
1. Using the provided bash script:
```bash
./eval.sh
```
Before running, you may want to edit the `eval.sh` script to configure which models to evaluate by modifying the `MODELS` array.
2. Running the Python script directly:
```bash
python eval.py --yaml <path-to-yaml-file>
```
### Command Line Arguments
- `--yaml`: Path to the YAML configuration file (required); the model, datasets, and output settings are read from this file
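If you need to script around the entry point, a `--yaml` flag like this is typically handled with `argparse`. A hedged sketch of what the argument handling might look like (the actual `eval.py` may define additional options):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI mirroring `python eval.py --yaml <path-to-yaml-file>`."""
    parser = argparse.ArgumentParser(description="Run reasoning-gym evaluations")
    parser.add_argument(
        "--yaml",
        required=True,
        help="Path to the YAML configuration file",
    )
    return parser

# Example invocation matching the usage shown above
args = build_parser().parse_args(["--yaml", "yaml/algorithmic.yaml"])
```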
## Output
The framework generates two types of output files:
1. Detailed results: `evaluation_{model}_{timestamp}.json`
- Contains full response data and scoring for each question
2. Summary: `summary_{model}_{timestamp}.json`
- Contains aggregated metrics for each dataset
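The file-naming pattern above can be reproduced when post-processing results. A sketch of building the two output paths; note that the exact timestamp format is an assumption here, and model identifiers containing slashes (e.g. `deepseek/deepseek-r1`) are sanitized so they do not create nested directories:

```python
from datetime import datetime, timezone

def output_paths(model: str, when: datetime) -> tuple[str, str]:
    """Build the detailed-results and summary filenames for one run."""
    # Slashes in model identifiers would break file paths, so replace them
    safe_model = model.replace("/", "_")
    timestamp = when.strftime("%Y%m%d_%H%M%S")  # assumed timestamp format
    detailed = f"evaluation_{safe_model}_{timestamp}.json"
    summary = f"summary_{safe_model}_{timestamp}.json"
    return detailed, summary

detailed, summary = output_paths(
    "deepseek/deepseek-r1",
    datetime(2025, 2, 25, 15, 46, 43, tzinfo=timezone.utc),
)
```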
## Structure
```
.
├── eval.py       # Main evaluation script
├── eval.sh       # Bash script for running evaluations
├── yaml/         # Dataset configuration files
└── results/      # Output directory (for temporary results)
```
For example, to run the algorithmic evaluation:
```bash
python eval.py --yaml yaml/algorithmic.yaml
```
## Contributing Evaluation Results
After running evaluations:
1. Fork the [reasoning-gym-eval](https://github.com/open-thought/reasoning-gym-eval) repository
2. Add your evaluation results to the appropriate directory
3. Create a pull request with your results
This helps us maintain a clean separation between code and evaluation data while collecting comprehensive benchmarks across different models.