mirror of
https://github.com/open-thought/reasoning-gym.git
synced 2026-04-24 17:05:03 +00:00
updated read me
This commit is contained in:
parent
7bb791b338
commit
93b95d748b
14 changed files with 193 additions and 438 deletions
eval/README.md (112 changes)
@@ -33,68 +33,66 @@ pip install -r requirements-eval.txt
export OPENROUTER_API_KEY=your-api-key
```

4. Prepare your dataset configuration in JSON format (e.g., `eval_basic.json`):

```json
[
  {
    "name": "dataset_name",
    "parameter1": "value1",
    "parameter2": "value2"
  }
]
```

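Such a config is straightforward to consume; a minimal sketch in Python (the `load_dataset_configs` helper is hypothetical, not the framework's actual loader):

```python
import json


def load_dataset_configs(path):
    """Read a list of dataset configs; each entry must at least name a dataset."""
    with open(path) as f:
        configs = json.load(f)
    for entry in configs:
        if "name" not in entry:
            raise ValueError(f"dataset entry missing 'name': {entry}")
    return configs
```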
||||
## Usage
4. Prepare your dataset configuration in YAML format (see examples in `yaml/algorithmic.yaml` or `yaml/logic.yaml`):

```yaml
model: model-name
category: category-name
datasets:
  - dataset1
  - dataset2
eval_dir: eval/r1
dataset_size: 50
dataset_seed: 42
developer_role: system
```
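Once parsed into a dictionary (e.g. with PyYAML's `yaml.safe_load`), the config can be sanity-checked before a run is launched. A minimal sketch using the field names from the example above (the `validate_config` helper is hypothetical, not part of the framework):

```python
# Field names taken from the example YAML config above.
REQUIRED_KEYS = {"model", "category", "datasets", "eval_dir", "dataset_size", "dataset_seed"}


def validate_config(cfg: dict) -> dict:
    """Check required keys are present and datasets is a non-empty list."""
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"missing config keys: {sorted(missing)}")
    if not isinstance(cfg["datasets"], list) or not cfg["datasets"]:
        raise ValueError("datasets must be a non-empty list")
    return cfg
```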

For example, the following file will run an evaluation of `deepseek/deepseek-r1` on the algorithmic datasets:

```yaml
model: deepseek/deepseek-r1
category: algorithmic
datasets:
  - ab
  - base_conversion
  - binary_matrix
  - caesar_cipher
  - count_primes
  - game_of_life
  - graph_color
  - group_anagrams
  - isomorphic_strings
  - letter_counting
  - letter_jumble
  - manipulate_matrix
  - number_filtering
  - number_sorting
  - palindrome
  - pool_matrix
  - ransom_note
  - rotate_matrix
  - sentence_reordering
  - spell_backward
  - spiral_matrix
  - string_insertion
  - string_manipulation
  - string_synthesis
  - word_ladder
  - word_sequence_reversal
  - word_sorting
eval_dir: eval/r1
dataset_size: 50
dataset_seed: 45
developer_role: system
```

### Running Evaluations

You can run evaluations in two ways:

1. Using the provided bash script:

```bash
./eval.sh
```

Before running, you may want to edit the `eval.sh` script to configure which models to evaluate by modifying the `MODELS` array.

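The `MODELS` array is a plain bash array; an illustrative sketch of what that part of `eval.sh` might contain (the model identifiers are examples, not the script's actual list):

```shell
# Illustrative MODELS array as it might appear in eval.sh --
# the model IDs below are examples, not the repository's actual list.
MODELS=(
  "deepseek/deepseek-r1"
  "qwen/qwq-32b"
)

# Loop over each configured model and launch its evaluation.
for MODEL in "${MODELS[@]}"; do
  echo "evaluating $MODEL"
done
```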
2. Running the Python script directly:

```bash
python eval.py --model "model-name" --config "eval_basic.json" --output-dir "results"
python eval.py --yaml <path-to-yaml-file>
```

### Command Line Arguments

- `--model`: Model identifier (required)
- `--config`: Path to JSON configuration file (required)
- `--output-dir`: Directory for saving results (default: "results")
- `--max-concurrent`: Maximum number of concurrent API calls (default: 10)

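These flags map naturally onto `argparse`; a hedged sketch of what the parser might look like (not copied from `eval.py`):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented command line arguments."""
    parser = argparse.ArgumentParser(description="Evaluate models on reasoning-gym datasets")
    parser.add_argument("--model", required=True, help="Model identifier")
    parser.add_argument("--config", required=True, help="Path to JSON configuration file")
    parser.add_argument("--output-dir", default="results", help="Directory for saving results")
    parser.add_argument("--max-concurrent", type=int, default=10, help="Maximum concurrent API calls")
    return parser
```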
## Output

The framework generates two types of output files:

1. Detailed results: `evaluation_{model}_{timestamp}.json`
   - Contains full response data and scoring for each question

2. Summary: `summary_{model}_{timestamp}.json`
   - Contains aggregated metrics for each dataset

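The summary step can be pictured as a small reduction over per-question scores; a sketch (the actual fields in `summary_{model}_{timestamp}.json` may differ):

```python
from statistics import mean


def summarize(results: dict) -> dict:
    """Aggregate per-question scores into per-dataset metrics.

    `results` maps a dataset name to its list of per-question scores in [0, 1].
    """
    return {
        name: {"count": len(scores), "mean_score": mean(scores)}
        for name, scores in results.items()
    }
```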
## Structure

e.g.:

```
.
├── eval.py          # Main evaluation script
├── run_eval.sh      # Bash script for running evaluations
├── eval_basic.json  # Dataset configuration file
└── results/         # Output directory (for temporary results)
```

Example invocation:

```bash
python eval.py --yaml yaml/algorithmic.yaml
```

## Contributing Evaluation Results

After running evaluations:

1. Fork the [reasoning-gym-eval](https://github.com/open-thought/reasoning-gym-eval) repository
2. Add your evaluation results to the appropriate directory
3. Create a pull request with your results

This helps us maintain a clean separation between code and evaluation data while collecting comprehensive benchmarks across different models.