mirror of https://github.com/open-thought/reasoning-gym.git
synced 2026-04-19 12:58:07 +00:00

[eval-v1] add a simple readme with some details

parent 9e4870125d
commit fb40c8ca55

1 changed file with 72 additions and 0 deletions
eval/README.md
# Model Evaluation Framework

A simple asynchronous framework for evaluating language models on reasoning tasks using the OpenRouter API.

## Overview

This framework provides tools to evaluate language models on the reasoning_gym datasets. It supports:
- Concurrent evaluation of multiple questions and datasets
- Customizable dataset configurations
- Automatic result aggregation and summary generation
- Rate limiting for API calls
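The combination of concurrency and rate limiting can be sketched with an `asyncio.Semaphore` that caps how many API requests are in flight at once. This is a minimal illustration, not the framework's actual implementation; `eval_one` stands in for a real OpenRouter call:

```python
import asyncio

async def eval_one(question: str, sem: asyncio.Semaphore) -> str:
    # Acquire the semaphore before each call so at most
    # `max_concurrent` requests run at the same time.
    async with sem:
        await asyncio.sleep(0)  # stand-in for the OpenRouter request
        return f"answer to {question}"

async def eval_all(questions, max_concurrent: int = 10):
    sem = asyncio.Semaphore(max_concurrent)
    # gather() runs all questions concurrently and preserves input order
    return await asyncio.gather(*(eval_one(q, sem) for q in questions))

results = asyncio.run(eval_all(["q1", "q2", "q3"]))
```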
## Setup

1. Set your OpenRouter API key as an environment variable:

```bash
export OPENROUTER_API_KEY=your-api-key
```
2. Prepare your dataset configuration in JSON format (e.g., `eval_basic.json`):

```json
[
  {
    "name": "dataset_name",
    "parameter1": "value1",
    "parameter2": "value2"
  }
]
```
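Each entry in this list describes one dataset: the `name` field selects the dataset and the remaining keys are its parameters (`dataset_name`, `parameter1`, etc. are placeholders from the example above). A short Python sketch of how such an entry might be split apart; the splitting logic is an assumption, not necessarily what `eval.py` does:

```python
import json

# Parse a configuration in the shape shown above.
raw = '[{"name": "dataset_name", "parameter1": "value1", "parameter2": "value2"}]'
configs = json.loads(raw)

for entry in configs:
    params = dict(entry)       # copy, so the original entry is untouched
    name = params.pop("name")  # the remaining keys are dataset parameters
    print(name, params)
```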
## Usage

### Running Evaluations

You can run evaluations in two ways:

1. Using the provided bash script:

```bash
./run_eval.sh
```

2. Running the Python script directly:

```bash
python eval.py --model "model-name" --config "eval_basic.json" --output-dir "results"
```
### Command Line Arguments

- `--model`: Model identifier (required)
- `--config`: Path to JSON configuration file (required)
- `--output-dir`: Directory for saving results (default: `results`)
- `--max-concurrent`: Maximum number of concurrent API calls (default: 10)
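The flags above map onto a standard `argparse` parser. This is a sketch mirroring the documented arguments, not necessarily the exact parser in `eval.py`:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # CLI matching the arguments documented above.
    parser = argparse.ArgumentParser(
        description="Evaluate a model on reasoning_gym datasets"
    )
    parser.add_argument("--model", required=True, help="Model identifier")
    parser.add_argument("--config", required=True,
                        help="Path to JSON configuration file")
    parser.add_argument("--output-dir", default="results",
                        help="Directory for saving results")
    parser.add_argument("--max-concurrent", type=int, default=10,
                        help="Maximum number of concurrent API calls")
    return parser

# Example invocation with only the required flags supplied.
args = build_parser().parse_args(["--model", "model-name", "--config", "eval_basic.json"])
```

Note that argparse converts `--output-dir` and `--max-concurrent` into the attributes `args.output_dir` and `args.max_concurrent`.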
## Output

The framework generates two types of output files:

1. Detailed results: `evaluation_{model}_{timestamp}.json`
   - Contains full response data and scoring for each question

2. Summary: `summary_{model}_{timestamp}.json`
   - Contains aggregated metrics for each dataset
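The filename pattern above can be reproduced with a small helper. The timestamp format and the `/` → `_` substitution for model identifiers (OpenRouter ids such as `openai/gpt-4o` contain a slash) are assumptions, not guaranteed to match `eval.py`:

```python
from datetime import datetime

def output_paths(model: str, output_dir: str = "results"):
    # Build the two output filenames described above; timestamp
    # format and model-id sanitization are assumptions.
    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
    safe_model = model.replace("/", "_")
    return (
        f"{output_dir}/evaluation_{safe_model}_{ts}.json",
        f"{output_dir}/summary_{safe_model}_{ts}.json",
    )
```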
## Structure

```
.
├── eval.py          # Main evaluation script
├── run_eval.sh      # Bash script for running evaluations
├── eval_basic.json  # Dataset configuration file
└── results/         # Output directory
```