# Model Evaluation Framework

A simple asynchronous framework for evaluating language models on reasoning tasks using the OpenRouter API.

## Overview

This framework provides tools to evaluate language models on reasoning_gym datasets. It supports:

- Concurrent evaluation of multiple questions and datasets
- Customizable dataset configurations
- Automatic result aggregation and summary generation
- Rate limiting for API calls
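
The concurrency and rate-limiting behavior can be sketched with an `asyncio.Semaphore` that caps in-flight API calls. This is a minimal illustration of the pattern, not the framework's actual internals; the function names and the placeholder `sleep` standing in for the OpenRouter call are hypothetical:

```python
import asyncio


async def evaluate_question(semaphore: asyncio.Semaphore, question: str) -> str:
    """Evaluate one question, holding a semaphore slot while the API call runs."""
    async with semaphore:
        # Placeholder for the real OpenRouter API request.
        await asyncio.sleep(0.01)
        return f"answer to: {question}"


async def evaluate_all(questions: list[str], max_concurrent: int = 10) -> list[str]:
    # At most `max_concurrent` coroutines are inside the API call at once;
    # the rest wait on the semaphore, which bounds the request rate.
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [evaluate_question(semaphore, q) for q in questions]
    return await asyncio.gather(*tasks)


results = asyncio.run(evaluate_all([f"q{i}" for i in range(25)], max_concurrent=5))
```

`asyncio.gather` preserves input order, so each result lines up with its question regardless of completion order.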

## Setup

1. Set your OpenRouter API key as an environment variable:

```bash
export OPENROUTER_API_KEY=your-api-key
```

2. Prepare your dataset configuration in JSON format (e.g., `eval_basic.json`):

```json
[
  {
    "name": "dataset_name",
    "parameter1": "value1",
    "parameter2": "value2"
  }
]
```
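
Loading such a config amounts to splitting each entry's `name` from its remaining keys, which become dataset parameters. A sketch using only the standard library; the split into `(name, params)` pairs is an assumption about how the framework consumes the file, and passing the pair to reasoning_gym's `create_dataset` factory (shown in the comment) is likewise illustrative:

```python
import json

config_text = """
[
  {"name": "dataset_name", "parameter1": "value1", "parameter2": "value2"}
]
"""

# Each entry names a dataset; every other key is a construction parameter.
specs = []
for entry in json.loads(config_text):
    params = {k: v for k, v in entry.items() if k != "name"}
    specs.append((entry["name"], params))
    # e.g. reasoning_gym.create_dataset(entry["name"], **params)
```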

## Usage

### Running Evaluations

You can run evaluations in two ways:

1. Using the provided bash script:

```bash
./run_eval.sh
```

2. Running the Python script directly:

```bash
python eval.py --model "model-name" --config "eval_basic.json" --output-dir "results"
```

### Command Line Arguments

- `--model`: Model identifier (required)
- `--config`: Path to JSON configuration file (required)
- `--output-dir`: Directory for saving results (default: "results")
- `--max-concurrent`: Maximum number of concurrent API calls (default: 10)
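
The arguments above map directly onto a standard `argparse` setup. A minimal sketch with the defaults taken from this list (the help strings and parser wiring are illustrative, not a copy of `eval.py`):

```python
import argparse

parser = argparse.ArgumentParser(description="Evaluate a model on reasoning_gym datasets")
parser.add_argument("--model", required=True, help="Model identifier")
parser.add_argument("--config", required=True, help="Path to JSON configuration file")
parser.add_argument("--output-dir", default="results", help="Directory for saving results")
parser.add_argument("--max-concurrent", type=int, default=10,
                    help="Maximum number of concurrent API calls")

# Parse an example command line (the model name here is a placeholder).
args = parser.parse_args(["--model", "some/model", "--config", "eval_basic.json"])
```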

## Output

The framework generates two types of output files:

1. Detailed results: `evaluation_{model}_{timestamp}.json`
   - Contains full response data and scoring for each question

2. Summary: `summary_{model}_{timestamp}.json`
   - Contains aggregated metrics for each dataset
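
Collapsing the detailed results into per-dataset aggregates could look like the sketch below. Only the output filename patterns come from the list above; the per-question record shape and the metric names (`mean_score`, `count`) are assumptions for illustration:

```python
import time
from collections import defaultdict

# Assumed shape of the detailed results: one record per question.
detailed = [
    {"dataset": "dataset_name", "score": 1.0},
    {"dataset": "dataset_name", "score": 0.0},
    {"dataset": "other_dataset", "score": 1.0},
]

# Group scores by dataset, then compute aggregate metrics.
scores = defaultdict(list)
for record in detailed:
    scores[record["dataset"]].append(record["score"])

summary = {
    name: {"mean_score": sum(s) / len(s), "count": len(s)}
    for name, s in scores.items()
}

# Filenames follow the patterns described above.
model, timestamp = "model-name", time.strftime("%Y%m%d_%H%M%S")
summary_path = f"summary_{model}_{timestamp}.json"
```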

## Structure

```
.
├── eval.py           # Main evaluation script
├── run_eval.sh       # Bash script for running evaluations
├── eval_basic.json   # Dataset configuration file
└── results/          # Output directory
```