# Model Evaluation Framework

A simple asynchronous framework for evaluating language models on reasoning tasks using the OpenRouter API.
## Overview

This framework provides tools to evaluate language models on the reasoning_gym datasets. It supports:

- Concurrent evaluation of multiple questions and datasets
- Customizable dataset configurations
- Automatic result aggregation and summary generation
- Rate limiting for API calls (see the concurrency sketch below)
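The concurrency and rate-limiting behavior can be approximated with a small `asyncio` sketch. Everything below (the `query_model` placeholder and the question list) is illustrative, not the actual `eval.py` implementation:

```python
import asyncio

# Hypothetical sketch: bound the number of in-flight API calls with a
# semaphore, mirroring the framework's --max-concurrent option.
MAX_CONCURRENT = 10
semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def query_model(question: str) -> str:
    # Placeholder for a real OpenRouter API call.
    await asyncio.sleep(0.1)
    return f"answer to: {question}"

async def evaluate_one(question: str) -> str:
    # The semaphore acts as a simple rate limiter: at most
    # MAX_CONCURRENT coroutines are past this point at any moment.
    async with semaphore:
        return await query_model(question)

async def main() -> None:
    questions = [f"question {i}" for i in range(25)]
    answers = await asyncio.gather(*(evaluate_one(q) for q in questions))
    print(len(answers), "answers collected")

if __name__ == "__main__":
    asyncio.run(main())
```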
## Setup

1. Set your OpenRouter API key as an environment variable:

   ```bash
   export OPENROUTER_API_KEY=your-api-key
   ```

2. Prepare your dataset configuration in JSON format (e.g. `eval_basic.json`):

   ```json
   [
       {
           "name": "dataset_name",
           "parameter1": "value1",
           "parameter2": "value2"
       }
   ]
   ```
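Each entry pairs a dataset name with its generation parameters. A minimal sketch of how such a config presumably maps onto reasoning_gym datasets follows; the config handling shown here is an assumption, not the actual `eval.py` logic:

```python
import json

import reasoning_gym

# Load the dataset configurations: each entry names a reasoning_gym
# dataset plus its generation parameters.
with open("eval_basic.json") as f:
    configs = json.load(f)

for cfg in configs:
    name = cfg.pop("name")
    # Assumption: the remaining keys (e.g. size, seed) are forwarded
    # to the dataset constructor via reasoning_gym.create_dataset.
    dataset = reasoning_gym.create_dataset(name, **cfg)
    for entry in dataset:
        print(entry["question"])
```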
## Usage

### Running Evaluations

You can run evaluations in two ways:

1. Using the provided bash script:

   ```bash
   ./run_eval.sh
   ```

2. Running the Python script directly:

   ```bash
   python eval.py --model "model-name" --config "eval_basic.json" --output-dir "results"
   ```
### Command Line Arguments

- `--model`: Model identifier (required)
- `--config`: Path to JSON configuration file (required)
- `--output-dir`: Directory for saving results (default: `results`)
- `--max-concurrent`: Maximum number of concurrent API calls (default: 10)
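For reference, the flags above could be wired up with `argparse` along these lines; this is a sketch of the interface, not the actual `eval.py` source:

```python
import argparse

def parse_args() -> argparse.Namespace:
    # Mirrors the command-line interface documented above.
    parser = argparse.ArgumentParser(
        description="Evaluate a model on reasoning_gym datasets"
    )
    parser.add_argument("--model", required=True, help="Model identifier")
    parser.add_argument("--config", required=True, help="Path to JSON configuration file")
    parser.add_argument("--output-dir", default="results", help="Directory for saving results")
    parser.add_argument("--max-concurrent", type=int, default=10,
                        help="Maximum number of concurrent API calls")
    return parser.parse_args()
```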
## Output

The framework generates two types of output files:

- **Detailed results** (`evaluation_{model}_{timestamp}.json`): contains full response data and scoring for each question
- **Summary** (`summary_{model}_{timestamp}.json`): contains aggregated metrics for each dataset
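A hedged example of post-processing these files: the snippet below loads the most recent summary and prints its contents. The per-dataset layout of the summary is an assumption here, so inspect a real file to confirm its schema.

```python
import glob
import json

# Pick the most recent summary file by sorted filename
# (the timestamp suffix makes lexical order chronological).
latest = sorted(glob.glob("results/summary_*.json"))[-1]
with open(latest) as f:
    summary = json.load(f)

# Assumption: the summary maps dataset names to aggregated metrics.
for dataset, metrics in summary.items():
    print(dataset, metrics)
```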
## Structure

```
.
├── eval.py           # Main evaluation script
├── run_eval.sh       # Bash script for running evaluations
├── eval_basic.json   # Dataset configuration file
└── results/          # Output directory
```