# Model Evaluation Framework

A simple asynchronous framework for evaluating language models on reasoning tasks using the OpenRouter API.
## Overview

This framework provides tools to evaluate language models on the reasoning_gym datasets. It supports:

- Concurrent evaluation of multiple questions and datasets
- Customizable dataset configurations
- Automatic result aggregation and summary generation
- Rate limiting for API calls (see the concurrency sketch below)
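The concurrency and rate-limiting behavior can be approximated with a small `asyncio` sketch. Everything below (the `query_model` placeholder and the question list) is illustrative, not the actual `eval.py` implementation:

```python
import asyncio

# Hypothetical sketch: bound the number of in-flight API calls with a
# semaphore, mirroring the framework's --max-concurrent option.
MAX_CONCURRENT = 10
semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def query_model(question: str) -> str:
    # Placeholder for a real OpenRouter API call.
    await asyncio.sleep(0.1)
    return f"answer to: {question}"

async def evaluate_one(question: str) -> str:
    # The semaphore acts as a simple rate limiter: at most
    # MAX_CONCURRENT coroutines are past this point at any moment.
    async with semaphore:
        return await query_model(question)

async def main() -> None:
    questions = [f"question {i}" for i in range(25)]
    answers = await asyncio.gather(*(evaluate_one(q) for q in questions))
    print(len(answers), "answers collected")

if __name__ == "__main__":
    asyncio.run(main())
```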
## Setup

1. Set your OpenRouter API key as an environment variable:

   ```bash
   export OPENROUTER_API_KEY=your-api-key
   ```

2. Prepare your dataset configuration in JSON format (e.g. `eval_basic.json`):

   ```json
   [
       {
           "name": "dataset_name",
           "parameter1": "value1",
           "parameter2": "value2"
       }
   ]
   ```
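Each entry pairs a dataset name with its generation parameters. A minimal sketch of how such a config presumably maps onto reasoning_gym datasets follows; the config handling shown here is an assumption, not the actual `eval.py` logic:

```python
import json

import reasoning_gym

# Load the dataset configurations: each entry names a reasoning_gym
# dataset plus its generation parameters.
with open("eval_basic.json") as f:
    configs = json.load(f)

for cfg in configs:
    name = cfg.pop("name")
    # Assumption: the remaining keys (e.g. size, seed) are forwarded
    # to the dataset constructor via reasoning_gym.create_dataset.
    dataset = reasoning_gym.create_dataset(name, **cfg)
    for entry in dataset:
        print(entry["question"])
```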
## Usage

### Running Evaluations

You can run evaluations in two ways:

1. Using the provided bash script:

   ```bash
   ./run_eval.sh
   ```

2. Running the Python script directly:

   ```bash
   python eval.py --model "model-name" --config "eval_basic.json" --output-dir "results"
   ```
### Command Line Arguments

- `--model`: Model identifier (required)
- `--config`: Path to JSON configuration file (required)
- `--output-dir`: Directory for saving results (default: `results`)
- `--max-concurrent`: Maximum number of concurrent API calls (default: 10)
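For reference, the flags above could be wired up with `argparse` along these lines; this is a sketch of the interface, not the actual `eval.py` source:

```python
import argparse

def parse_args() -> argparse.Namespace:
    # Mirrors the command-line interface documented above.
    parser = argparse.ArgumentParser(
        description="Evaluate a model on reasoning_gym datasets"
    )
    parser.add_argument("--model", required=True, help="Model identifier")
    parser.add_argument("--config", required=True, help="Path to JSON configuration file")
    parser.add_argument("--output-dir", default="results", help="Directory for saving results")
    parser.add_argument("--max-concurrent", type=int, default=10,
                        help="Maximum number of concurrent API calls")
    return parser.parse_args()
```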
## Output

The framework generates two types of output files:

- **Detailed results** (`evaluation_{model}_{timestamp}.json`): contains full response data and scoring for each question
- **Summary** (`summary_{model}_{timestamp}.json`): contains aggregated metrics for each dataset
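A hedged example of post-processing these files: the snippet below loads the most recent summary and prints its contents. The per-dataset layout of the summary is an assumption here, so inspect a real file to confirm its schema.

```python
import glob
import json

# Pick the most recent summary file by sorted filename
# (the timestamp suffix makes lexical order chronological).
latest = sorted(glob.glob("results/summary_*.json"))[-1]
with open(latest) as f:
    summary = json.load(f)

# Assumption: the summary maps dataset names to aggregated metrics.
for dataset, metrics in summary.items():
    print(dataset, metrics)
```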
## Structure

```
.
├── eval.py           # Main evaluation script
├── run_eval.sh       # Bash script for running evaluations
├── eval_basic.json   # Dataset configuration file
└── results/          # Output directory
```