Model Evaluation Framework

A simple asynchronous framework for evaluating language models on reasoning tasks using the OpenRouter API.

Overview

This framework provides tools to evaluate language models on the reasoning_gym datasets. It supports:

  • Concurrent evaluation of multiple questions and datasets
  • Customizable dataset configurations
  • Automatic result aggregation and summary generation
  • Rate limiting for API calls
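The concurrency and rate-limiting behavior above can be sketched with an asyncio semaphore. This is an illustrative pattern, not eval.py's exact implementation; `score_question` is a hypothetical stand-in for the real API call and scoring logic.

```python
import asyncio

async def score_question(question: str) -> int:
    # Placeholder for an OpenRouter API call plus answer scoring.
    await asyncio.sleep(0)
    return len(question)

async def evaluate(questions, max_concurrent: int = 10):
    # A semaphore caps the number of in-flight API calls, giving
    # simple client-side rate limiting.
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(q):
        async with sem:
            return await score_question(q)

    # Questions run concurrently, but at most max_concurrent at a time.
    return await asyncio.gather(*(bounded(q) for q in questions))

scores = asyncio.run(evaluate(["2+2?", "capital of France?"]))
```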

Setup

  1. Set your OpenRouter API key as an environment variable:
export OPENROUTER_API_KEY=your-api-key
  2. Prepare your dataset configuration in JSON format (e.g., eval_basic.json):
[
  {
    "name": "dataset_name",
    "parameter1": "value1",
    "parameter2": "value2"
  }
]
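Each entry's "name" field selects a dataset, and the remaining keys are dataset-specific parameters. A minimal sketch of how such a configuration can be loaded and split apart (the key names mirror the generic example above; real configs use reasoning_gym dataset names and their actual parameters):

```python
import json

# Parse a configuration shaped like eval_basic.json.
config = json.loads("""
[
  {
    "name": "dataset_name",
    "parameter1": "value1",
    "parameter2": "value2"
  }
]
""")

for entry in config:
    # "name" identifies the dataset; everything else is passed
    # through as that dataset's parameters.
    params = {k: v for k, v in entry.items() if k != "name"}
    print(entry["name"], params)
```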

Usage

Running Evaluations

You can run evaluations in two ways:

  1. Using the provided bash script:
./run_eval.sh
  2. Running the Python script directly:
python eval.py --model "model-name" --config "eval_basic.json" --output-dir "results"

Command Line Arguments

  • --model: Model identifier (required)
  • --config: Path to JSON configuration file (required)
  • --output-dir: Directory for saving results (default: "results")
  • --max-concurrent: Maximum number of concurrent API calls (default: 10)
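A minimal argparse sketch matching the flags listed above; eval.py's actual parser may differ in help text and details.

```python
import argparse

parser = argparse.ArgumentParser(
    description="Evaluate a model on reasoning_gym datasets"
)
parser.add_argument("--model", required=True, help="Model identifier")
parser.add_argument("--config", required=True,
                    help="Path to JSON configuration file")
parser.add_argument("--output-dir", default="results",
                    help="Directory for saving results")
parser.add_argument("--max-concurrent", type=int, default=10,
                    help="Maximum number of concurrent API calls")

# Example invocation with only the required flags; defaults fill the rest.
args = parser.parse_args(["--model", "model-name",
                          "--config", "eval_basic.json"])
```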

Output

The framework generates two types of output files:

  1. Detailed results: evaluation_{model}_{timestamp}.json

    • Contains full response data and scoring for each question
  2. Summary: summary_{model}_{timestamp}.json

    • Contains aggregated metrics for each dataset
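The two filenames above can be derived as in the following sketch. The timestamp format and the slash-replacement for model identifiers are assumptions for illustration, not necessarily what eval.py does.

```python
from datetime import datetime
from pathlib import Path

def output_paths(output_dir: str, model: str) -> tuple:
    # Timestamp format is an assumed convention.
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    # Model ids like "org/model" contain slashes; flatten them
    # so they are safe in a filename.
    safe_model = model.replace("/", "_")
    base = Path(output_dir)
    detailed = base / f"evaluation_{safe_model}_{timestamp}.json"
    summary = base / f"summary_{safe_model}_{timestamp}.json"
    return detailed, summary

detailed, summary = output_paths("results", "org/model")
```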

Structure

.
├── eval.py              # Main evaluation script
├── run_eval.sh          # Bash script for running evaluations
├── eval_basic.json      # Dataset configuration file
└── results/             # Output directory