# Model Evaluation Framework

A simple asynchronous framework for evaluating language models on reasoning tasks using the OpenRouter API.

## Overview

This framework provides tools to evaluate language models on reasoning_gym datasets. It supports:

- Concurrent evaluation of multiple questions and datasets
- Customizable dataset configurations
- Automatic result aggregation and summary generation
- Rate limiting for API calls
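
The concurrency and rate-limiting behavior can be sketched with an `asyncio.Semaphore` that caps in-flight API calls. This is a minimal illustration of the pattern, not the framework's actual internals; the function names and the placeholder `sleep` standing in for the OpenRouter call are hypothetical:

```python
import asyncio


async def evaluate_question(semaphore: asyncio.Semaphore, question: str) -> str:
    """Evaluate one question, holding a semaphore slot while the API call runs."""
    async with semaphore:
        # Placeholder for the real OpenRouter API request.
        await asyncio.sleep(0.01)
        return f"answer to: {question}"


async def evaluate_all(questions: list[str], max_concurrent: int = 10) -> list[str]:
    # At most `max_concurrent` coroutines are inside the API call at once;
    # the rest wait on the semaphore, which bounds the request rate.
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [evaluate_question(semaphore, q) for q in questions]
    return await asyncio.gather(*tasks)


results = asyncio.run(evaluate_all([f"q{i}" for i in range(25)], max_concurrent=5))
```

`asyncio.gather` preserves input order, so each result lines up with its question regardless of completion order.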

## Setup

1. Set your OpenRouter API key as an environment variable:

```bash
export OPENROUTER_API_KEY=your-api-key
```

2. Prepare your dataset configuration in JSON format (e.g., `eval_basic.json`):

```json
[
  {
    "name": "dataset_name",
    "parameter1": "value1",
    "parameter2": "value2"
  }
]
```
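
Loading such a config amounts to splitting each entry's `name` from its remaining keys, which become dataset parameters. A sketch using only the standard library; the split into `(name, params)` pairs is an assumption about how the framework consumes the file, and passing the pair to reasoning_gym's `create_dataset` factory (shown in the comment) is likewise illustrative:

```python
import json

config_text = """
[
  {"name": "dataset_name", "parameter1": "value1", "parameter2": "value2"}
]
"""

# Each entry names a dataset; every other key is a construction parameter.
specs = []
for entry in json.loads(config_text):
    params = {k: v for k, v in entry.items() if k != "name"}
    specs.append((entry["name"], params))
    # e.g. reasoning_gym.create_dataset(entry["name"], **params)
```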

## Usage

### Running Evaluations

You can run evaluations in two ways:

1. Using the provided bash script:

```bash
./run_eval.sh
```

2. Running the Python script directly:

```bash
python eval.py --model "model-name" --config "eval_basic.json" --output-dir "results"
```

### Command Line Arguments

- `--model`: Model identifier (required)
- `--config`: Path to JSON configuration file (required)
- `--output-dir`: Directory for saving results (default: "results")
- `--max-concurrent`: Maximum number of concurrent API calls (default: 10)
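
The arguments above map directly onto a standard `argparse` setup. A minimal sketch with the defaults taken from this list (the help strings and parser wiring are illustrative, not a copy of `eval.py`):

```python
import argparse

parser = argparse.ArgumentParser(description="Evaluate a model on reasoning_gym datasets")
parser.add_argument("--model", required=True, help="Model identifier")
parser.add_argument("--config", required=True, help="Path to JSON configuration file")
parser.add_argument("--output-dir", default="results", help="Directory for saving results")
parser.add_argument("--max-concurrent", type=int, default=10,
                    help="Maximum number of concurrent API calls")

# Parse an example command line (the model name here is a placeholder).
args = parser.parse_args(["--model", "some/model", "--config", "eval_basic.json"])
```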

## Output

The framework generates two types of output files:

1. Detailed results: `evaluation_{model}_{timestamp}.json`
   - Contains full response data and scoring for each question

2. Summary: `summary_{model}_{timestamp}.json`
   - Contains aggregated metrics for each dataset
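
Collapsing the detailed results into per-dataset aggregates could look like the sketch below. Only the output filename patterns come from the list above; the per-question record shape and the metric names (`mean_score`, `count`) are assumptions for illustration:

```python
import time
from collections import defaultdict

# Assumed shape of the detailed results: one record per question.
detailed = [
    {"dataset": "dataset_name", "score": 1.0},
    {"dataset": "dataset_name", "score": 0.0},
    {"dataset": "other_dataset", "score": 1.0},
]

# Group scores by dataset, then compute aggregate metrics.
scores = defaultdict(list)
for record in detailed:
    scores[record["dataset"]].append(record["score"])

summary = {
    name: {"mean_score": sum(s) / len(s), "count": len(s)}
    for name, s in scores.items()
}

# Filenames follow the patterns described above.
model, timestamp = "model-name", time.strftime("%Y%m%d_%H%M%S")
summary_path = f"summary_{model}_{timestamp}.json"
```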

## Structure

```
.
├── eval.py           # Main evaluation script
├── run_eval.sh       # Bash script for running evaluations
├── eval_basic.json   # Dataset configuration file
└── results/          # Output directory
```