[eval-v1] add a simple readme with some details

Author: rishabhranawat
Date: 2025-02-10 21:57:00 -08:00
Parent: 9e4870125d
Commit: fb40c8ca55

eval/README.md (new file, 72 lines)
# Model Evaluation Framework
A simple asynchronous framework for evaluating language models on reasoning tasks using the OpenRouter API.
## Overview
This framework provides tools to evaluate language models on datasets from the reasoning_gym library. It supports:
- Concurrent evaluation of multiple questions and datasets
- Customizable dataset configurations
- Automatic result aggregation and summary generation
- Rate limiting for API calls
## Setup
1. Set your OpenRouter API key as an environment variable:
```bash
export OPENROUTER_API_KEY=your-api-key
```
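The script presumably reads this variable at startup; a minimal sketch of such a check (the helper name `load_api_key` is hypothetical, not part of the framework):

```python
import os

def load_api_key() -> str:
    # Hypothetical helper: fail fast with a clear message if the
    # OPENROUTER_API_KEY environment variable is missing or empty.
    key = os.environ.get("OPENROUTER_API_KEY", "")
    if not key:
        raise RuntimeError("OPENROUTER_API_KEY is not set; see Setup above")
    return key
```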
2. Prepare your dataset configuration in JSON format (e.g., `eval_basic.json`):
```json
[
{
"name": "dataset_name",
"parameter1": "value1",
"parameter2": "value2"
}
]
```
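Loading such a configuration amounts to parsing a JSON list of dataset entries; a minimal sketch under that assumption (`load_dataset_configs` is a hypothetical name, not the framework's actual function):

```python
import json

def load_dataset_configs(path: str) -> list[dict]:
    # Each entry names a dataset plus its parameters, as in the
    # eval_basic.json example above.
    with open(path) as f:
        configs = json.load(f)
    if not isinstance(configs, list):
        raise ValueError("config file must contain a JSON list of datasets")
    return configs
```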
## Usage
### Running Evaluations
You can run evaluations in two ways:
1. Using the provided bash script:
```bash
./run_eval.sh
```
2. Running the Python script directly:
```bash
python eval.py --model "model-name" --config "eval_basic.json" --output-dir "results"
```
### Command Line Arguments
- `--model`: Model identifier (required)
- `--config`: Path to JSON configuration file (required)
- `--output-dir`: Directory for saving results (default: "results")
- `--max-concurrent`: Maximum number of concurrent API calls (default: 10)
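The `--max-concurrent` limit is the kind of bound an `asyncio.Semaphore` enforces; a sketch of how concurrent API calls might be capped this way (the function names and the simulated call are illustrative, not the framework's actual code):

```python
import asyncio

async def call_model(question: str, semaphore: asyncio.Semaphore) -> str:
    # Stand-in for the real OpenRouter API call; the semaphore ensures
    # at most max_concurrent calls are in flight at once.
    async with semaphore:
        await asyncio.sleep(0.01)  # simulate network latency
        return f"answer to: {question}"

async def evaluate(questions: list[str], max_concurrent: int = 10) -> list[str]:
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [call_model(q, semaphore) for q in questions]
    return await asyncio.gather(*tasks)

answers = asyncio.run(evaluate([f"q{i}" for i in range(25)], max_concurrent=5))
```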
## Output
The framework generates two types of output files:
1. Detailed results: `evaluation_{model}_{timestamp}.json`
- Contains full response data and scoring for each question
2. Summary: `summary_{model}_{timestamp}.json`
- Contains aggregated metrics for each dataset
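Aggregating per-question scores into per-dataset metrics could look like the following sketch (the field names `dataset`, `score`, `mean_score`, and `num_questions` are assumptions, not the framework's actual schema):

```python
from statistics import mean

def summarize(results: list[dict]) -> dict:
    # Group per-question scores by dataset and report the mean score
    # and question count for each.
    by_dataset: dict[str, list[float]] = {}
    for r in results:
        by_dataset.setdefault(r["dataset"], []).append(r["score"])
    return {
        name: {"mean_score": mean(scores), "num_questions": len(scores)}
        for name, scores in by_dataset.items()
    }
```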
## Structure
```
.
├── eval.py # Main evaluation script
├── run_eval.sh # Bash script for running evaluations
├── eval_basic.json # Dataset configuration file
└── results/ # Output directory
```