mirror of https://github.com/open-thought/reasoning-gym.git
synced 2026-04-19 12:58:07 +00:00
[eval-v1] add a simple readme with some details
This commit is contained in:
parent 9e4870125d
commit fb40c8ca55

1 changed file with 72 additions and 0 deletions
eval/README.md: 72 additions (normal file)
@@ -0,0 +1,72 @@
# Model Evaluation Framework

A simple asynchronous framework for evaluating language models on reasoning tasks using the OpenRouter API.

## Overview

This framework provides tools for evaluating language models on reasoning_gym datasets. It supports:

- Concurrent evaluation of multiple questions and datasets
- Customizable dataset configurations
- Automatic result aggregation and summary generation
- Rate limiting for API calls (see the concurrency sketch after this list)

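The concurrency and rate-limiting behaviour can be pictured with a short sketch. This is not the code in `eval.py`; it only illustrates the general pattern of capping in-flight OpenRouter requests with a semaphore. The helper names (`answer_one`, `answer_all`) are made up for the example, and the `openai` client is used only because OpenRouter exposes an OpenAI-compatible endpoint.

```python
# Illustrative sketch only -- not the implementation in eval.py.
# Assumes OPENROUTER_API_KEY is set as described in Setup below.
import asyncio
import os

from openai import AsyncOpenAI  # OpenRouter exposes an OpenAI-compatible API

client = AsyncOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)


async def answer_one(question: str, model: str, sem: asyncio.Semaphore) -> str:
    # The semaphore caps the number of in-flight requests, which is the
    # simple form of rate limiting referred to above.
    async with sem:
        response = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        )
        return response.choices[0].message.content


async def answer_all(questions: list[str], model: str, max_concurrent: int = 10) -> list[str]:
    sem = asyncio.Semaphore(max_concurrent)
    return await asyncio.gather(*(answer_one(q, model, sem) for q in questions))
```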
## Setup

1. Set your OpenRouter API key as an environment variable:

```bash
export OPENROUTER_API_KEY=your-api-key
```

2. Prepare your dataset configuration in JSON format (e.g., `eval_basic.json`):

```json
[
  {
    "name": "dataset_name",
    "parameter1": "value1",
    "parameter2": "value2"
  }
]
```
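Each entry in the configuration names one reasoning_gym dataset plus whatever parameters that dataset accepts (`parameter1`/`parameter2` above are placeholders). As a rough sketch, assuming the top-level `reasoning_gym.create_dataset` factory, such a file could be expanded into dataset objects as follows; the framework's own loading code may differ:

```python
# Rough sketch of expanding eval_basic.json into reasoning_gym datasets.
# Not the loader used by eval.py; shown only to clarify the config format.
import json

import reasoning_gym

with open("eval_basic.json") as f:
    config = json.load(f)

datasets = {}
for entry in config:
    # Everything except "name" is passed through as a dataset parameter.
    params = {k: v for k, v in entry.items() if k != "name"}
    datasets[entry["name"]] = reasoning_gym.create_dataset(entry["name"], **params)

# Each dataset yields question/answer entries that can later be scored.
for name, dataset in datasets.items():
    for item in dataset:
        print(name, item["question"])
```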
## Usage

### Running Evaluations

You can run evaluations in two ways:

1. Using the provided bash script:

```bash
./run_eval.sh
```

2. Running the Python script directly:

```bash
python eval.py --model "model-name" --config "eval_basic.json" --output-dir "results"
```
### Command Line Arguments

- `--model`: Model identifier (required)
- `--config`: Path to JSON configuration file (required)
- `--output-dir`: Directory for saving results (default: "results")
- `--max-concurrent`: Maximum number of concurrent API calls (default: 10)

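For orientation, these flags map onto a standard `argparse` interface roughly like the sketch below; how `eval.py` actually declares them may differ in detail.

```python
# Sketch of the documented command-line interface (defaults as listed above).
import argparse

parser = argparse.ArgumentParser(description="Evaluate a model on reasoning_gym datasets")
parser.add_argument("--model", required=True, help="Model identifier (e.g. an OpenRouter model slug)")
parser.add_argument("--config", required=True, help="Path to the JSON dataset configuration")
parser.add_argument("--output-dir", default="results", help="Directory for saving results")
parser.add_argument("--max-concurrent", type=int, default=10, help="Maximum number of concurrent API calls")
args = parser.parse_args()
```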
## Output

The framework generates two types of output files:

1. Detailed results: `evaluation_{model}_{timestamp}.json`
   - Contains full response data and scoring for each question

2. Summary: `summary_{model}_{timestamp}.json`
   - Contains aggregated metrics for each dataset

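The exact fields inside these files are not documented here, so the snippet below only shows how a summary might be inspected; the filename and the per-dataset dictionary layout are assumptions, not a guaranteed schema.

```python
# Assumption: the summary JSON maps each dataset name to its aggregated metrics.
# The filename is an example following the summary_{model}_{timestamp}.json pattern.
import json

with open("results/summary_example-model_2025-01-01T00-00-00.json") as f:
    summary = json.load(f)

for dataset_name, metrics in summary.items():
    print(dataset_name, metrics)
```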
## Structure

```
.
├── eval.py          # Main evaluation script
├── run_eval.sh      # Bash script for running evaluations
├── eval_basic.json  # Dataset configuration file
└── results/         # Output directory
```