From fb40c8ca55f58eb68edbd3fcb47c9a2f7c6c4e71 Mon Sep 17 00:00:00 2001
From: rishabhranawat
Date: Mon, 10 Feb 2025 21:57:00 -0800
Subject: [PATCH] [eval-v1] add a simple readme with some details

---
 eval/README.md | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)
 create mode 100644 eval/README.md

diff --git a/eval/README.md b/eval/README.md
new file mode 100644
index 00000000..74cf4b86
--- /dev/null
+++ b/eval/README.md
@@ -0,0 +1,72 @@
+# Model Evaluation Framework
+
+A simple asynchronous framework for evaluating language models on reasoning tasks using the OpenRouter API.
+
+## Overview
+
+This framework provides tools to evaluate language models on the reasoning_gym datasets. It supports:
+
+- Concurrent evaluation of multiple questions and datasets
+- Customizable dataset configurations
+- Automatic result aggregation and summary generation
+- Rate limiting for API calls
+
+## Setup
+
+1. Set your OpenRouter API key as an environment variable:
+
+```bash
+export OPENROUTER_API_KEY=your-api-key
+```
+
+2. Prepare your dataset configuration in JSON format (e.g., `eval_basic.json`):
+
+```json
+[
+    {
+        "name": "dataset_name",
+        "parameter1": "value1",
+        "parameter2": "value2"
+    }
+]
+```
+
+## Usage
+
+### Running Evaluations
+
+You can run evaluations in two ways:
+
+1. Using the provided bash script:
+
+```bash
+./run_eval.sh
+```
+
+2. Running the Python script directly:
+
+```bash
+python eval.py --model "model-name" --config "eval_basic.json" --output-dir "results"
+```
+
+### Command Line Arguments
+
+- `--model`: Model identifier (required)
+- `--config`: Path to JSON configuration file (required)
+- `--output-dir`: Directory for saving results (default: "results")
+- `--max-concurrent`: Maximum number of concurrent API calls (default: 10)
+
+## Output
+
+The framework generates two types of output files:
+
+1. Detailed results: `evaluation_{model}_{timestamp}.json`
+   - Contains full response data and scoring for each question
+
+2. Summary: `summary_{model}_{timestamp}.json`
+   - Contains aggregated metrics for each dataset
+
+## Structure
+
+```
+.
+├── eval.py          # Main evaluation script
+├── run_eval.sh      # Bash script for running evaluations
+├── eval_basic.json  # Dataset configuration file
+└── results/         # Output directory
+```
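The README's concurrency and rate-limiting behavior (`--max-concurrent`, default 10) can be sketched with the standard asyncio semaphore pattern. This is a minimal illustration of the general technique, not `eval.py`'s actual implementation; the function names and result fields here are assumptions.

```python
import asyncio

async def evaluate_question(semaphore: asyncio.Semaphore, question: str) -> dict:
    """Evaluate one question while holding a slot from the semaphore."""
    async with semaphore:
        # A real implementation would await an HTTP call to the
        # OpenRouter API here; we only simulate the await point.
        await asyncio.sleep(0)
        return {"question": question, "score": 1.0}

async def evaluate_dataset(questions: list[str], max_concurrent: int = 10) -> list[dict]:
    """Run all questions concurrently, with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [evaluate_question(semaphore, q) for q in questions]
    return await asyncio.gather(*tasks)

if __name__ == "__main__":
    results = asyncio.run(evaluate_dataset(["2 + 2 = ?", "3 * 3 = ?"], max_concurrent=2))
    print(len(results))
```

The semaphore caps in-flight API calls without serializing them, which is why a single `--max-concurrent` flag can serve as both a throughput and a rate-limiting control.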
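The summary file described above aggregates per-question results into per-dataset metrics. A plausible sketch of that aggregation step, assuming each detailed result carries a `dataset` name and a numeric `score` (field names are illustrative, not the framework's actual schema):

```python
import json
from collections import defaultdict

def summarize(results: list[dict]) -> dict:
    """Aggregate per-question scores into count and mean score per dataset."""
    by_dataset: dict[str, list[float]] = defaultdict(list)
    for result in results:
        by_dataset[result["dataset"]].append(result["score"])
    return {
        name: {"count": len(scores), "mean_score": sum(scores) / len(scores)}
        for name, scores in by_dataset.items()
    }

if __name__ == "__main__":
    detailed = [
        {"dataset": "arithmetic", "score": 1.0},
        {"dataset": "arithmetic", "score": 0.0},
        {"dataset": "logic", "score": 1.0},
    ]
    print(json.dumps(summarize(detailed), indent=2))
```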