From fb40c8ca55f58eb68edbd3fcb47c9a2f7c6c4e71 Mon Sep 17 00:00:00 2001
From: rishabhranawat
Date: Mon, 10 Feb 2025 21:57:00 -0800
Subject: [PATCH] [eval-v1] add a simple readme with some details

---
 eval/README.md | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)
 create mode 100644 eval/README.md

diff --git a/eval/README.md b/eval/README.md
new file mode 100644
index 00000000..74cf4b86
--- /dev/null
+++ b/eval/README.md
@@ -0,0 +1,72 @@
+# Model Evaluation Framework
+
+A simple asynchronous framework for evaluating language models on reasoning tasks using the OpenRouter API.
+
+## Overview
+
+This framework provides tools to evaluate language models on the reasoning_gym datasets. It supports:
+
+- Concurrent evaluation of multiple questions and datasets
+- Customizable dataset configurations
+- Automatic result aggregation and summary generation
+- Rate limiting for API calls
+
+## Setup
+
+1. Set your OpenRouter API key as an environment variable:
+
+```bash
+export OPENROUTER_API_KEY=your-api-key
+```
+
+2. Prepare your dataset configuration in JSON format (e.g., `eval_basic.json`):
+
+```json
+[
+    {
+        "name": "dataset_name",
+        "parameter1": "value1",
+        "parameter2": "value2"
+    }
+]
+```
+
+## Usage
+
+### Running Evaluations
+
+You can run evaluations in two ways:
+
+1. Using the provided bash script:
+
+```bash
+./run_eval.sh
+```
+
+2. Running the Python script directly:
+
+```bash
+python eval.py --model "model-name" --config "eval_basic.json" --output-dir "results"
+```
+
+### Command Line Arguments
+
+- `--model`: Model identifier (required)
+- `--config`: Path to JSON configuration file (required)
+- `--output-dir`: Directory for saving results (default: "results")
+- `--max-concurrent`: Maximum number of concurrent API calls (default: 10)
+
+## Output
+
+The framework generates two types of output files:
+
+1. Detailed results: `evaluation_{model}_{timestamp}.json`
+   - Contains full response data and scoring for each question
+
+2. Summary: `summary_{model}_{timestamp}.json`
+   - Contains aggregated metrics for each dataset
+
+## Structure
+
+```
+.
+├── eval.py          # Main evaluation script
+├── run_eval.sh      # Bash script for running evaluations
+├── eval_basic.json  # Dataset configuration file
+└── results/         # Output directory
+```
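The README's concurrency and rate-limiting behavior (`--max-concurrent`, default 10) can be sketched with the standard asyncio semaphore pattern. This is a minimal illustration of the general technique, not `eval.py`'s actual implementation; the function names and result fields here are assumptions.

```python
import asyncio

async def evaluate_question(semaphore: asyncio.Semaphore, question: str) -> dict:
    """Evaluate one question while holding a slot from the semaphore."""
    async with semaphore:
        # A real implementation would await an HTTP call to the
        # OpenRouter API here; we only simulate the await point.
        await asyncio.sleep(0)
        return {"question": question, "score": 1.0}

async def evaluate_dataset(questions: list[str], max_concurrent: int = 10) -> list[dict]:
    """Run all questions concurrently, with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [evaluate_question(semaphore, q) for q in questions]
    return await asyncio.gather(*tasks)

if __name__ == "__main__":
    results = asyncio.run(evaluate_dataset(["2 + 2 = ?", "3 * 3 = ?"], max_concurrent=2))
    print(len(results))
```

The semaphore caps in-flight API calls without serializing them, which is why a single `--max-concurrent` flag can serve as both a throughput and a rate-limiting control.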
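The summary file described above aggregates per-question results into per-dataset metrics. A plausible sketch of that aggregation step, assuming each detailed result carries a `dataset` name and a numeric `score` (field names are illustrative, not the framework's actual schema):

```python
import json
from collections import defaultdict

def summarize(results: list[dict]) -> dict:
    """Aggregate per-question scores into count and mean score per dataset."""
    by_dataset: dict[str, list[float]] = defaultdict(list)
    for result in results:
        by_dataset[result["dataset"]].append(result["score"])
    return {
        name: {"count": len(scores), "mean_score": sum(scores) / len(scores)}
        for name, scores in by_dataset.items()
    }

if __name__ == "__main__":
    detailed = [
        {"dataset": "arithmetic", "score": 1.0},
        {"dataset": "arithmetic", "score": 0.0},
        {"dataset": "logic", "score": 1.0},
    ]
    print(json.dumps(summarize(detailed), indent=2))
```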