# Model Evaluation Framework

A simple asynchronous framework for evaluating language models on reasoning tasks using the OpenRouter API.

## Evaluation Results Repository

To keep the main repo free of evaluation traces from different models, all evaluation results are stored in a separate repository: [reasoning-gym-eval](https://github.com/open-thought/reasoning-gym-eval).

If you run evaluations and want to contribute your results, please create a pull request in the [reasoning-gym-eval](https://github.com/open-thought/reasoning-gym-eval) repository, not in the main reasoning-gym repo.

## Overview

This framework provides tools to evaluate language models on the reasoning_gym datasets. It supports:

- Concurrent evaluation of multiple questions and datasets
- Customizable dataset configurations
- Automatic result aggregation and summary generation
- Rate limiting for API calls

## Setup

1. Install reasoning-gym in development mode:

   ```bash
   pip install -e ..
   ```

2. Install the additional dependencies required for evaluation:

   ```bash
   pip install -r requirements-eval.txt
   ```

3. Set your API key (if required by the API). For OpenRouter, you can set it as an environment variable:

   ```bash
   export OPENROUTER_API_KEY=your-api-key
   ```

   Or provide it directly when running the script:

   ```bash
   python eval.py --config your_config.yaml --api-key your-api-key
   ```

   Note: an API key is optional for some APIs (e.g., local deployments).

4. Prepare your evaluation configuration in YAML or JSON format (see example in `example_config.yaml`):

   ```yaml
   # Example configuration
   model: "meta-llama/llama-3.3-70b-instruct"
   provider: "Hyperbolic"  # Optional, can be omitted
   output_dir: "results"
   max_concurrent: 10
   default_size: 20  # Default size for all datasets
   default_seed: 42  # Default seed for all datasets
   max_tokens: 32768  # Maximum generation length (optional)
   temperature: 0.6  # Generation temperature (optional)
   top_p: 0.95  # Top-p sampling parameter (optional)
   system_prompt_id: "default"  # Use a predefined system prompt by ID (optional)
   # system_prompt: "Your custom system prompt here"  # Or specify a custom system prompt directly

   categories:
     - category: "algebra"
       datasets:
         - dataset: "complex_arithmetic"
           params:
             min_real: -10
             max_real: 10
             min_imag: -10
             max_imag: 10
     - category: "arithmetic"
       datasets:
         - dataset: "chain_sum"
           size: 12
           seed: 43
           params:
             min_digits: 2
             allow_negation: true
         - dataset: "products"
           size: 10
           seed: 43
           params:
             min_digits: 2
             allow_negation: true
   ```

For example, to evaluate Claude 3.5 Sonnet on algorithmic datasets:

```yaml
model: "anthropic/claude-3.5-sonnet"
provider: "Anthropic"
output_dir: "results"
max_concurrent: 5
default_size: 50
default_seed: 45
categories:
  - category: "algorithmic"
    datasets:
      - dataset: "count_primes"
      - dataset: "game_of_life"
      - dataset: "graph_color"
      - dataset: "isomorphic_strings"
      - dataset: "letter_jumble"
      - dataset: "rotate_matrix"
      - dataset: "sentence_reordering"
      - dataset: "string_manipulation"
      - dataset: "word_ladder"
      - dataset: "word_sorting"
```
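Each `dataset` entry in a config maps to a registered reasoning_gym dataset, so you can preview what a configured dataset will generate before spending API credits on a run. Below is a minimal sketch mirroring the `chain_sum` entry from the first example above; `create_dataset` and `score_answer` are standard reasoning-gym calls, and the loop is purely illustrative:

```python
import reasoning_gym

# Same parameters as the `chain_sum` entry in the example config above;
# extra keyword arguments become dataset configuration parameters.
data = reasoning_gym.create_dataset(
    "chain_sum", size=12, seed=43, min_digits=2, allow_negation=True
)

for i, entry in enumerate(data):
    print(f"{i}: q={entry['question']!r} a={entry['answer']!r}")
    # The dataset's own verifier; the reference answer should score 1.0.
    assert data.score_answer(answer=entry["answer"], entry=entry) == 1.0
```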
### Generating Configurations

You can generate a configuration file with all registered datasets using the `generate_config.py` script:

```bash
python generate_config.py --output my_config.yaml --model "anthropic/claude-3.5-sonnet" --provider "Anthropic" --size 50 --seed 42
```

Options:

- `--output`: Output YAML file path (default: all_datasets.yaml)
- `--model`: Model name (default: openai/gpt-4)
- `--provider`: Provider name (default: None)
- `--size`: Default dataset size (default: 100)
- `--seed`: Default dataset seed (default: 42)
- `--include-params`: Include all configuration parameters (default: False)
- `--category`: Only include datasets from this category (default: None)

#### Generating Config for a Specific Category

To generate a configuration file containing only datasets from a specific category:

```bash
python generate_config.py --category algorithmic --output algorithmic_datasets.yaml --model "anthropic/claude-3.5-sonnet"
```

This creates a configuration file that includes only datasets in the "algorithmic" category, which is useful when you want to focus your evaluation on a specific type of reasoning task. Example categories include math, arithmetic, reasoning, and algorithmic. The category is automatically extracted from the dataset's module name (e.g., from `reasoning_gym.math.dataset_name`, it extracts "math"). To see all available categories, run the script without the `--category` option; it prints every category at the end of execution.

### Running Evaluations

```bash
python eval.py --config configs/your_config.yaml
```

For example:

```bash
python eval.py --config example_config.yaml --full-results
```

You can specify a different API base URL if needed:

```bash
python eval.py --config example_config.yaml --base-url "https://api.together.xyz/v1" --api-key "your-together-api-key"
```

The results will be stored in a directory named after the model and timestamp, containing:

- `summary.json` - Summary of all results
- `results.json` - Full results (if `--full-results` is specified)
- Individual dataset results in category subdirectories

For example:

```
results/
└── meta-llama_llama-3.3-70b-instruct_20250227_162030/
    ├── summary.json
    ├── results.json
    ├── algebra/
    │   └── complex_arithmetic.json
    └── arithmetic/
        ├── chain_sum.json
        └── products.json
```

Please upload your results to [reasoning-gym-eval](https://github.com/open-thought/reasoning-gym-eval).
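Before opening a pull request with new results, you can sanity-check a finished run locally. `summary.json` is plain JSON, so no extra tooling is needed; here is a minimal sketch (the run directory name is illustrative, and the exact fields depend on what your version of `eval.py` writes):

```python
import json
from pathlib import Path

# Substitute the directory created by your own run.
run_dir = Path("results/meta-llama_llama-3.3-70b-instruct_20250227_162030")

summary = json.loads((run_dir / "summary.json").read_text())
print(json.dumps(summary, indent=2))  # pretty-print whatever eval.py recorded
```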