Eval script consolidation (#238)

The script now supports:
   - YAML and JSON configurations
   - Dataset-specific parameters
   - Overriding configuration via command line
   - Detailed logging and error handling
Andreas Köpf authored on 2025-02-27 17:39:14 +01:00 (committed via GitHub)
parent 8a66d2a216
commit 850c1cf6f4
40 changed files with 1111 additions and 670 deletions


@@ -34,108 +34,100 @@ export OPENROUTER_API_KEY=your-api-key
```
4. Prepare your evaluation configuration in YAML or JSON format (see example in `example_config.yaml`):
```yaml
# Example configuration
model: "meta-llama/llama-3.3-70b-instruct"
provider: "Hyperbolic" # Optional, can be omitted
output_dir: "results"
max_concurrent: 10
default_size: 20 # Default size for all datasets
default_seed: 42 # Default seed for all datasets
categories:
  - category: "algebra"
    datasets:
      - dataset: "complex_arithmetic"
        params:
          min_real: -10
          max_real: 10
          min_imag: -10
          max_imag: 10
  - category: "arithmetic"
    datasets:
      - dataset: "chain_sum"
        size: 12
        seed: 43
        params:
          min_digits: 2
          allow_negation: true
      - dataset: "products"
        size: 10
        seed: 43
        params:
          min_digits: 2
          allow_negation: true
```
For example, to evaluate Claude 3.5 Sonnet on algorithmic datasets:
```yaml
model: "anthropic/claude-3.5-sonnet"
provider: "Anthropic"
output_dir: "results"
max_concurrent: 5
default_size: 50
default_seed: 45
categories:
  - category: "algorithmic"
    datasets:
      - dataset: "count_primes"
      - dataset: "game_of_life"
      - dataset: "graph_color"
      - dataset: "isomorphic_strings"
      - dataset: "letter_jumble"
      - dataset: "rotate_matrix"
      - dataset: "sentence_reordering"
      - dataset: "string_manipulation"
      - dataset: "word_ladder"
      - dataset: "word_sorting"
```
### Running Evaluations
To run evaluations:
```bash
python eval.py --config configs/your_config.yaml
```
For example:
```bash
python eval.py --config example_config.yaml --full-results
```
The results will be stored in a directory named after the model and timestamp, containing:
- `summary.json` - Summary of all results
- `results.json` - Full results (if `--full-results` is specified)
- Individual dataset results in category subdirectories
For example:
```
results/
└── meta-llama_llama-3.3-70b-instruct_20250227_162030/
├── summary.json
├── results.json
├── algebra/
│ └── complex_arithmetic.json
└── arithmetic/
├── chain_sum.json
└── products.json
```
Please upload your results to [reasoning-gym-eval](https://github.com/open-thought/reasoning-gym-eval).
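Once a run finishes, the per-dataset scores in `summary.json` can be inspected programmatically. The sketch below is illustrative only and not part of this commit; `rank_datasets` and the sample dict are hypothetical, with field names taken from the summary format described above:

```python
import json

def rank_datasets(summary: dict, n: int = 3) -> list[tuple[str, float]]:
    """Return the n highest-scoring datasets from a loaded summary.json dict."""
    scores = summary["dataset_scores"]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Synthetic summary with the same shape as the format described above
sample = {
    "total_datasets": 3,
    "total_examples": 150,
    "dataset_scores": {"chain_sum": 0.9, "products": 0.7, "complex_arithmetic": 0.8},
}
print(rank_datasets(sample, n=2))  # → [('chain_sum', 0.9), ('complex_arithmetic', 0.8)]

# With a real run, load the file first, e.g.:
# summary = json.load(open("results/<model>_<timestamp>/summary.json"))
```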


@@ -1,16 +1,40 @@
#!/usr/bin/env python
"""
Evaluation script for reasoning gym datasets.

This script evaluates LLM performance on reasoning gym datasets using the OpenRouter API.

Usage:
    python eval.py --config config.yaml [options]

Options:
    --model MODEL          Override model specified in config
    --output-dir DIR       Override output directory specified in config
    --max-concurrent NUM   Maximum number of concurrent API calls
    --save-metadata        Save entry metadata in results
    --full-results         Save the full results file
    --verbose              Print detailed model responses
    --debug                Enable debug logging

Environment variables:
    OPENROUTER_API_KEY     Required API key for OpenRouter
"""

import argparse
import asyncio
import json
import logging
import os
import subprocess
import sys
from collections import OrderedDict
from datetime import datetime
from pathlib import Path
from typing import Any, Union

from eval_config import CategoryConfig, DatasetConfig, EvalConfig
from openai import AsyncOpenAI
from tqdm.asyncio import tqdm_asyncio

import reasoning_gym
from reasoning_gym.utils import extract_answer
@@ -22,130 +46,473 @@ logging.basicConfig(
    handlers=[logging.StreamHandler()],
)

# httpx logging will be configured in the AsyncModelEvaluator class
# based on the debug flag


def get_git_hash() -> str:
    """Get current git hash for reproducibility."""
    cmd = ["git", "rev-parse", "HEAD"]
    try:
        return subprocess.check_output(cmd, text=True, stderr=subprocess.PIPE).strip()
    except Exception:
        return "unknown"


class AsyncModelEvaluator:
    """Evaluates models on reasoning datasets with async API calls via OpenRouter."""

    def __init__(self, config: EvalConfig, verbose: bool = False, debug: bool = False):
        """Initialize the evaluator with configuration.

        Args:
            config: Evaluation configuration
            verbose: Whether to print detailed model responses
            debug: Whether to enable debug logging
        """
        self.config = config
        self.verbose = verbose
        self.debug = debug

        # Set up logging
        self.logger = logging.getLogger("AsyncModelEvaluator")
        if debug:
            self.logger.setLevel(logging.DEBUG)
            # Enable httpx logs in debug mode
            logging.getLogger("httpx").setLevel(logging.INFO)
        else:
            # Suppress httpx logs in normal mode
            logging.getLogger("httpx").setLevel(logging.WARNING)

        # Set up OpenRouter API client
        api_key = os.getenv("OPENROUTER_API_KEY")
        if not api_key:
            raise ValueError("OPENROUTER_API_KEY environment variable is not set")

        self.client = AsyncOpenAI(base_url="https://openrouter.ai/api/v1", api_key=api_key)

        # Concurrency control
        self.semaphore = asyncio.Semaphore(config.max_concurrent)

        # Metadata
        self.git_hash = get_git_hash()
        self.start_time = datetime.now()

    async def get_model_response(self, prompt: str) -> str:
        """Get response from model with retry logic via OpenRouter.

        Args:
            prompt: The prompt to send to the model

        Returns:
            The model's response text

        Raises:
            Exception: If all retries fail
        """
        max_retries = 10
        base_delay = 1.0
        max_delay = 60.0
        backoff_factor = 2.0

        for attempt in range(max_retries):
            try:
                async with self.semaphore:
                    # Prepare API call parameters
                    params = {
                        "model": self.config.model,
                        "messages": [
                            {"role": self.config.system_role, "content": self.config.system_prompt},
                            {"role": "user", "content": prompt},
                        ],
                    }

                    # Add provider configuration if specified
                    if self.config.provider:
                        params["extra_body"] = {"provider": {"order": [self.config.provider], "allow_fallbacks": False}}

                    completion = await self.client.chat.completions.create(**params)
                    response = completion.choices[0].message.content

                    if self.verbose:
                        self.logger.info(f"Prompt: {prompt}")
                        self.logger.info(f"Response: {response}")

                    return response
            except Exception as e:
                delay = min(max_delay, base_delay * (backoff_factor**attempt))
                self.logger.warning(f"Attempt {attempt+1}/{max_retries} failed: {str(e)}")
                self.logger.warning(f"Retrying in {delay:.2f} seconds...")
                await asyncio.sleep(delay)

        raise Exception(f"Failed to get model response after {max_retries} attempts")

    async def process_entry(
        self, dataset: reasoning_gym.dataset.ProceduralDataset, entry: dict[str, Any]
    ) -> dict[str, Any]:
        """Process a single dataset entry.

        Args:
            dataset: The dataset instance
            entry: The entry to process

        Returns:
            Dict with processing results
        """
        try:
            response = await self.get_model_response(entry["question"])
            model_answer = extract_answer(response)
            score = dataset.score_answer(answer=model_answer, entry=entry)

            if self.verbose:
                print(f"Question: {entry['question']}")
                print(f"Expected: {entry['answer']}")
                print(f"Answer: {model_answer}")
                print(f"Score: {score}")
                print("-" * 40)

            result = {
                "question": entry["question"],
                "expected_answer": str(entry["answer"]),
                "model_answer": model_answer,
                "full_model_response": response,
                "score": score,
            }

            # Only include metadata if configured to do so
            if self.config.save_metadata:
                result["metadata"] = entry["metadata"]

            return result
        except Exception as e:
            self.logger.error(f"Error processing entry: {str(e)}")
            result = {
                "question": entry["question"],
                "expected_answer": str(entry["answer"]),
                "model_answer": "ERROR",
                "full_model_response": f"Error: {str(e)}",
                "score": 0.0,
                "error": str(e),
            }

            # Only include metadata if configured to do so
            if self.config.save_metadata:
                result["metadata"] = entry["metadata"]

            return result

    async def evaluate_dataset(self, category_name: str, dataset_config: DatasetConfig) -> dict[str, Any]:
        """Evaluate a single dataset.

        Args:
            category_name: Name of the category
            dataset_config: Configuration for the dataset

        Returns:
            Dict with evaluation results
        """
        dataset_name = dataset_config.dataset
        self.logger.info(f"Evaluating dataset: {dataset_name}")

        try:
            # Create dataset with all parameters
            dataset_params = {}

            # Add all parameters from the config params dictionary
            # Make sure we don't have a nested 'params' dictionary
            for k, v in dataset_config.params.items():
                if k != "params":
                    dataset_params[k] = v
                elif isinstance(v, dict):
                    # If there's a nested params dict, flatten it
                    dataset_params.update(v)

            # Add size and seed if they're not None
            if dataset_config.size is not None:
                dataset_params["size"] = dataset_config.size
            if dataset_config.seed is not None:
                dataset_params["seed"] = dataset_config.seed

            dataset = reasoning_gym.create_dataset(dataset_name, **dataset_params)

            # Get all entries
            all_entries = list(dataset)

            # Process entries with progress bar
            tasks = [self.process_entry(dataset, entry) for entry in all_entries]
            results = await tqdm_asyncio.gather(*tasks, desc=f"Processing {dataset_name}", leave=True)

            # Calculate metrics
            total_score = sum(r["score"] for r in results)
            average_score = total_score / len(results) if results else 0

            return {
                "name": dataset_name,
                "category": category_name,
                "average_score": average_score,
                "total_examples": len(results),
                "config": {"size": dataset_config.size, "seed": dataset_config.seed, **dataset_config.params},
                "results": results,
            }
        except Exception as e:
            self.logger.error(f"Error evaluating dataset {dataset_name}: {str(e)}")
            return {
                "name": dataset_name,
                "category": category_name,
                "average_score": 0.0,
                "total_examples": 0,
                "config": {"size": dataset_config.size, "seed": dataset_config.seed, **dataset_config.params},
                "error": str(e),
                "results": [],
            }

    async def evaluate_category(self, category_config: CategoryConfig) -> dict[str, Any]:
        """Evaluate all datasets in a category.

        Args:
            category_config: Configuration for the category

        Returns:
            Dict with category evaluation results
        """
        category_name = category_config.category
        self.logger.info(f"Evaluating category: {category_name}")

        tasks = [self.evaluate_dataset(category_name, dataset_config) for dataset_config in category_config.datasets]
        dataset_results = await asyncio.gather(*tasks)

        return {
            "name": category_name,
            "datasets": dataset_results,
        }

    async def evaluate_all(self) -> dict[str, Any]:
        """Evaluate all categories and datasets.

        Returns:
            Dict with all evaluation results and summary
        """
        self.logger.info(f"Starting evaluation of {len(self.config.categories)} categories")

        tasks = [self.evaluate_category(category) for category in self.config.categories]
        category_results = await asyncio.gather(*tasks)

        # Generate results structure
        results = {
            "metadata": {
                "timestamp": self.start_time.isoformat(),
                "model": self.config.model,
                "provider": self.config.provider,
                "git_hash": self.git_hash,
                "duration_seconds": (datetime.now() - self.start_time).total_seconds(),
            },
            "categories": category_results,
        }

        # Generate summary
        results["summary"] = self.generate_summary(results)

        return results

    def generate_summary(self, results: dict[str, Any]) -> dict[str, Union[int, OrderedDict]]:
        """Generate a summary of evaluation results in the original configuration order.

        Args:
            results: The full evaluation results

        Returns:
            Dict with summary information
        """
        summary = {
            "total_datasets": 0,
            "total_examples": 0,
            "dataset_scores": OrderedDict(),
        }

        # Iterate through categories and datasets in the original order from config
        for category_config in self.config.categories:
            for dataset_config in category_config.datasets:
                dataset_name = dataset_config.dataset
                dataset_found = False

                # Find corresponding results
                for category in results["categories"]:
                    if category["name"] == category_config.category:
                        for dataset in category["datasets"]:
                            if dataset["name"] == dataset_name:
                                # Add to summary in original order
                                summary["dataset_scores"][dataset_name] = dataset["average_score"]
                                summary["total_datasets"] += 1
                                summary["total_examples"] += dataset["total_examples"]
                                dataset_found = True
                                break

                # If dataset wasn't found in results (error), add with score 0
                if not dataset_found:
                    summary["dataset_scores"][dataset_name] = 0.0
                    summary["total_datasets"] += 1

        return summary

    def save_results(self, results: dict[str, Any]) -> tuple[str, str]:
        """Save evaluation results to files.

        Args:
            results: The evaluation results to save

        Returns:
            Tuple of (results_path, summary_path)
        """
        # Create output directory with timestamp
        timestamp = self.start_time.strftime("%Y%m%d_%H%M%S")
        model_name = self.config.model.replace("/", "_")

        # Format directory name with model and timestamp only
        output_dir = Path(self.config.output_dir) / f"{model_name}_{timestamp}"
        output_dir.mkdir(parents=True, exist_ok=True)

        results_path = None
        # Save full results if configured to do so
        if self.config.save_full_results:
            results_path = output_dir / "results.json"
            with open(results_path, "w") as f:
                json.dump(results, f, indent=2)

        # Add timestamp, git hash, model, provider, and duration to summary
        summary_data = results["summary"].copy()
        summary_data["timestamp"] = self.start_time.isoformat()
        summary_data["git_hash"] = self.git_hash
        summary_data["model"] = self.config.model
        summary_data["provider"] = self.config.provider
        summary_data["duration_seconds"] = results["metadata"]["duration_seconds"]

        # Save summary
        summary_path = output_dir / "summary.json"
        with open(summary_path, "w") as f:
            json.dump(summary_data, f, indent=2)

        # Save individual dataset results
        for category in results["categories"]:
            category_dir = output_dir / category["name"]
            category_dir.mkdir(exist_ok=True)

            for dataset in category["datasets"]:
                dataset_path = category_dir / f"{dataset['name']}.json"
                with open(dataset_path, "w") as f:
                    json.dump(dataset, f, indent=2)

        return str(results_path) if results_path else None, str(summary_path)

    def print_summary(self, results: dict[str, Any]) -> None:
        """Print a summary of evaluation results to the console.

        Args:
            results: The evaluation results
        """
        summary = results["summary"]

        print("\nEvaluation Summary:")
        print("------------------")
        print(f"Model: {self.config.model}")
        print(f"Provider: {self.config.provider}")
        print(f"Git Hash: {self.git_hash}")
        print(f"Duration: {results['metadata']['duration_seconds']:.2f} seconds")
        print()
        print("Dataset Scores (in configuration order):")

        for dataset_name, score in summary["dataset_scores"].items():
            # Find the number of examples for this dataset
            examples = 0
            for category in results["categories"]:
                for dataset in category["datasets"]:
                    if dataset["name"] == dataset_name:
                        examples = dataset["total_examples"]
                        break
            print(f"  {dataset_name}: {score:.1%} ({examples} examples)")

        print()
        print(f"Total datasets: {summary['total_datasets']}")
        print(f"Total examples: {summary['total_examples']}")


async def main_async():
    """Main async function."""
    parser = argparse.ArgumentParser(description="Evaluate models on reasoning datasets")
    parser.add_argument("--config", required=True, help="Path to configuration file (YAML or JSON)")
    parser.add_argument("--model", help="Override model specified in config")
    parser.add_argument("--output-dir", help="Override output directory specified in config")
    parser.add_argument("--max-concurrent", type=int, help="Maximum number of concurrent API calls")
    parser.add_argument("--save-metadata", action="store_true", help="Save entry metadata in results")
    parser.add_argument("--full-results", action="store_true", help="Save the full results file")
    parser.add_argument("--verbose", action="store_true", help="Print detailed model responses")
    parser.add_argument("--debug", action="store_true", help="Enable debug logging")
    args = parser.parse_args()

    # Check for required API key
    if not os.getenv("OPENROUTER_API_KEY"):
        print("Error: OPENROUTER_API_KEY environment variable is not set")
        print("Please set it using: export OPENROUTER_API_KEY=your-api-key")
        return 1

    # Load configuration
    config_path = args.config
    if config_path.endswith(".yaml") or config_path.endswith(".yml"):
        config = EvalConfig.from_yaml(config_path)
    elif config_path.endswith(".json"):
        config = EvalConfig.from_json(config_path)
    else:
        print("Error: Configuration file must be YAML or JSON")
        return 1

    # Apply command line overrides
    if args.model:
        config.model = args.model
    if args.output_dir:
        config.output_dir = args.output_dir
    if args.max_concurrent:
        config.max_concurrent = args.max_concurrent
    if args.save_metadata:
        config.save_metadata = True
    if args.full_results:
        config.save_full_results = True

    # Create evaluator
    evaluator = AsyncModelEvaluator(config=config, verbose=args.verbose, debug=args.debug)

    # Run evaluation
    try:
        results = await evaluator.evaluate_all()

        # Save and print results
        results_path, summary_path = evaluator.save_results(results)
        evaluator.print_summary(results)

        if results_path:
            print(f"\nResults saved to: {results_path}")
        print(f"Summary saved to: {summary_path}")
        return 0
    except Exception as e:
        print(f"Error during evaluation: {str(e)}")
        if args.debug:
            import traceback

            traceback.print_exc()
        return 1


def main():
    """Entry point."""
    exit_code = asyncio.run(main_async())
    sys.exit(exit_code)


if __name__ == "__main__":
    main()


@ -1,25 +1,139 @@
from dataclasses import dataclass
from typing import Union
"""Configuration classes for the evaluation script"""
import json
import re
from dataclasses import dataclass, field
from typing import Any, Optional
import yaml
from reasoning_gym.utils import SYSTEM_PROMPTS
def is_valid_unix_filename(filename: str) -> bool:
"""
Check for shell-safe filenames.
Only allows alphanumeric characters, hyphens, and underscores.
"""
if not filename:
return False
return bool(re.match(r"^[a-zA-Z0-9_-]+$", filename))
@dataclass
class DatasetConfig:
"""Configuration for a specific dataset"""
dataset: str
size: int = 500
seed: Optional[int] = None
# Allow any additional dataset-specific parameters
params: dict[str, Any] = field(default_factory=dict)
@dataclass
class CategoryConfig:
"""Configuration for a category of datasets"""
category: str
datasets: list[DatasetConfig]
@dataclass
class EvalConfig:
category: str
datasets: Union[str, list[str]]
eval_dir: str
dataset_size: int
dataset_seed: int
"""Global evaluation configuration"""
model: str
provider: str
developer_role: str = "system"
developer_prompt: str = SYSTEM_PROMPTS["DeepSeekZero"]
provider: Optional[str] = None
system_prompt: str = SYSTEM_PROMPTS["default"]
system_role: str = "system"
output_dir: str = "results"
max_concurrent: int = 10
default_size: int = 500
default_seed: Optional[int] = None
save_metadata: bool = False
save_full_results: bool = False
categories: list[CategoryConfig] = field(default_factory=list)
@classmethod
def from_yaml(cls, yaml_path: str):
def from_json(cls, json_path: str) -> "EvalConfig":
"""Load configuration from JSON file"""
with open(json_path, "r") as f:
config_data = json.load(f)
return cls._process_config_data(config_data)
@classmethod
def from_yaml(cls, yaml_path: str) -> "EvalConfig":
"""Load configuration from YAML file"""
with open(yaml_path, "r") as f:
config = yaml.safe_load(f)
return cls(**config)
config_data = yaml.safe_load(f)
return cls._process_config_data(config_data)
@classmethod
def _process_config_data(cls, config_data: dict[str, Any]) -> "EvalConfig":
"""Process configuration data from either JSON or YAML"""
# Extract categories
categories_data = config_data.pop("categories", [])
categories = []
for category_data in categories_data:
category_name = category_data.get("category")
if not is_valid_unix_filename(category_name):
raise ValueError(
f"Invalid category name '{category_name}'. Category names must be valid Unix filenames."
)
# Process datasets in this category
datasets_data = category_data.get("datasets", [])
datasets = []
for dataset_data in datasets_data:
# If it's just a string, convert to dict with name
if isinstance(dataset_data, str):
dataset_data = {"name": dataset_data}
# Extract dataset name
dataset_name = dataset_data.get("dataset")
# Extract size and seed with defaults
size = dataset_data.get("size", config_data.get("default_size", 500))
seed = dataset_data.get("seed", config_data.get("default_seed"))
# Extract all other parameters (everything except dataset, size, and seed)
# If there's a nested 'params' dictionary, use its contents directly
params = {}
for k, v in dataset_data.items():
if k not in ["dataset", "size", "seed"]:
if k == "params" and isinstance(v, dict):
# Flatten nested params dictionary
params.update(v)
else:
params[k] = v
# Create dataset config
dataset_config = DatasetConfig(
dataset=dataset_name,
size=size,
seed=seed,
params=params,
)
datasets.append(dataset_config)
# Create category config
category_config = CategoryConfig(category=category_name, datasets=datasets)
categories.append(category_config)
# Create main config
return cls(
model=config_data.get("model"),
provider=config_data.get("provider", "openai"),
system_prompt=config_data.get("system_prompt", SYSTEM_PROMPTS["default"]),
system_role=config_data.get("system_role", "system"),
output_dir=config_data.get("output_dir", "results"),
max_concurrent=config_data.get("max_concurrent", 10),
save_metadata=config_data.get("save_metadata", False),
save_full_results=config_data.get("save_full_results", False),
categories=categories,
)
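The dataset-level parameter handling above can be exercised in isolation. This is a self-contained sketch of just the flattening rule, using plain dicts so it runs without `reasoning_gym`; `extract_dataset_params` is a hypothetical helper mirroring the loop in `_process_config_data`:

```python
def extract_dataset_params(dataset_data: dict) -> dict:
    """Collect dataset kwargs the way _process_config_data does:
    keys other than dataset/size/seed are treated as parameters,
    and a nested 'params' dict is flattened into the same mapping."""
    params = {}
    for k, v in dataset_data.items():
        if k not in ["dataset", "size", "seed"]:
            if k == "params" and isinstance(v, dict):
                params.update(v)  # flatten the nested params dict
            else:
                params[k] = v  # inline parameter key
    return params

print(extract_dataset_params(
    {"dataset": "chain_sum", "size": 12, "seed": 43,
     "params": {"min_digits": 2, "allow_negation": True}}
))  # → {'min_digits': 2, 'allow_negation': True}
```

Either spelling in the config file (nested `params:` block or inline keys) ends up as the same keyword arguments to `reasoning_gym.create_dataset`.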

eval/example_config.json (new file, 47 lines)

@@ -0,0 +1,47 @@
{
  "model": "meta-llama/llama-3.3-70b-instruct",
  "provider": "Hyperbolic",
  "output_dir": "results",
  "max_concurrent": 10,
  "default_size": 20,
  "default_seed": 42,
  "categories": [
    {
      "category": "algebra",
      "datasets": [
        {
          "dataset": "complex_arithmetic",
          "params": {
            "min_real": -10,
            "max_real": 10,
            "min_imag": -10,
            "max_imag": 10
          }
        }
      ]
    },
    {
      "category": "arithmetic",
      "datasets": [
        {
          "dataset": "products",
          "size": 10,
          "seed": 43,
          "params": {
            "min_digits": 2,
            "allow_negation": true
          }
        },
        {
          "dataset": "chain_sum",
          "size": 12,
          "seed": 43,
          "params": {
            "min_digits": 2,
            "allow_negation": true
          }
        }
      ]
    }
  ]
}

eval/example_config.yaml (new file, 33 lines)

@@ -0,0 +1,33 @@
# Example configuration for the evaluation script
model: "meta-llama/llama-3.3-70b-instruct"
provider: "Hyperbolic"
output_dir: "results"
max_concurrent: 10
default_size: 20 # Default size for all datasets
default_seed: 42 # Default seed for all datasets
categories:
  - category: "algebra"
    datasets:
      - dataset: "complex_arithmetic"
        params:
          min_real: -10
          max_real: 10
          min_imag: -10
          max_imag: 10
  - category: "arithmetic"
    datasets:
      - dataset: "products"
        size: 10
        seed: 43
        params:
          min_digits: 2
          allow_negation: true
      - dataset: "chain_sum"
        size: 12
        seed: 43
        params:
          min_digits: 2
          allow_negation: true


@@ -1,2 +1,3 @@
openai>=1.64.0
PyYAML>=6.0
tqdm>=4.66.0


@@ -1,12 +0,0 @@
#!/bin/bash
# run this script from the parent directory
./eval.py --yaml yaml/llama-3.3-70b-instruct/algebra.yaml
./eval.py --yaml yaml/llama-3.3-70b-instruct/algorithmic.yaml
./eval.py --yaml yaml/llama-3.3-70b-instruct/arc.yaml
./eval.py --yaml yaml/llama-3.3-70b-instruct/arithmetic.yaml
./eval.py --yaml yaml/llama-3.3-70b-instruct/code.yaml
./eval.py --yaml yaml/llama-3.3-70b-instruct/cognition.yaml
./eval.py --yaml yaml/llama-3.3-70b-instruct/games.yaml
./eval.py --yaml yaml/llama-3.3-70b-instruct/geometry.yaml
./eval.py --yaml yaml/llama-3.3-70b-instruct/graphs.yaml
./eval.py --yaml yaml/llama-3.3-70b-instruct/logic.yaml


@@ -1,31 +0,0 @@
model: anthropic/claude-3.5-sonnet
category: algorithmic
provider: Anthropic
datasets:
- count_primes
- game_of_life
- graph_color
- group_anagrams
- isomorphic_strings
- letter_counting
- letter_jumble
- manipulate_matrix
- number_filtering
- number_sorting
- palindrome
- pool_matrix
- ransom_note
- rotate_matrix
- sentence_reordering
- spell_backward
- spiral_matrix
- string_insertion
- string_manipulation
- string_synthesis
- word_ladder
- word_sequence_reversal
- word_sorting
eval_dir: eval/sonnet-3.5
dataset_size: 50
dataset_seed: 45
developer_role: system


@@ -0,0 +1,34 @@
# Combined configuration for Claude 3.5 Sonnet
model: "anthropic/claude-3.5-sonnet"
provider: "Anthropic"
output_dir: "results"
max_concurrent: 10
default_size: 50
default_seed: 45
categories:
  - category: "algorithmic"
    datasets:
      - dataset: "count_primes"
      - dataset: "game_of_life"
      - dataset: "graph_color"
      - dataset: "group_anagrams"
      - dataset: "isomorphic_strings"
      - dataset: "letter_counting"
      - dataset: "letter_jumble"
      - dataset: "manipulate_matrix"
      - dataset: "number_filtering"
      - dataset: "number_sorting"
      - dataset: "palindrome"
      - dataset: "pool_matrix"
      - dataset: "ransom_note"
      - dataset: "rotate_matrix"
      - dataset: "sentence_reordering"
      - dataset: "spell_backward"
      - dataset: "spiral_matrix"
      - dataset: "string_insertion"
      - dataset: "string_manipulation"
      - dataset: "string_synthesis"
      - dataset: "word_ladder"
      - dataset: "word_sequence_reversal"
      - dataset: "word_sorting"


@@ -0,0 +1,61 @@
# Combined configuration for deepseek-r1
model: "deepseek/deepseek-r1"
provider: "Nebius"
output_dir: "results"
max_concurrent: 10
default_size: 50
default_seed: 45
categories:
  - category: "algebra"
    datasets:
      - dataset: "intermediate_integration"
      - dataset: "polynomial_equations"
      - dataset: "polynomial_multiplication"
      - dataset: "simple_equations"
      - dataset: "simple_integration"
      - dataset: "complex_arithmetic"
  - category: "algorithmic"
    datasets:
      - dataset: "ab"
      - dataset: "base_conversion"
      - dataset: "binary_matrix"
      - dataset: "caesar_cipher"
      - dataset: "count_primes"
      - dataset: "game_of_life"
      - dataset: "graph_color"
      - dataset: "group_anagrams"
      - dataset: "isomorphic_strings"
      - dataset: "letter_counting"
      - dataset: "letter_jumble"
      - dataset: "manipulate_matrix"
      - dataset: "number_filtering"
      - dataset: "number_sorting"
      - dataset: "palindrome"
      - dataset: "pool_matrix"
      - dataset: "ransom_note"
      - dataset: "rotate_matrix"
      - dataset: "sentence_reordering"
      - dataset: "spell_backward"
      - dataset: "spiral_matrix"
      - dataset: "string_insertion"
      - dataset: "string_manipulation"
      - dataset: "string_synthesis"
      - dataset: "word_ladder"
      - dataset: "word_sequence_reversal"
      - dataset: "word_sorting"
  - category: "cognition"
    datasets:
      - dataset: "color_cube_rotation"
      - dataset: "figlet_font"
      - dataset: "number_sequence"
      - dataset: "rubiks_cube"
  - category: "logic"
    datasets:
      - dataset: "propositional_logic"
      - dataset: "self_reference"
      - dataset: "syllogism"
      - dataset: "zebra_puzzles"

@@ -1,9 +0,0 @@
model: anthropic/claude-3.7-sonnet # find model id: https://openrouter.ai/models
provider: Anthropic
category: test
datasets:
- YOUR_DATASET_NAME
eval_dir: results/test
dataset_size: 100
dataset_seed: 42
developer_role: system

@@ -0,0 +1,130 @@
# Combined configuration for llama-3.3-70b-instruct
model: "meta-llama/llama-3.3-70b-instruct"
provider: "Hyperbolic"
output_dir: "results"
max_concurrent: 10
default_size: 50
default_seed: 45
categories:
- category: "algebra"
datasets:
- dataset: "intermediate_integration"
- dataset: "polynomial_equations"
- dataset: "polynomial_multiplication"
- dataset: "simple_equations"
- dataset: "simple_integration"
- dataset: "complex_arithmetic"
- category: "algorithmic"
datasets:
- dataset: "ab"
- dataset: "base_conversion"
- dataset: "binary_alternation"
- dataset: "binary_matrix"
- dataset: "caesar_cipher"
- dataset: "count_primes"
- dataset: "cryptarithm"
- dataset: "game_of_life"
- dataset: "graph_color"
- dataset: "group_anagrams"
- dataset: "isomorphic_strings"
- dataset: "jugs"
- dataset: "letter_counting"
- dataset: "letter_jumble"
- dataset: "manipulate_matrix"
- dataset: "number_filtering"
- dataset: "number_sorting"
- dataset: "palindrome"
- dataset: "palindrome_partitioning"
- dataset: "pool_matrix"
- dataset: "ransom_note"
- dataset: "rotate_matrix"
- dataset: "rotten_oranges"
- dataset: "sentence_reordering"
- dataset: "spell_backward"
- dataset: "spiral_matrix"
- dataset: "string_insertion"
- dataset: "string_manipulation"
- dataset: "string_splitting"
- dataset: "string_synthesis"
- dataset: "word_ladder"
- dataset: "word_sequence_reversal"
- dataset: "word_sorting"
- category: "arc"
datasets:
- dataset: "arc_1d"
- dataset: "arc_agi"
- dataset: "rearc"
- category: "arithmetic"
datasets:
- dataset: "basic_arithmetic"
- dataset: "bitwise_arithmetic"
- dataset: "calendar_arithmetic"
- dataset: "chain_sum"
- dataset: "count_bits"
- dataset: "decimal_arithmetic"
- dataset: "decimal_chain_sum"
- dataset: "dice"
- dataset: "fraction_simplification"
- dataset: "gcd"
- dataset: "gsm_symbolic"
- dataset: "lcm"
- dataset: "leg_counting"
- dataset: "number_format"
- dataset: "power_function"
- dataset: "prime_factorization"
- dataset: "products"
- dataset: "time_intervals"
- category: "code"
datasets:
- dataset: "bf"
- category: "cognition"
datasets:
- dataset: "color_cube_rotation"
- dataset: "figlet_font"
- dataset: "needle_haystack"
- dataset: "number_sequence"
- dataset: "rectangle_count"
- dataset: "rubiks_cube"
- category: "games"
datasets:
- dataset: "countdown"
- dataset: "emoji_mystery"
- dataset: "futoshuki"
- dataset: "knight_swap"
- dataset: "maze"
- dataset: "mini_sudoku"
- dataset: "n_queens"
- dataset: "sokoban"
- dataset: "sudoku"
- dataset: "tower_of_hanoi"
- dataset: "tsumego"
- category: "geometry"
datasets:
- dataset: "simple_geometry"
- dataset: "advanced_geometry"
- category: "graphs"
datasets:
- dataset: "course_schedule"
- dataset: "family_relationships"
- dataset: "largest_island"
- dataset: "list_functions"
- dataset: "quantum_lock"
- dataset: "shortest_path"
- category: "logic"
datasets:
- dataset: "aiw"
- dataset: "circuit_logic"
- dataset: "propositional_logic"
- dataset: "self_reference"
- dataset: "syllogism"
- dataset: "zebra_puzzles"

@@ -1,14 +0,0 @@
model: meta-llama/llama-3.3-70b-instruct
provider: Hyperbolic
category: algebra
datasets:
- intermediate_integration
- polynomial_equations
- polynomial_multiplication
- simple_equations
- simple_integration
- complex_arithmetic
eval_dir: results/llama-3.3-70b-instruct
dataset_size: 50
dataset_seed: 42
developer_role: system

@@ -1,41 +0,0 @@
model: meta-llama/llama-3.3-70b-instruct
provider: Hyperbolic
category: algorithmic
datasets:
- ab
- base_conversion
- binary_alternation
- binary_matrix
- caesar_cipher
- count_primes
- cryptarithm
- game_of_life
- graph_color
- group_anagrams
- isomorphic_strings
- jugs
- letter_counting
- letter_jumble
- manipulate_matrix
- number_filtering
- number_sorting
- palindrome
- palindrome_partitioning
- pool_matrix
- ransom_note
- rotate_matrix
- rotten_oranges
- sentence_reordering
- spell_backward
- spiral_matrix
- string_insertion
- string_manipulation
- string_splitting
- string_synthesis
- word_ladder
- word_sequence_reversal
- word_sorting
eval_dir: results/llama-3.3-70b-instruct
dataset_size: 50
dataset_seed: 45
developer_role: system

@@ -1,11 +0,0 @@
model: meta-llama/llama-3.3-70b-instruct
provider: Hyperbolic
category: arc
datasets:
- arc_1d
- arc_agi
- rearc
eval_dir: results/llama-3.3-70b-instruct
dataset_size: 50
dataset_seed: 45
developer_role: system

@@ -1,26 +0,0 @@
model: meta-llama/llama-3.3-70b-instruct
provider: Hyperbolic
category: arithmetic
datasets:
- basic_arithmetic
- bitwise_arithmetic
- calendar_arithmetic
- chain_sum
- count_bits
- decimal_arithmetic
- decimal_chain_sum
- dice
- fraction_simplification
- gcd
- gsm_symbolic
- lcm
- leg_counting
- number_format
- power_function
- prime_factorization
- products
- time_intervals
eval_dir: results/llama-3.3-70b-instruct
dataset_size: 50
dataset_seed: 45
developer_role: system

@@ -1,9 +0,0 @@
model: meta-llama/llama-3.3-70b-instruct
provider: Hyperbolic
category: code
datasets:
- bf
eval_dir: results/llama-3.3-70b-instruct
dataset_size: 50
dataset_seed: 45
developer_role: system

@@ -1,14 +0,0 @@
model: meta-llama/llama-3.3-70b-instruct
provider: Hyperbolic
category: cognition
datasets:
- color_cube_rotation
- figlet_font
- needle_haystack
- number_sequence
- rectangle_count
- rubiks_cube
eval_dir: results/llama-3.3-70b-instruct
dataset_size: 50
dataset_seed: 45
developer_role: system

@@ -1,19 +0,0 @@
model: meta-llama/llama-3.3-70b-instruct
provider: Hyperbolic
category: games
datasets:
- countdown
- emoji_mystery
- futoshuki
- knight_swap
- maze
- mini_sudoku
- n_queens
- sokoban
- sudoku
- tower_of_hanoi
- tsumego
eval_dir: results/llama-3.3-70b-instruct
dataset_size: 50
dataset_seed: 45
developer_role: system

@@ -1,10 +0,0 @@
model: meta-llama/llama-3.3-70b-instruct
provider: Hyperbolic
category: geometry
datasets:
- simple_geometry
- advanced_geometry
eval_dir: results/llama-3.3-70b-instruct
dataset_size: 50
dataset_seed: 45
developer_role: system

@@ -1,14 +0,0 @@
model: meta-llama/llama-3.3-70b-instruct
provider: Hyperbolic
category: graphs
datasets:
- course_schedule
- family_relationships
- largest_island
- list_functions
- quantum_lock
- shortest_path
eval_dir: results/llama-3.3-70b-instruct
dataset_size: 50
dataset_seed: 45
developer_role: system

@@ -1,14 +0,0 @@
model: meta-llama/llama-3.3-70b-instruct
provider: Hyperbolic
category: logic
datasets:
- aiw
- circuit_logic
- propositional_logic
- self_reference
- syllogism
- zebra_puzzles
eval_dir: results/llama-3.3-70b-instruct
dataset_size: 50
dataset_seed: 45
developer_role: system

eval/yaml/openai-o3.yaml (new file)
@@ -0,0 +1,126 @@
# Combined configuration for openai/o3-mini
model: "openai/o3-mini"
provider: "OpenAI"
output_dir: "results"
max_concurrent: 10
default_size: 50
default_seed: 45
categories:
- category: "algebra"
datasets:
- dataset: "complex_arithmetic"
- dataset: "intermediate_integration"
- dataset: "polynomial_equations"
- dataset: "polynomial_multiplication"
- dataset: "simple_equations"
- dataset: "simple_integration"
- category: "algorithmic"
datasets:
- dataset: "ab"
- dataset: "binary_alternation"
- dataset: "base_conversion"
- dataset: "binary_matrix"
- dataset: "caesar_cipher"
- dataset: "count_primes"
- dataset: "cryptarithm"
- dataset: "game_of_life"
- dataset: "graph_color"
- dataset: "group_anagrams"
- dataset: "isomorphic_strings"
- dataset: "letter_counting"
- dataset: "letter_jumble"
- dataset: "manipulate_matrix"
- dataset: "number_filtering"
- dataset: "number_sorting"
- dataset: "palindrome"
- dataset: "pool_matrix"
- dataset: "ransom_note"
- dataset: "rotate_matrix"
- dataset: "sentence_reordering"
- dataset: "spell_backward"
- dataset: "spiral_matrix"
- dataset: "string_insertion"
- dataset: "string_manipulation"
- dataset: "string_synthesis"
- dataset: "word_ladder"
- dataset: "word_sequence_reversal"
- dataset: "word_sorting"
- category: "arc"
datasets:
- dataset: "arc_1d"
- dataset: "arc_agi"
- dataset: "rearc"
- category: "arithmetic"
datasets:
- dataset: "basic_arithmetic"
- dataset: "bitwise_arithmetic"
- dataset: "calendar_arithmetic"
- dataset: "chain_sum"
- dataset: "count_bits"
- dataset: "decimal_arithmetic"
- dataset: "decimal_chain_sum"
- dataset: "dice"
- dataset: "fraction_simplification"
- dataset: "gcd"
- dataset: "gsm_symbolic"
- dataset: "lcm"
- dataset: "leg_counting"
- dataset: "number_format"
- dataset: "power_function"
- dataset: "prime_factorization"
- dataset: "products"
- dataset: "time_intervals"
- category: "code"
datasets:
- dataset: "bf"
- category: "cognition"
datasets:
- dataset: "color_cube_rotation"
- dataset: "figlet_font"
- dataset: "needle_haystack"
- dataset: "number_sequence"
- dataset: "rectangle_count"
- dataset: "rubiks_cube"
- category: "games"
datasets:
- dataset: "countdown"
- dataset: "emoji_mystery"
- dataset: "futoshuki"
- dataset: "knight_swap"
- dataset: "maze"
- dataset: "mini_sudoku"
- dataset: "n_queens"
- dataset: "sokoban"
- dataset: "sudoku"
- dataset: "tower_of_hanoi"
- dataset: "tsumego"
- category: "geometry"
datasets:
- dataset: "simple_geometry"
- dataset: "advanced_geometry"
- category: "graphs"
datasets:
- dataset: "course_schedule"
- dataset: "family_relationships"
- dataset: "largest_island"
- dataset: "list_functions"
- dataset: "quantum_lock"
- dataset: "shortest_path"
- category: "logic"
datasets:
- dataset: "aiw"
- dataset: "circuit_logic"
- dataset: "propositional_logic"
- dataset: "self_reference"
- dataset: "syllogism"
- dataset: "zebra_puzzles"
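The commit notes that configuration values can also be overridden from the command line. A hedged sketch of how that merge might work — the flag names below are illustrative assumptions, not the script's documented interface, with the rule that only flags the user actually passed replace config values:

```python
import argparse

def merge_overrides(cfg: dict, argv: list[str]) -> dict:
    """Apply command-line overrides on top of a loaded config.
    Flags left unset (None) do not clobber existing config keys."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model")
    parser.add_argument("--output-dir", dest="output_dir")
    parser.add_argument("--max-concurrent", dest="max_concurrent", type=int)
    args = parser.parse_args(argv)
    merged = dict(cfg)
    for key, value in vars(args).items():
        if value is not None:
            merged[key] = value
    return merged

cfg = {"model": "openai/o3-mini", "output_dir": "results", "max_concurrent": 10}
merged = merge_overrides(cfg, ["--max-concurrent", "4"])
# merged keeps model/output_dir from the config, max_concurrent from the CLI
```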

@@ -1,14 +0,0 @@
model: openai/o3-mini
category: algebra
provider: OpenAI
datasets:
- complex_arithmetic
- intermediate_integration
- polynomial_equations
- polynomial_multiplication
- simple_equations
- simple_integration
eval_dir: results/openai-03
dataset_size: 50
dataset_seed: 45
developer_role: system

@@ -1,37 +0,0 @@
model: openai/o3-mini
category: algorithmic
provider: OpenAI
datasets:
- ab
- binary_alternation
- base_conversion
- binary_matrix
- caesar_cipher
- count_primes
- cryptarithm
- game_of_life
- graph_color
- group_anagrams
- isomorphic_strings
- letter_counting
- letter_jumble
- manipulate_matrix
- number_filtering
- number_sorting
- palindrome
- pool_matrix
- ransom_note
- rotate_matrix
- sentence_reordering
- spell_backward
- spiral_matrix
- string_insertion
- string_manipulation
- string_synthesis
- word_ladder
- word_sequence_reversal
- word_sorting
eval_dir: results/openai-03
dataset_size: 50
dataset_seed: 45
developer_role: system

@@ -1,11 +0,0 @@
model: openai/o3-mini
category: arc
provider: OpenAI
datasets:
- arc_1d
- arc_agi
- rearc
eval_dir: results/openai-03
dataset_size: 50
dataset_seed: 45
developer_role: system

@@ -1,26 +0,0 @@
model: openai/o3-mini
category: arithmetic
provider: OpenAI
datasets:
- basic_arithmetic
- bitwise_arithmetic
- calendar_arithmetic
- chain_sum
- count_bits
- decimal_arithmetic
- decimal_chain_sum
- dice
- fraction_simplification
- gcd
- gsm_symbolic
- lcm
- leg_counting
- number_format
- power_function
- prime_factorization
- products
- time_intervals
eval_dir: results/openai-03
dataset_size: 50
dataset_seed: 45
developer_role: system

@@ -1,9 +0,0 @@
model: openai/o3-mini
category: code
provider: OpenAI
datasets:
- bf
eval_dir: results/openai-03
dataset_size: 50
dataset_seed: 45
developer_role: system

@@ -1,14 +0,0 @@
model: openai/o3-mini
category: cognition
provider: OpenAI
datasets:
- color_cube_rotation
- figlet_font
- needle_haystack
- number_sequence
- rectangle_count
- rubiks_cube
eval_dir: results/openai-03
dataset_size: 50
dataset_seed: 45
developer_role: system

@@ -1,19 +0,0 @@
model: openai/o3-mini
category: games
provider: OpenAI
datasets:
- countdown
- emoji_mystery
- futoshuki
- knight_swap
- maze
- mini_sudoku
- n_queens
- sokoban
- sudoku
- tower_of_hanoi
- tsumego
eval_dir: results/openai-03
dataset_size: 50
dataset_seed: 45
developer_role: system

@@ -1,10 +0,0 @@
model: openai/o3-mini
category: geometry
provider: OpenAI
datasets:
- simple_geometry
- advanced_geometry
eval_dir: results/openai-03
dataset_size: 50
dataset_seed: 45
developer_role: system

@@ -1,14 +0,0 @@
model: openai/o3-mini
category: graphs
provider: OpenAI
datasets:
- course_schedule
- family_relationships
- largest_island
- list_functions
- quantum_lock
- shortest_path
eval_dir: results/openai-03
dataset_size: 50
dataset_seed: 45
developer_role: system

@@ -1,14 +0,0 @@
model: openai/o3-mini
category: logic
provider: OpenAI
datasets:
- aiw
- circuit_logic
- propositional_logic
- self_reference
- syllogism
- zebra_puzzles
eval_dir: results/openai-03
dataset_size: 50
dataset_seed: 45
developer_role: system

@@ -1,14 +0,0 @@
model: deepseek/deepseek-r1
provider: Nebius
category: algebra
datasets:
- intermediate_integration
- polynomial_equations
- polynomial_multiplication
- simple_equations
- simple_integration
- complex_arithmetic
eval_dir: results/r1
dataset_size: 50
dataset_seed: 42
developer_role: system

@@ -1,35 +0,0 @@
model: deepseek/deepseek-r1
provider: Nebius
category: algorithmic
datasets:
- ab
- base_conversion
- binary_matrix
- caesar_cipher
- count_primes
- game_of_life
- graph_color
- group_anagrams
- isomorphic_strings
- letter_counting
- letter_jumble
- manipulate_matrix
- number_filtering
- number_sorting
- palindrome
- pool_matrix
- ransom_note
- rotate_matrix
- sentence_reordering
- spell_backward
- spiral_matrix
- string_insertion
- string_manipulation
- string_synthesis
- word_ladder
- word_sequence_reversal
- word_sorting
eval_dir: results/r1
dataset_size: 50
dataset_seed: 45
developer_role: system

@@ -1,12 +0,0 @@
model: deepseek/deepseek-r1
provider: Nebius
category: cognition
datasets:
- color_cube_rotation
- figlet_font
- number_sequence
- rubiks_cube
eval_dir: results/r1
dataset_size: 50
dataset_seed: 42
developer_role: system

@@ -1,12 +0,0 @@
model: deepseek/deepseek-r1
provider: Nebius
category: logic
datasets:
- propositional_logic
- self_reference
- syllogism
- zebra_puzzles
eval_dir: results/r1
dataset_size: 50
dataset_seed: 42
developer_role: system

@@ -37,7 +37,7 @@ class ProceduralDataset(ABC, Sized, Iterable[dict[str, Any]]):
         return item

     @abstractmethod
-    def __getitem__(self, idx: int) -> dict:
+    def __getitem__(self, idx: int) -> dict[str, Any]:
         """Generate a single dataset item

         Args:

@@ -15,6 +15,7 @@ Once you have thought about the reasoning process, provide the answer in the fol
 <answer>answer here</answer>
 Do not explain your reasoning inside the answer tags, provide only the final answer. When an example is provided, you should strictly follow the format of the output/answer in that example.
 """,
+    "simple": "You are a helpful assistant that answers questions accurately and concisely. When asked to solve a problem, show your work step by step. Provide your final answer between <answer> and </answer> tags.",
 }
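The hunk above adds a "simple" entry to what appears to be a system-prompt lookup table. A minimal sketch of how such a table might be consulted when building a chat request — the dict name, the default prompt text, and the helper are assumptions based on the diff, and `developer_role` mirrors the configs' field of the same name (system vs. developer message):

```python
# SYSTEM_PROMPTS is an assumed name for the dict edited in the hunk above;
# the "simple" text is quoted from the diff, the "default" text is a stand-in.
SYSTEM_PROMPTS = {
    "default": (
        "Once you have thought about the reasoning process, provide the "
        "answer between <answer> and </answer> tags."
    ),
    "simple": (
        "You are a helpful assistant that answers questions accurately and "
        "concisely. When asked to solve a problem, show your work step by "
        "step. Provide your final answer between <answer> and </answer> tags."
    ),
}

def build_messages(question: str, developer_role: str = "system",
                   prompt_name: str = "simple") -> list[dict]:
    """Assemble a chat-completion message list with the chosen system prompt."""
    return [
        {"role": developer_role, "content": SYSTEM_PROMPTS[prompt_name]},
        {"role": "user", "content": question},
    ]

messages = build_messages("Sort these words alphabetically: pear, apple")
# messages[0] is the system message, messages[1] the user question
```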