# Runner & Orchestration

**Location**: `src/yc_bench/runner/`

## Overview

The runner is the top-level orchestration layer that ties everything together: parsing arguments, loading configuration, initializing the database, seeding the world, starting the agent loop, and collecting results.

## Components
### Entry Point (`main.py`)

```python
def run_benchmark(args):
    # 1. Load configuration
    cfg = load_config(args.config)

    # 2. Initialize database
    engine, factory = init_db(db_path)

    # 3. Seed world (employees + clients use fixed seed=1 for consistency;
    #    only task generation uses the run seed)
    with session_scope(factory) as session:
        seed_world_transactional(session, cfg, args.seed)

    # 4. Build agent runtime
    runtime = build_runtime(cfg.agent, args.model)

    # 5. Start dashboard (if TTY)
    dashboard = Dashboard(cfg) if is_tty() else None

    # 6. Run agent loop
    result = run_agent_loop(runtime, factory, cfg, dashboard)

    # 7. Save results
    save_rollout(result, args.output)
```
### Design Choices

#### Single-Command Invocation

```bash
uv run yc-bench run --model gemini/gemini-3-flash --seed 1 --config medium
```

**Why single command?** Benchmarks should be easy to reproduce. One command with explicit parameters (model, seed, config) fully specifies a run.
#### Database Per Run

Each run creates a fresh SQLite database:

```
db/run_seed1_medium_2025-03-15.sqlite
```

**Why per-run databases?**

- Isolation: runs can't interfere with each other
- Inspection: can analyze any run's final state after the fact
- Reproducibility: re-running with same seed produces identical database
- Parallelism: multiple runs can execute simultaneously
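For illustration, the per-run path can be derived from the run parameters. A minimal sketch, assuming a helper along these lines (`make_db_path` is a hypothetical name, not necessarily what `main.py` uses):

```python
from datetime import date
from pathlib import Path

def make_db_path(seed: int, config: str, db_dir: str = "db") -> Path:
    """Build a per-run path like db/run_seed1_medium_2025-03-15.sqlite (hypothetical helper)."""
    Path(db_dir).mkdir(parents=True, exist_ok=True)
    return Path(db_dir) / f"run_seed{seed}_{config}_{date.today().isoformat()}.sqlite"
```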
## Argument Parsing (`args.py`)

### Key Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| `--model` | Yes | LLM model identifier (LiteLLM format) |
| `--seed` | Yes | Random seed for world generation |
| `--config` | No | Difficulty preset (default: "medium") |
| `--output` | No | Output path for rollout JSON |
| `--no-dashboard` | No | Disable live terminal UI |
| `--max-turns` | No | Override turn limit |

**Design choice**: Required arguments are minimal (model + seed). Everything else has sensible defaults. This lowers the barrier to running benchmarks while allowing full customization.
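A minimal `argparse` sketch matching the table above (the actual `args.py` may be structured differently):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch of the CLI surface described above; illustrative, not the real args.py.
    p = argparse.ArgumentParser(prog="yc-bench run")
    p.add_argument("--model", required=True, help="LLM model identifier (LiteLLM format)")
    p.add_argument("--seed", type=int, required=True, help="Random seed for world generation")
    p.add_argument("--config", default="medium", help="Difficulty preset")
    p.add_argument("--output", default=None, help="Output path for rollout JSON")
    p.add_argument("--no-dashboard", action="store_true", help="Disable live terminal UI")
    p.add_argument("--max-turns", type=int, default=None, help="Override turn limit")
    return p
```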
## Dashboard (`dashboard.py`)

### Live Terminal UI

The dashboard uses [Rich](https://github.com/Textualize/rich) to display real-time simulation state:

```
┌─ YC-Bench Dashboard ───────────────────────────────┐
│ Model: claude-sonnet-4   Seed: 42   Config: medium │
│ Turn: 87/500             Sim Time: 2025-06-15      │
├────────────────────────────────────────────────────┤
│ Funds: $125,340          Runway: 4.2 months        │
│ Prestige: R:5.2 I:3.8 D:2.1 T:6.4                  │
│ Active Tasks: 3   Completed: 12   Failed: 1        │
├────────────────────────────────────────────────────┤
│ Last Action: task assign abc123 emp456             │
│ Last Event: task_completed (success)               │
└────────────────────────────────────────────────────┘
```

**Design choice**: The dashboard is for human observers, not the agent. It provides real-time visibility into benchmark runs without affecting agent behavior.
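A minimal sketch of this pattern with Rich's `Live` and `Panel` (illustrative only; `dashboard.py` may structure its layout differently):

```python
from rich.live import Live
from rich.panel import Panel
from rich.table import Table

def render(state: dict) -> Panel:
    # Build a compact status panel from the current simulation state.
    grid = Table.grid(padding=(0, 2))
    grid.add_row(f"Turn: {state['turn']}/{state['max_turns']}", f"Funds: ${state['funds']:,}")
    grid.add_row(f"Active Tasks: {state['active']}", f"Completed: {state['completed']}")
    return Panel(grid, title="YC-Bench Dashboard")

# Usage inside the agent loop (illustrative):
# with Live(render(state), refresh_per_second=4) as live:
#     for state in turns:
#         live.update(render(state))
```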
### Features

- Live fund tracking with trend indicators
- Prestige levels per domain
- Task status counters
- Recent agent actions
- Turn counter and simulation clock
- Auto-refreshes on each turn
### Conditional Activation

The dashboard only activates when running in a TTY (interactive terminal). Redirected output and CI environments get plain log output instead.

**Why conditional?** Batch runs (`scripts/`) shouldn't pay terminal-UI overhead. Detecting the TTY selects the right output mode automatically.
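The check itself is a standard-library one-liner; a sketch consistent with the `is_tty()` call in the `main.py` excerpt above:

```python
import sys

def is_tty() -> bool:
    # True for interactive terminals; False for pipes, files, and most CI runners.
    return sys.stdout.isatty()
```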
## Session Management (`session.py`)

### Run Session

Manages the lifecycle of a single benchmark run:

```python
class RunSession:
    db_path: str
    config: ExperimentConfig
    model: str
    seed: int
    start_time: datetime

    def save_rollout(self, result):
        """Save final rollout JSON to results/"""

    def cleanup(self):
        """Clean up temporary resources"""
```

**Design choice**: Session object encapsulates all run-specific state, making it easy to serialize and manage runs.
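As a sketch of what `save_rollout` might do, assuming the rollout format and file naming shown under Results & Output below (the real implementation may differ):

```python
import json
from pathlib import Path

def save_rollout(result: dict, model: str, seed: int, config_name: str) -> Path:
    """Hypothetical sketch: write the rollout as pretty-printed JSON under results/."""
    out_dir = Path("results")
    out_dir.mkdir(exist_ok=True)
    slug = model.split("/")[-1]  # e.g. "claude-sonnet-4-20250514"
    path = out_dir / f"{slug}_seed{seed}_{config_name}.json"
    path.write_text(json.dumps(result, indent=2))
    return path
```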
## Bot Runner Baselines (`scripts/bot_runner.py`)

The bot runner provides deterministic heuristic baselines that operate under the **same constraints** as the LLM agent:

- Same market visibility (browse limit of 50, prestige/trust gating)
- Same economic rules (trust multiplier, work reduction, payroll, salary bumps)
- Same sim resume blocking (no time advance without active tasks)
- Direct DB access (bypasses CLI parsing overhead but applies identical logic)
### Available Strategies

| Strategy | Selection Heuristic |
|----------|---------------------|
| `greedy` | Highest reward among accessible tasks |
| `random` | Random selection (deterministic via seeded RNG) |
| `throughput` | Highest reward per estimated completion hour |
| `prestige` | Phase 1 (prestige < 5): fastest prestige gain. Phase 2: throughput |
### Greedy Baseline Design

The greedy bot is the **"zero strategy" floor** that any competent LLM agent should beat:

- **Sequential execution**: 1 task at a time (`MAX_CONCURRENT_TASKS = 1`)
- **1 task accepted per turn**: Mirrors the LLM's effective pace (browse → accept → assign → dispatch = ~1 task/turn)
- **All employees assigned**: Every employee works on the single active task
- **Prestige-aware browsing**: Filters the market by `required_prestige <= floor(max_prestige)`, sorted by reward DESC
- **No completable filter**: All accessible tasks are candidates (blind to actual completion probability)
- **Tier-average rate estimation**: Uses `E[uniform(0, max_rate)]` per tier for ETA estimates (the same information the LLM has)
- **Trust/prestige gating**: Respects the same acceptance requirements as the LLM

**Design choice**: The greedy bot is intentionally simple: it has no workload management, no client strategy, no domain alignment, and no long-term planning. It picks the highest-paying task it can access and throws all resources at it. This makes it a reliable floor: if an LLM agent can't beat "always pick the biggest number," the agent isn't adding strategic value.
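A compact sketch of the greedy selection and ETA rules described above (the task fields and function names are assumptions, not the script's actual API):

```python
import math

def pick_greedy_task(tasks: list[dict], max_prestige: float) -> dict | None:
    """Highest-reward task within the prestige gate (hypothetical field names)."""
    accessible = [t for t in tasks if t["required_prestige"] <= math.floor(max_prestige)]
    return max(accessible, key=lambda t: t["reward"], default=None)

def estimated_hours(work_units: float, max_rate: float) -> float:
    """ETA using the tier-average rate: E[uniform(0, max_rate)] = max_rate / 2."""
    return work_units / (max_rate / 2)
```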
### Usage

```bash
# Single strategy/config/seed
uv run python scripts/bot_runner.py --bot greedy --config medium --seed 1

# All strategies × all configs × all seeds
uv run python scripts/bot_runner.py
```

Output is written to `results/yc_bench_result_{config}_{seed}_{bot_slug}.json` in the same format as LLM runs, enabling direct comparison in plots.
## Batch Running (`scripts/`)

### Multi-Seed Runs

Scripts for running the same model across multiple seeds:

```bash
# Run seeds 1-10 with claude-sonnet on medium difficulty
for seed in $(seq 1 10); do
  uv run yc-bench run --model anthropic/claude-sonnet-4-20250514 --seed $seed --config medium
done
```

### Multi-Model Comparison

Scripts for comparing models on the same seeds:

```bash
for model in "anthropic/claude-sonnet-4-20250514" "openai/gpt-4o" "google/gemini-pro"; do
  uv run yc-bench run --model "$model" --seed 42 --config medium
done
```

**Design choice**: Simple shell scripts rather than a complex orchestration framework. This keeps the benchmark tooling minimal and transparent.
## Results & Output

### Rollout JSON

Each run produces a rollout file:

```
results/
├── claude-sonnet_seed1_medium.json
├── claude-sonnet_seed2_medium.json
├── gpt-4o_seed1_medium.json
└── ...
```
### Rollout Contents

```json
{
  "metadata": {
    "model": "anthropic/claude-sonnet-4-20250514",
    "seed": 1,
    "config": "medium",
    "start_time": "2025-03-15T10:00:00",
    "end_time": "2025-03-15T10:45:00"
  },
  "outcome": "horizon_end",
  "final_state": {
    "funds_cents": 25000000,
    "prestige": {"research": 7.2, "inference": 5.1, ...},
    "tasks_completed": 24,
    "tasks_failed": 3,
    "tasks_cancelled": 1,
    "turns_used": 187
  },
  "transcript": [
    {"turn": 1, "action": "company status", "result": {...}},
    ...
  ]
}
```
### Plots (`plots/`)

Visualization scripts for comparing model performance:

- Funds over time
- Prestige progression per domain
- Task completion rates
- Comparison charts across models/seeds

**Design choice**: Separate plotting from the benchmark runner. Results are stored as data (JSON); visualization is a post-processing step.
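As a post-processing example, a sketch that loads rollout JSONs and compares final funds across runs (field names follow the rollout format above; the plotting details are illustrative, not the actual `plots/` scripts):

```python
import json
from pathlib import Path

import matplotlib.pyplot as plt

# Gather final funds from every rollout (fields per the rollout format above).
runs = []
for path in sorted(Path("results").glob("*.json")):
    data = json.loads(path.read_text())
    runs.append((path.stem, data["final_state"]["funds_cents"] / 100))

labels, funds = zip(*runs)
Path("plots").mkdir(exist_ok=True)
plt.bar(labels, funds)
plt.ylabel("Final funds (USD)")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.savefig("plots/final_funds.png")
```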
## Error Recovery

### Crash Recovery

If a run crashes (LLM timeout, OOM, etc.):

- The SQLite database persists with the last consistent state
- The rollout JSON may be partial, but it includes the transcript up to the crash
- Re-running with the same seed starts fresh (no resume from crash)

**Design choice**: No crash recovery, by design. Benchmark runs should be atomic: either they complete, or they are re-run from scratch. This prevents partial results from contaminating comparisons.
### Graceful Shutdown

On SIGINT (Ctrl+C):

- The current turn completes
- A partial rollout is saved
- The database is committed
- The dashboard is cleaned up

**Design choice**: Graceful shutdown preserves whatever data exists, which is useful for debugging long runs that need to be interrupted.
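A minimal sketch of this shutdown pattern, assuming a `stop_requested` flag polled by the turn loop (names are illustrative, not the runner's actual internals):

```python
import signal

stop_requested = False

def _on_sigint(signum, frame):
    # Defer shutdown: let the current turn finish, then save and exit cleanly.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGINT, _on_sigint)

# Inside the agent loop (illustrative):
# while turn < max_turns and not stop_requested:
#     run_turn(...)
# save_rollout(partial_result, output_path)  # partial rollout is still saved
```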