yc-bench/system_design/10_runner_orchestration.md
2026-03-19 18:39:57 -07:00

277 lines
9.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Runner & Orchestration
**Location**: `src/yc_bench/runner/`
## Overview
The runner is the top-level orchestration layer that ties everything together: parsing arguments, loading configuration, initializing the database, seeding the world, starting the agent loop, and collecting results.
## Components
### Entry Point (`main.py`)
```python
def run_benchmark(args):
# 1. Load configuration
cfg = load_config(args.config)
# 2. Initialize database
engine, factory = init_db(db_path)
# 3. Seed world (employees + clients use fixed seed=1 for consistency;
# only task generation uses the run seed)
with session_scope(factory) as session:
seed_world_transactional(session, cfg, args.seed)
# 4. Build agent runtime
runtime = build_runtime(cfg.agent, args.model)
# 5. Start dashboard (if TTY)
dashboard = Dashboard(cfg) if is_tty() else None
# 6. Run agent loop
result = run_agent_loop(runtime, factory, cfg, dashboard)
# 7. Save results
save_rollout(result, args.output)
```
### Design Choices
#### Single-Command Invocation
```bash
uv run yc-bench run --model gemini/gemini-3-flash --seed 1 --config medium
```
**Why single command?** Benchmarks should be easy to reproduce. One command with explicit parameters (model, seed, config) fully specifies a run.
#### Database Per Run
Each run creates a fresh SQLite database:
```
db/run_seed1_medium_2025-03-15.sqlite
```
**Why per-run databases?**
- Isolation: runs can't interfere with each other
- Inspection: can analyze any run's final state after the fact
- Reproducibility: re-running with same seed produces identical database
- Parallelism: multiple runs can execute simultaneously
## Argument Parsing (`args.py`)
### Key Arguments
| Argument | Required | Description |
|----------|----------|-------------|
| `--model` | Yes | LLM model identifier (LiteLLM format) |
| `--seed` | Yes | Random seed for world generation |
| `--config` | No | Difficulty preset (default: "medium") |
| `--output` | No | Output path for rollout JSON |
| `--no-dashboard` | No | Disable live terminal UI |
| `--max-turns` | No | Override turn limit |
**Design choice**: Required arguments are minimal (model + seed). Everything else has sensible defaults. This reduces barrier to running benchmarks while allowing full customization.
## Dashboard (`dashboard.py`)
### Live Terminal UI
The dashboard uses [Rich](https://github.com/Textualize/rich) to display real-time simulation state:
```
┌─ YC-Bench Dashboard ──────────────────────────────┐
│ Model: claude-sonnet-4 Seed: 42 Config: medium │
│ Turn: 87/500 Sim Time: 2025-06-15 │
├────────────────────────────────────────────────────┤
│ Funds: $125,340 Runway: 4.2 months │
│ Prestige: R:5.2 I:3.8 D:2.1 T:6.4 │
│ Active Tasks: 3 Completed: 12 Failed: 1 │
├────────────────────────────────────────────────────┤
│ Last Action: task assign abc123 emp456 │
│ Last Event: task_completed (success) │
└────────────────────────────────────────────────────┘
```
**Design choice**: The dashboard is for human observers, not the agent. It provides real-time visibility into benchmark runs without affecting agent behavior.
### Features
- Live fund tracking with trend indicators
- Prestige levels per domain
- Task status counters
- Recent agent actions
- Turn counter and simulation clock
- Auto-refreshes on each turn
### Conditional Activation
Dashboard only activates when running in a TTY (interactive terminal). Redirected output or CI environments get plain log output.
**Why conditional?** Batch runs (scripts/) shouldn't have terminal UI overhead. Detecting TTY ensures the right output mode automatically.
## Session Management (`session.py`)
### Run Session
Manages the lifecycle of a single benchmark run:
```python
class RunSession:
db_path: str
config: ExperimentConfig
model: str
seed: int
start_time: datetime
def save_rollout(self, result):
"""Save final rollout JSON to results/"""
def cleanup(self):
"""Clean up temporary resources"""
```
**Design choice**: Session object encapsulates all run-specific state, making it easy to serialize and manage runs.
## Bot Runner Baselines (`scripts/bot_runner.py`)
The bot runner provides deterministic heuristic baselines that operate under the **same constraints** as the LLM agent:
- Same market visibility (browse limit of 50, prestige/trust gating)
- Same economic rules (trust multiplier, work reduction, payroll, salary bumps)
- Same sim resume blocking (no time advance without active tasks)
- Direct DB access (bypasses CLI parsing overhead but applies identical logic)
### Available Strategies
| Strategy | Selection Heuristic |
|----------|-------------------|
| `greedy` | Highest reward among accessible tasks |
| `random` | Random selection (deterministic via seeded RNG) |
| `throughput` | Highest reward per estimated completion hour |
| `prestige` | Phase 1 (prestige < 5): fastest prestige gain. Phase 2: throughput |
### Greedy Baseline Design
The greedy bot is the **"zero strategy" floor** that any competent LLM agent should beat:
- **Sequential execution**: 1 task at a time (`MAX_CONCURRENT_TASKS = 1`)
- **1 task accepted per turn**: Mirrors the LLM's effective pace (browse accept assign dispatch = ~1 task/turn)
- **All employees assigned**: Every employee works on the single active task
- **Prestige-aware browsing**: Filters market by `required_prestige <= floor(max_prestige)`, sorted by reward DESC
- **No completable filter**: All accessible tasks are candidates (blind to actual completion probability)
- **Tier-average rate estimation**: Uses `E[uniform(0, max_rate)]` per tier for ETA estimates (same information the LLM has)
- **Trust/prestige gating**: Respects the same acceptance requirements as the LLM
**Design choice**: The greedy bot is intentionally simple it has no workload management, no client strategy, no domain alignment, and no long-term planning. It picks the highest-paying task it can access and throws all resources at it. This makes it a reliable floor: if an LLM agent can't beat "always pick the biggest number," the agent isn't adding strategic value.
### Usage
```bash
# Single strategy/config/seed
uv run python scripts/bot_runner.py --bot greedy --config medium --seed 1
# All strategies × all configs × all seeds
uv run python scripts/bot_runner.py
```
Output is written to `results/yc_bench_result_{config}_{seed}_{bot_slug}.json` in the same format as LLM runs, enabling direct comparison in plots.
## Batch Running (`scripts/`)
### Multi-Seed Runs
Scripts for running the same model across multiple seeds:
```bash
# Run seeds 1-10 with claude-sonnet on medium difficulty
for seed in $(seq 1 10); do
uv run yc-bench run --model anthropic/claude-sonnet-4-20250514 --seed $seed --config medium
done
```
### Multi-Model Comparison
Scripts for comparing models on the same seeds:
```bash
for model in "anthropic/claude-sonnet-4-20250514" "openai/gpt-4o" "google/gemini-pro"; do
uv run yc-bench run --model $model --seed 42 --config medium
done
```
**Design choice**: Simple shell scripts rather than a complex orchestration framework. This keeps the benchmark tooling minimal and transparent.
## Results & Output
### Rollout JSON
Each run produces a rollout file:
```
results/
├── claude-sonnet_seed1_medium.json
├── claude-sonnet_seed2_medium.json
├── gpt-4o_seed1_medium.json
└── ...
```
### Rollout Contents
```json
{
"metadata": {
"model": "anthropic/claude-sonnet-4-20250514",
"seed": 1,
"config": "medium",
"start_time": "2025-03-15T10:00:00",
"end_time": "2025-03-15T10:45:00"
},
"outcome": "horizon_end",
"final_state": {
"funds_cents": 25000000,
"prestige": {"research": 7.2, "inference": 5.1, ...},
"tasks_completed": 24,
"tasks_failed": 3,
"tasks_cancelled": 1,
"turns_used": 187
},
"transcript": [
{"turn": 1, "action": "company status", "result": {...}},
...
]
}
```
### Plots (`plots/`)
Visualization scripts for comparing model performance:
- Funds over time
- Prestige progression per domain
- Task completion rates
- Comparison charts across models/seeds
**Design choice**: Separate plotting from the benchmark runner. Results are stored as data (JSON); visualization is a post-processing step.
## Error Recovery
### Crash Recovery
If a run crashes (LLM timeout, OOM, etc.):
- The SQLite database persists with the last consistent state
- Rollout JSON may be partial but includes transcript up to the crash
- Re-running with the same seed starts fresh (no resume from crash)
**Design choice**: No crash recovery by design. Benchmark runs should be atomic -- either complete or re-run. This prevents partial results from contaminating comparisons.
### Graceful Shutdown
On SIGINT (Ctrl+C):
- Current turn completes
- Partial rollout is saved
- Database is committed
- Dashboard is cleaned up
**Design choice**: Graceful shutdown preserves whatever data exists, useful for debugging long runs that need to be interrupted.