mirror of
https://github.com/collinear-ai/yc-bench.git
synced 2026-04-19 12:58:03 +00:00
Add system design documentation for yc-bench
Comprehensive documentation covering all major subsystems: simulation engine, data models, task system, prestige, finances, employees, agent layer, CLI interface, configuration, and runner. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in: parent `b1cd7ebfb2`, commit `ecd3d9e415`
11 changed files with 1858 additions and 0 deletions
system_design/00_overview.md (new file, +98 lines)
# YC-Bench: System Overview

## What is YC-Bench?

YC-Bench is a **long-horizon deterministic benchmark for LLM agents**. It simulates an AI startup CEO managing a company over 1-3 years through a CLI-based interface against a SQLite-backed discrete-event simulation engine. The benchmark tests sustained decision-making over hundreds of turns under compounding financial, prestige, and deadline pressures.

## Core Premise

An LLM agent is dropped into the role of CEO of a small AI startup. It must:

- Browse and accept tasks from a marketplace
- Assign employees to tasks across 4 technical domains
- Manage cash flow (payroll, rewards, penalties)
- Build prestige in each domain to unlock higher-tier tasks
- Survive until the simulation horizon ends without going bankrupt
## Key Metrics

| Dimension | Details |
|-----------|---------|
| Codebase | ~4,975 lines of Python |
| Employees | 10 (hidden per-domain skill rates) |
| Market Tasks | 200+ (configurable) |
| Domains | 4: research, inference, data_environment, training |
| Prestige Range | 1.0 - 10.0 per domain |
| Difficulty Presets | tutorial, easy, medium, hard, nightmare |
## High-Level Architecture

```
┌─────────────────────────────────────────────────────┐
│                    Runner / CLI                     │
│  (argument parsing, dashboard, session management)  │
├─────────────────────────────────────────────────────┤
│                     Agent Layer                     │
│  (LLM runtime, agent loop, tools, prompt building)  │
├─────────────────────────────────────────────────────┤
│                CLI Command Interface                │
│  (company, employee, market, task, sim, finance,    │
│   report, scratchpad)                               │
├─────────────────────────────────────────────────────┤
│              Simulation Engine (core/)              │
│  (event processing, ETA solving, progress tracking, │
│   business time, prestige decay)                    │
├─────────────────────────────────────────────────────┤
│                  Data Layer (db/)                   │
│  (SQLAlchemy ORM models, session management)        │
├─────────────────────────────────────────────────────┤
│           Configuration & World Generation          │
│  (Pydantic schemas, TOML presets, seeding, RNG)     │
└─────────────────────────────────────────────────────┘
```
## Directory Map

```
~/yc_bench_fixed/
├── src/yc_bench/
│   ├── __main__.py        # CLI entry point
│   ├── agent/             # Agent runtime and loop
│   ├── cli/               # Agent-facing CLI commands
│   ├── core/              # Simulation engine
│   ├── db/                # ORM models & session
│   ├── config/            # Pydantic schemas + TOML presets
│   ├── services/          # World generation & RNG
│   └── runner/            # Benchmark orchestration
├── scripts/               # Batch running scripts
├── db/                    # SQLite databases (runtime)
├── results/               # Output JSON rollouts
├── plots/                 # Result visualizations
├── pyproject.toml         # Package definition (uv-based)
└── uv.lock                # Lock file
```
## Execution Flow

1. User runs: `uv run yc-bench run --model <model> --seed 1 --config medium`
2. Runner loads config, initializes DB, seeds world, starts agent loop
3. Agent receives system prompt with company context and available CLI tools
4. Each turn: agent calls CLI commands via `run_command` tool, optionally `python_repl`
5. Agent calls `yc-bench sim resume` to advance simulation time
6. Simulation processes events (completions, payroll, milestones) and returns wake events
7. Loop continues until bankruptcy or horizon end
8. Output: rollout JSON transcript + SQLite game state
## Design Documents

| File | Topic |
|------|-------|
| [01_simulation_engine.md](01_simulation_engine.md) | Core simulation engine and event processing |
| [02_data_models.md](02_data_models.md) | Database schema and ORM design |
| [03_task_system.md](03_task_system.md) | Task lifecycle, ETA, and progress |
| [04_prestige_system.md](04_prestige_system.md) | Prestige mechanics, decay, and gating |
| [05_financial_model.md](05_financial_model.md) | Funds, payroll, ledger, and bankruptcy |
| [06_employee_model.md](06_employee_model.md) | Employee skills, throughput, and growth |
| [07_agent_layer.md](07_agent_layer.md) | LLM runtime, agent loop, and tools |
| [08_cli_interface.md](08_cli_interface.md) | CLI command groups and JSON output |
| [09_configuration.md](09_configuration.md) | Config schema, presets, and world generation |
| [10_runner_orchestration.md](10_runner_orchestration.md) | Benchmark runner, dashboard, and session |
system_design/01_simulation_engine.md (new file, +147 lines)
# Simulation Engine

**Location**: `src/yc_bench/core/`

## Design Choice: Discrete-Event Simulation

YC-Bench uses a **discrete-event simulation (DES)** model rather than a tick-based approach. This was chosen because:

1. **Determinism**: Events are processed in a fixed, reproducible order given the same seed
2. **Efficiency**: Time jumps between events rather than iterating every hour/day
3. **Clarity**: Each state change corresponds to a meaningful event, making the simulation auditable
## Core Loop (`engine.py`)

The `advance_time()` function is the heart of the simulation:

```
advance_time(session, company_id, cfg) → AdvanceResult
```

### Algorithm

1. **Flush progress** on all active tasks (convert elapsed business hours into completed work)
2. **Apply prestige decay** for elapsed days
3. **Process payroll** if crossing a month boundary (first business day)
4. **Fetch next unconsumed event** ordered by `(scheduled_at, priority)`
5. **Dispatch to handler** based on event type
6. **Recalculate ETAs** for affected tasks
7. **Update sim_time** to the event's timestamp
8. **Return wake events** to the agent
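The event-fetch and clock-jump steps above can be sketched with a priority heap. This is a toy illustration, not the project's actual `engine.py`; the `Event` class and simplified `advance_time` signature are hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from datetime import datetime

@dataclass(order=True)
class Event:
    # Heap ordering follows the documented key: (scheduled_at, priority)
    scheduled_at: datetime
    priority: int
    event_type: str = field(compare=False)

def advance_time(sim_time, queue):
    """Pop the next event and jump the simulation clock to its timestamp."""
    if not queue:
        return sim_time, None
    event = heapq.heappop(queue)      # step 4: next unconsumed event
    # steps 1-3 and 5-6 (flush, decay, payroll, dispatch, ETA recalc) would run here
    return event.scheduled_at, event  # step 7: sim_time jumps to the event

# Usage: two events at the same timestamp resolve by priority
t = datetime(2026, 1, 5, 9, 0)
q = []
heapq.heappush(q, Event(t, 2, "bankruptcy"))
heapq.heappush(q, Event(t, 1, "task_completed"))
now, ev = advance_time(datetime(2026, 1, 1), q)
print(ev.event_type)  # task_completed
```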
### Why "Resume" Rather Than Auto-Advance?

The agent explicitly calls `yc-bench sim resume` to advance time. This design:

- Gives the agent control over pacing (plan before advancing)
- Creates a natural decision checkpoint between simulation steps
- Allows multiple CLI queries before committing to advancing
- Guards against stalls: if the agent goes N turns without resuming, the loop forces a resume automatically
## Event System (`events.py`)

### Event Types (Priority Order)

| Priority | Event Type | Trigger |
|----------|-----------|---------|
| 1 | `task_completed` | Task reaches 100% in all domain requirements |
| 2 | `bankruptcy` | Funds drop below zero after payroll |
| 3 | `task_half` | Task reaches 50% progress milestone |
| 4 | `horizon_end` | Simulation time limit reached |
### Design Choice: Fixed Priority Ordering

Events at the same timestamp are processed in strict priority order. This ensures:

- Task completions (and their rewards) are processed before bankruptcy checks
- A task finishing on the same day as payroll can save the company from bankruptcy
- Deterministic behavior regardless of insertion order
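The same-day rescue behavior can be demonstrated in a few lines. This is a hypothetical illustration (the priority map and `process` function are not the project's code): a task reward landing at the same timestamp as a bankruptcy check is applied first, so the solvency check sees the post-reward balance.

```python
# Priorities as documented: lower number = processed first at equal timestamps.
PRIORITY = {"task_completed": 1, "bankruptcy": 2, "task_half": 3, "horizon_end": 4}

def process(events, funds_cents):
    bankrupt = False
    for ts, etype, payload in sorted(events, key=lambda e: (e[0], PRIORITY[e[1]])):
        if etype == "task_completed":
            funds_cents += payload["reward_cents"]   # reward lands first
        elif etype == "bankruptcy":
            bankrupt = funds_cents < 0               # checked against post-reward balance
    return funds_cents, bankrupt

# Company is -$100 after payroll, but a $500 reward arrives the same instant
events = [(10, "bankruptcy", {}), (10, "task_completed", {"reward_cents": 50_000})]
funds, bankrupt = process(events, -10_000)
print(funds, bankrupt)  # 40000 False — the completion saved the company
```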
### Event Identity (Deterministic UUIDs)

Event IDs use `uuid5` based on payload + timestamp + dedupe_key. This means:

- Same world state produces identical event IDs
- Deduplication is automatic (re-inserting same event is a no-op)
- Full reproducibility across runs with same seed
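A minimal sketch of content-derived event IDs using the stdlib `uuid.uuid5`. The namespace string and key layout here are assumptions, not the project's actual scheme; the point is that a stable serialization of the same inputs always yields the same UUID.

```python
import json
import uuid

# Hypothetical namespace for sim events
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "yc-bench/sim-event")

def event_id(event_type, scheduled_at, payload, dedupe_key):
    key = json.dumps(
        {"type": event_type, "at": scheduled_at, "payload": payload, "dedupe": dedupe_key},
        sort_keys=True,  # stable serialization → stable ID
    )
    return uuid.uuid5(NAMESPACE, key)

a = event_id("task_completed", "2026-03-01T09:00", {"task": "t1"}, "t1-complete")
b = event_id("task_completed", "2026-03-01T09:00", {"task": "t1"}, "t1-complete")
print(a == b)  # True — identical inputs, identical UUID, so re-insertion dedupes
```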
## Event Handlers (`handlers/`)

### `task_complete.py`

- Finalizes all domain progress to 100%
- Success check: `sim_time <= deadline`
- On success: add reward funds, add prestige per domain, boost employee skill rates, apply 1% salary bump
- On failure (late): apply prestige penalty per domain (configurable multiplier)

### `task_half.py`

- Marks progress milestone reached
- Informational event for agent awareness (no state changes beyond flag)

### `bankruptcy.py`

- Triggered when `funds_cents < 0` after payroll
- Terminates the simulation with bankruptcy outcome

### `horizon_end.py`

- Triggered at configured simulation end date
- Terminates the simulation with final scoring
## Progress Tracking (`progress.py`)

### Effective Rate Calculation

```
effective_rate = base_rate_per_hour / num_active_tasks_for_this_employee
```

**Design choice**: Throughput splitting creates a resource allocation puzzle. An employee assigned to 3 tasks works at 1/3 speed on each. The agent must balance parallelism vs. focus.

### Progress Flush

When `advance_time()` runs, it calculates work done since the last flush:

```
work = effective_rate × business_hours_elapsed
completed_qty += work   (capped at required_qty)
```
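The flush rule above can be written as a small pure function. A sketch with hypothetical parameter names, not the project's `progress.py`:

```python
def flush_progress(base_rate_per_hour, num_active_tasks,
                   business_hours_elapsed, completed_qty, required_qty):
    """Split the employee's rate across active tasks, then apply elapsed hours."""
    effective_rate = base_rate_per_hour / num_active_tasks
    work = effective_rate * business_hours_elapsed
    return min(completed_qty + work, required_qty)  # capped at the requirement

# 2.0 units/hour split across 2 tasks, over 8 business hours → 8 units of progress
print(flush_progress(2.0, 2, 8.0, completed_qty=10.0, required_qty=100.0))  # 18.0
```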
## Business Time (`business_time.py`)

### Design Choice: Business Hours Only

Work only happens during business hours (weekdays, configurable hours per day). This adds:

- Realistic scheduling constraints
- Weekend gaps that affect deadline calculations
- A reason for the agent to think about calendar timing
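A minimal sketch of counting business hours between two dates, assuming weekday-only work and a configurable `hours_per_day`. The function name and signature are illustrative, not the project's `business_time.py` API.

```python
from datetime import date, timedelta

def business_hours(start, end, hours_per_day=8.0):
    """Count working hours in [start, end), crediting weekdays only."""
    total = 0.0
    d = start
    while d < end:
        if d.weekday() < 5:          # Mon-Fri; weekends contribute nothing
            total += hours_per_day
        d += timedelta(days=1)
    return total

# Friday 2026-01-02 through Monday 2026-01-05: only Friday counts
print(business_hours(date(2026, 1, 2), date(2026, 1, 5)))  # 8.0
```

This is why a deadline that spans a weekend effectively shortens the working time available, as noted above.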
## ETA Solver (`eta.py`)

### Completion Time

```
solve_task_completion_time():
    for each domain d:
        remaining[d] = required_qty[d] - completed_qty[d]
        rate[d]      = sum(effective_rate for assigned employees with skill in d)
        time[d]      = remaining[d] / rate[d]
    completion_time = max(time[d]) across all domains
```

### Design Choice: Multi-Domain Bottleneck

A task completes only when ALL domains finish, so the slowest domain determines completion time. This creates assignment puzzles where the agent must identify and address bottlenecks.

### Halfway Time

Used for progress milestone events. Calculated as the weighted midpoint across domains.
## Prestige Decay

```
apply_prestige_decay(session, company_id, days_elapsed, cfg):
    for each domain:
        prestige -= decay_per_day × days_elapsed
        prestige = max(prestige, prestige_min)   # floor at 1.0
```

**Design choice**: Decay prevents "set and forget" strategies. The agent must continuously work in domains to maintain access to high-tier tasks. Neglected domains revert to baseline.
system_design/02_data_models.md (new file, +190 lines)
# Data Models & Database Design

**Location**: `src/yc_bench/db/`

## Design Choice: SQLAlchemy ORM with SQLite

The benchmark uses SQLAlchemy's declarative ORM over SQLite for several reasons:

1. **Single-file persistence**: SQLite stores the entire game state in one file, making runs portable and inspectable
2. **Transactional safety**: ACID guarantees prevent partial state updates
3. **Query flexibility**: SQL allows complex queries for financial reports, task filtering, etc.
4. **Dual-backend support**: The same ORM works with PostgreSQL via the `DATABASE_URL` environment variable for production/scaling scenarios
## Schema Overview

```
┌──────────────┐      ┌───────────────────┐
│   Company    │────<│  CompanyPrestige   │  (1 per domain × company)
└──────┬───────┘      └───────────────────┘
       │
       ├────<┌──────────────┐      ┌───────────────────┐
       │     │   Employee   │────<│ EmployeeSkillRate  │  (1 per domain × employee)
       │     └──────┬───────┘      └───────────────────┘
       │            │
       │            │    ┌────────────────┐
       │            └───<│ TaskAssignment │  (employee ↔ task junction)
       │                 └────────┬───────┘
       │                          │
       ├────<┌──────────┐────────┘
       │     │   Task   │────<┌─────────────────┐
       │     └──────────┘     │ TaskRequirement │  (1 per domain × task)
       │                      └─────────────────┘
       │
       ├────<┌──────────────┐
       │     │   SimEvent   │  (discrete events queue)
       │     └──────────────┘
       │
       ├────<┌──────────────┐
       │     │ LedgerEntry  │  (financial transactions)
       │     └──────────────┘
       │
       ├────<┌──────────────┐
       │     │   SimState   │  (simulation clock & counters)
       │     └──────────────┘
       │
       └────<┌──────────────┐
             │  Scratchpad  │  (agent persistent memory)
             └──────────────┘
```
## Model Details

### Company (`models/company.py`)

| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID (PK) | Auto-generated |
| `name` | String | Company name |
| `funds_cents` | BigInteger | Financial balance in cents |

**Design choice**: Funds are stored in cents (integer) to avoid floating-point rounding errors in financial calculations. BigInteger supports very large or negative values.

### CompanyPrestige (`models/company.py`)

| Column | Type | Notes |
|--------|------|-------|
| `company_id` | UUID (FK) | References Company |
| `domain` | String | research / inference / data_environment / training |
| `prestige_level` | Float | Range [1.0, 10.0] |

**Design choice**: Prestige is tracked per-domain rather than as a single score. This forces specialization trade-offs and creates a 4-dimensional progression space.
### Employee (`models/employee.py`)

| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID (PK) | Auto-generated |
| `company_id` | UUID (FK) | References Company |
| `name` | String | Employee name |
| `tier` | String | junior / mid / senior |
| `work_hours_per_day` | Float | Hours available per business day |
| `salary_cents` | BigInteger | Monthly salary in cents |

### EmployeeSkillRate (`models/employee.py`)

| Column | Type | Notes |
|--------|------|-------|
| `employee_id` | UUID (FK) | References Employee |
| `domain` | String | One of 4 domains |
| `rate_domain_per_hour` | Float | Work units produced per hour |

**Design choice**: Skill rates are **hidden from the agent**. The agent sees tier and salary but not per-domain effectiveness. This creates an information-asymmetry puzzle: the agent must infer employee strengths from task outcomes.
### Task (`models/task.py`)

| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID (PK) | Auto-generated |
| `company_id` | UUID (FK, nullable) | NULL = market task, set on acceptance |
| `status` | Enum | market → planned → active → completed_success / completed_fail / cancelled |
| `title` | String | Task description |
| `required_prestige` | Float | Minimum prestige needed in ALL task domains |
| `reward_funds_cents` | BigInteger | Payment on successful completion |
| `reward_prestige_delta` | Float | Prestige gained per domain on success |
| `skill_boost_pct` | Float | Employee skill rate increase on success |
| `accepted_at` | DateTime (nullable) | When task was accepted from market |
| `deadline` | DateTime (nullable) | Calculated at acceptance |
| `completed_at` | DateTime (nullable) | When task finished |
| `success` | Boolean (nullable) | True = on-time, False = late |
| `progress_milestone_pct` | Float | Tracks progress milestones (e.g., 50%) |

**Design choice**: A nullable `company_id` elegantly distinguishes market tasks (available for browsing) from accepted tasks (owned by the company).
### TaskRequirement (`models/task.py`)

| Column | Type | Notes |
|--------|------|-------|
| `task_id` | UUID (FK) | References Task |
| `domain` | String | Which domain this requirement covers |
| `required_qty` | Float | Total work units needed |
| `completed_qty` | Float | Work units completed so far |

**Design choice**: Multi-domain requirements make tasks a multi-dimensional optimization problem. A task might need work in 2-4 domains simultaneously.

### TaskAssignment (`models/task.py`)

| Column | Type | Notes |
|--------|------|-------|
| `task_id` | UUID (FK) | References Task |
| `employee_id` | UUID (FK) | References Employee |
| `assigned_at` | DateTime | When assigned |

**Design choice**: Many-to-many junction table. An employee can work on multiple tasks (throughput splits), and a task can have multiple employees (parallel progress).
### SimEvent (`models/event.py`)

| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID (PK) | Deterministic (uuid5) |
| `company_id` | UUID (FK) | References Company |
| `event_type` | String | task_completed / bankruptcy / task_half / horizon_end |
| `scheduled_at` | DateTime | When event triggers |
| `payload` | JSON | Event-specific data |
| `dedupe_key` | String | Prevents duplicate events |
| `consumed` | Boolean | True after processing |

### LedgerEntry (`models/ledger.py`)

| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID (PK) | Auto-generated |
| `company_id` | UUID (FK) | References Company |
| `occurred_at` | DateTime | Transaction timestamp |
| `category` | Enum | MONTHLY_PAYROLL / TASK_REWARD / TASK_FAIL_PENALTY / TASK_CANCEL_PENALTY |
| `amount_cents` | BigInteger | Signed amount (negative = cost) |
| `ref_type` | String (nullable) | Reference entity type |
| `ref_id` | UUID (nullable) | Reference entity ID |

**Design choice**: An immutable, append-only ledger provides a complete financial audit trail. No entries are ever deleted or modified.
### SimState (`models/sim_state.py`)

| Column | Type | Notes |
|--------|------|-------|
| `company_id` | UUID (FK, PK) | References Company |
| `sim_time` | DateTime | Current simulation clock |
| `run_seed` | Integer | RNG seed for reproducibility |
| `horizon_end` | DateTime | When simulation ends |
| `replenish_counter` | Integer | Tracks market task replenishment |

### Scratchpad (`models/scratchpad.py`)

| Column | Type | Notes |
|--------|------|-------|
| `company_id` | UUID (FK) | References Company |
| `content` | Text | Free-form agent notes |

**Design choice**: The scratchpad survives LLM context truncation, giving the agent persistent memory across the full simulation.
## Session Management (`session.py`)

```python
session_scope(factory) → context manager
```

- Creates a scoped session with automatic commit/rollback
- Supports both SQLite (default) and PostgreSQL (via `DATABASE_URL`)
- `init_db()` creates all tables from ORM metadata

**Design choice**: The context manager pattern ensures every database operation is properly transacted, preventing partial state updates that would corrupt the simulation.
system_design/03_task_system.md (new file, +144 lines)
# Task System

**Location**: `src/yc_bench/cli/task_commands.py`, `src/yc_bench/core/eta.py`, `src/yc_bench/core/progress.py`

## Task Lifecycle

```
market ──accept──> planned ──dispatch──> active ──complete──> completed_success
                      │                     │                 completed_fail
                      │                     │
                      └──cancel──> cancelled <──cancel──┘
```
### States

| Status | Meaning |
|--------|---------|
| `market` | Available for browsing, not yet accepted |
| `planned` | Accepted by company, employees can be assigned |
| `active` | Dispatched, work is progressing |
| `completed_success` | Finished on time |
| `completed_fail` | Finished late (past deadline) |
| `cancelled` | Abandoned by agent |
## Design Choices

### Two-Phase Activation (Accept → Dispatch)

Tasks go through `planned` before `active`. This separation:

1. **Allows pre-assignment**: The agent can assign employees before starting the clock
2. **Starts the deadline at accept**: Planning time counts against the deadline, creating urgency
3. **Forces commitment**: Accepting a task reserves it, but the agent must still dispatch
### Deadline Calculation

```
deadline = accepted_at + max(required_qty[d] for all domains d) / deadline_qty_per_day
```

**Design choice**: The deadline is proportional to the largest single-domain requirement, not the sum. Multi-domain tasks therefore don't get proportionally more time; they require parallel work.
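A worked sketch of the formula above. The function name and `deadline_qty_per_day` value are illustrative assumptions:

```python
from datetime import datetime, timedelta

def compute_deadline(accepted_at, required_qty, deadline_qty_per_day):
    """Deadline scales with the largest single-domain requirement, not the sum."""
    days = max(required_qty.values()) / deadline_qty_per_day
    return accepted_at + timedelta(days=days)

accepted = datetime(2026, 3, 1)
# 100 research units + 50 training units at 10 qty/day:
# only the 100-unit bottleneck sets the deadline (10 days, not 15)
print(compute_deadline(accepted, {"research": 100.0, "training": 50.0}, 10.0))
```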
### Prestige Gating at Accept Time

```python
def task_accept(task_id):
    for domain in task.requirements:
        if company_prestige[domain] < task.required_prestige:
            reject(f"Insufficient prestige in {domain}")
```

**Design choice**: The prestige check is per-domain. A task requiring prestige 3.0 with requirements in `research` and `inference` needs prestige >= 3.0 in BOTH domains. This prevents gaming by maxing one domain.
### Cancel Penalties

Cancelling an active task incurs:

- Prestige penalty: `reward_prestige_delta × cancel_multiplier` (configurable per difficulty)
- No financial penalty (just lost opportunity)

**Design choice**: Cancel penalties prevent the strategy of accepting everything and dropping what's inconvenient. Higher difficulties increase the cancel multiplier.
## Employee Assignment

### Assignment Rules

- Employees can only be assigned to `planned` or `active` tasks
- An employee can work on multiple tasks simultaneously (throughput splits)
- Multiple employees can work on the same task (parallel progress)

### Throughput Splitting

```
effective_rate = base_rate_per_hour / num_active_tasks
```

**Design choice**: Linear throughput splitting creates a fundamental trade-off:

- **Focus**: 1 employee on 1 task = full speed
- **Parallel**: 1 employee on 3 tasks = 1/3 speed each
- The agent must decide between fast completion of few tasks vs. slow progress on many
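The focus-vs-parallel trade-off can be made concrete: with linear splitting, total throughput is conserved, so what changes is *when* individual tasks finish. A hypothetical sketch (one employee at 1 unit/hour, three 40-unit tasks):

```python
def focus_schedule(qty, n_tasks, rate):
    """Work tasks one at a time: each task's completion time in hours."""
    return [qty / rate * (i + 1) for i in range(n_tasks)]

def parallel_schedule(qty, n_tasks, rate):
    """Split the rate evenly: identical tasks all finish at the same (late) time."""
    return [qty / (rate / n_tasks)] * n_tasks

print(focus_schedule(40, 3, 1.0))     # [40.0, 80.0, 120.0] — first reward arrives early
print(parallel_schedule(40, 3, 1.0))  # [120.0, 120.0, 120.0] — all rewards arrive late
```

Focusing front-loads rewards (useful against deadlines and cash crunches), while parallelism delays every payout to the same final hour.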
## Progress Tracking (`progress.py`)

### How Work Gets Done

Progress is calculated lazily during `advance_time()`:

```python
for task in active_tasks:
    for employee in task.assignments:
        for req in task.requirements:
            work = employee.skill_rate[req.domain] / employee.num_active_tasks * business_hours
            req.completed_qty = min(req.completed_qty + work, req.required_qty)
```

### Multi-Domain Completion

A task is complete when ALL domain requirements reach `completed_qty >= required_qty`. The slowest domain is the bottleneck.

**Design choice**: This creates interesting optimization puzzles. If a task needs 100 units of research and 50 units of training, the agent should allocate more research-skilled employees to balance completion times.
## ETA Solver (`eta.py`)

### Completion Time Calculation

```python
def solve_task_completion_time(task, assignments, sim_time):
    hours_needed = {}
    for d in task.domains:
        remaining = required_qty[d] - completed_qty[d]
        rate = sum(effective_rate[emp][d] for emp in assignments)
        if rate == 0:
            return infinity  # no one can work on this domain
        hours_needed[d] = remaining / rate

    max_hours = max(hours_needed.values())
    return sim_time + max_hours  # advanced in business hours
```

### Halfway Time Calculation

Used for milestone events. Finds the time when the weighted average across domains reaches 50%.

### When ETAs Are Recalculated

- Task dispatched (new active task)
- Employee assigned/unassigned
- Task completed (frees employee throughput for other tasks)
- Task cancelled (likewise frees throughput)

**Design choice**: Dynamic ETA recalculation ensures events are always accurate. When an employee is reassigned, all affected tasks get new completion projections.
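The pseudocode above can be made runnable. This is a sketch with assumed dict-based inputs, not the project's actual `eta.py` signature; it returns hours rather than a timestamp to stay self-contained:

```python
import math

def solve_completion_hours(required, completed, rates):
    """Hours until every domain finishes; the slowest domain is the bottleneck."""
    hours_needed = 0.0
    for domain, req in required.items():
        remaining = req - completed.get(domain, 0.0)
        rate = rates.get(domain, 0.0)   # summed effective rates of assignees
        if remaining > 0 and rate == 0:
            return math.inf             # no one can work this domain
        if rate > 0:
            hours_needed = max(hours_needed, remaining / rate)
    return hours_needed

eta = solve_completion_hours(
    required={"research": 100.0, "training": 50.0},
    completed={"research": 20.0},
    rates={"research": 2.0, "training": 1.0},
)
print(eta)  # research: 80/2 = 40h; training: 50/1 = 50h → bottleneck is 50.0
```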
## Market Task Generation

See [09_configuration.md](09_configuration.md) for details on how market tasks are generated with stratified prestige distribution and randomized requirements.

### Browsing and Filtering

The `market browse` command supports:

- Domain filter
- Prestige range filter
- Reward range filter
- Pagination (offset/limit)

All output is JSON for agent consumption.
system_design/04_prestige_system.md (new file, +123 lines)
# Prestige System

**Location**: `src/yc_bench/db/models/company.py` (CompanyPrestige), `src/yc_bench/core/engine.py` (decay), `src/yc_bench/core/handlers/task_complete.py` (rewards/penalties)

## Overview

Prestige is YC-Bench's core progression mechanic. It controls access to higher-tier tasks (which offer better rewards) and decays over time, forcing continuous engagement.

## Design Choices
### Per-Domain Prestige (4 Independent Tracks)

```
research:          ████████░░  (8.0)
inference:         ██████░░░░  (6.0)
data_environment:  ███░░░░░░░  (3.0)
training:          █████░░░░░  (5.0)
```

**Why 4 domains?** This creates a 4-dimensional strategic space:

- The agent can't max all domains simultaneously (decay + limited employees)
- Specialization unlocks high-tier tasks in 1-2 domains
- Diversification provides resilience but slower progression
- Multi-domain tasks require balanced prestige across their domains
### Prestige Range: [1.0, 10.0]

| Level | Meaning |
|-------|---------|
| 1.0 | Minimum (starting/decayed) |
| 3.0-4.0 | Mid-tier tasks accessible |
| 7.0-8.0 | High-tier tasks accessible |
| 10.0 | Maximum (hard cap) |

**Design choice**: The 1-10 range is intuitive and provides enough granularity for meaningful gating tiers without over-complicating the system.
## Prestige Gain

On successful task completion (on-time):

```
for each domain in task.requirements:
    company_prestige[domain] += task.reward_prestige_delta
    company_prestige[domain] = min(company_prestige[domain], 10.0)   # cap
```

**Design choice**: Prestige gain is per-domain and tied to the task's requirements. Completing a research+inference task only boosts those two domains, not training or data_environment.
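The gain rule with the [1.0, 10.0] clamp can be sketched in a few lines (names hypothetical):

```python
PRESTIGE_MIN, PRESTIGE_MAX = 1.0, 10.0

def apply_gain(prestige, domains, delta):
    for d in domains:
        prestige[d] = min(prestige[d] + delta, PRESTIGE_MAX)  # gain is capped at 10.0

p = {"research": 9.8, "inference": 4.0, "training": 2.0, "data_environment": 1.0}
apply_gain(p, ["research", "inference"], 0.5)  # only the task's domains move
print(p["research"], p["inference"], p["training"])  # 10.0 4.5 2.0
```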
### Prestige Scaling of Rewards

```
actual_reward = base_reward × (1 + reward_prestige_scale × (prestige - 1))
```

Higher prestige in a domain means better financial returns from tasks in that domain. This creates a virtuous cycle: more prestige → more money → more capacity → more prestige.
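A worked example of the scaling formula. The `scale = 0.1` value is an assumption for illustration, not a documented default:

```python
def scaled_reward(base_reward_cents, prestige, scale=0.1):
    # actual_reward = base_reward × (1 + scale × (prestige − 1))
    return round(base_reward_cents * (1 + scale * (prestige - 1)))

# A $1,000 task at prestige 1.0 vs prestige 8.0 in the relevant domain:
print(scaled_reward(100_000, 1.0))  # 100000 — baseline, no bonus
print(scaled_reward(100_000, 8.0))  # 170000 — +70% at high prestige
```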
## Prestige Loss

### Decay (Daily)

```
prestige -= decay_per_day × days_elapsed
prestige = max(prestige, 1.0)   # floor
```

Default decay: 0.005 prestige per day. This is slow enough not to punish short gaps, but fast enough that inactive domains eventually return to baseline.

**Design choice**: Continuous decay prevents "build once, exploit forever" strategies. The agent must continuously complete tasks in a domain to maintain access.
### Failure Penalty

On late task completion:

```
for each domain in task.requirements:
    company_prestige[domain] -= task.reward_prestige_delta × fail_multiplier
    company_prestige[domain] = max(company_prestige[domain], 1.0)
```

Default `fail_multiplier`: 0.8. Late completion costs almost as much prestige as success would have gained.
### Cancel Penalty

On task cancellation:

```
for each domain in task.requirements:
    company_prestige[domain] -= task.reward_prestige_delta × cancel_multiplier
    company_prestige[domain] = max(company_prestige[domain], 1.0)
```

Cancel multipliers vary by difficulty (higher on hard/nightmare).
## Prestige Gating

Tasks have a `required_prestige` field. At task acceptance:

```python
for domain in task.requirements:
    if company_prestige[domain] < task.required_prestige:
        reject()  # must meet prestige in ALL task domains
```

**Design choice**: Per-domain gating means a task with `required_prestige=5.0` and requirements in research + training needs prestige >= 5.0 in BOTH research AND training. This prevents gaming.
### Stratified Market Tasks

The first 10 market tasks are always prestige-1 (accessible immediately). Higher-prestige tasks are introduced with a stratified distribution. This ensures:

- The agent always has something to work on initially
- Progression is visible (new tasks unlock as prestige grows)
- No dead-end states where the agent can't accept any task
## Strategic Implications

The prestige system creates several key strategic tensions:

1. **Specialize vs. Diversify**: Focus on 1-2 domains for deep access, or spread across all 4?
2. **Risk vs. Reward**: High-prestige tasks pay more, but failure costs more prestige
3. **Maintenance vs. Growth**: Should the agent keep working in mastered domains (maintenance) or push new ones (growth)?
4. **Accept vs. Defer**: Taking a task you might fail risks prestige loss; waiting risks decay

These tensions make the benchmark more than just "do tasks fast": it tests genuine strategic reasoning.
system_design/05_financial_model.md (new file, +162 lines)
# Financial Model

**Location**: `src/yc_bench/db/models/ledger.py`, `src/yc_bench/cli/finance_commands.py`, `src/yc_bench/cli/report_commands.py`, `src/yc_bench/core/handlers/`

## Overview

The financial model simulates a startup's cash flow: revenue from completed tasks, costs from employee payroll, and penalties for failures. Running out of money triggers bankruptcy and ends the simulation.
## Design Choices
|
||||
|
||||
### Cents-Based Integer Arithmetic
|
||||
|
||||
All financial values are stored as `BigInteger` in cents:
|
||||
|
||||
```
|
||||
$1,000.00 = 100_000 cents
|
||||
```
|
||||
|
||||
**Why cents?** Floating-point arithmetic introduces rounding errors that compound over hundreds of transactions. Integer cents guarantee exact financial accounting -- critical for a deterministic benchmark.
|
||||
|
||||
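A quick illustration of the drift this design avoids (a sketch; the amounts are made up):

```python
# Accumulate a $0.10 amount 1000 times, once as a float and once in cents.
float_balance = 0.0
cents_balance = 0
for _ in range(1000):
    float_balance += 0.10   # binary floats cannot represent 0.10 exactly
    cents_balance += 10     # exact integer arithmetic

# float_balance drifts away from 100.0; cents_balance / 100 is exactly 100.0
```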
### Immutable Append-Only Ledger

Every financial transaction creates a `LedgerEntry` that is never modified or deleted:

```python
class LedgerEntry:
    category: MONTHLY_PAYROLL | TASK_REWARD | TASK_FAIL_PENALTY | TASK_CANCEL_PENALTY
    amount_cents: int      # negative for costs, positive for revenue
    occurred_at: datetime
    ref_type: str          # optional reference to source entity
    ref_id: UUID           # optional reference ID
```

**Why immutable?** An append-only ledger provides:

- Complete audit trail for debugging
- Ability to reconstruct the balance at any point in time
- No risk of silent data corruption
- Natural fit for the `finance ledger` and `report monthly` CLI commands

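Point-in-time balance reconstruction falls directly out of the append-only design. A minimal sketch with a stand-in entry type (the real model is the `LedgerEntry` above):

```python
from collections import namedtuple
from datetime import datetime

Entry = namedtuple("Entry", ["amount_cents", "occurred_at"])  # stand-in for LedgerEntry

def balance_at(entries, as_of, opening_cents=0):
    """Reconstruct the balance by summing every entry up to a timestamp."""
    return opening_cents + sum(e.amount_cents for e in entries if e.occurred_at <= as_of)

ledger = [
    Entry(-3_000_000, datetime(2025, 1, 1)),   # January payroll
    Entry(+1_800_000, datetime(2025, 1, 20)),  # task reward
]
balance_at(ledger, datetime(2025, 1, 31), opening_cents=50_000_000)  # 48_800_000
```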
## Revenue Sources

### Task Rewards

On successful (on-time) completion:

```
reward = base_reward × (1 + prestige_scale × (avg_prestige - 1))
```

where `avg_prestige` is averaged across the task's required domains. Higher prestige = higher payouts.

**Design choice**: Prestige-scaled rewards create a positive feedback loop that mirrors real business dynamics -- reputation leads to better opportunities.

### Revenue Timing

Rewards are credited immediately upon task completion (when the `task_completed` event fires with `success=True`).

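A worked example of the reward formula, with a hypothetical `prestige_scale` of 0.2 (the preset value is not shown here):

```python
def scaled_reward(base_reward_cents: int, prestige_scale: float,
                  domain_prestige: list[float]) -> int:
    """reward = base_reward × (1 + prestige_scale × (avg_prestige - 1))"""
    avg_prestige = sum(domain_prestige) / len(domain_prestige)
    return round(base_reward_cents * (1 + prestige_scale * (avg_prestige - 1)))

# Task requiring research + training, prestige 4.0 and 6.0, $10,000 base reward:
scaled_reward(1_000_000, 0.2, [4.0, 6.0])  # 1_800_000 cents = $18,000
```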
## Cost Sources

### Monthly Payroll

Payroll is deducted on the **first business day** of each month:

```
total_payroll = sum(employee.salary_cents for all employees)
```

**Design choice**: Monthly payroll creates predictable but unavoidable costs. The agent must maintain positive cash flow to cover it.

### Salary Bumps

Each completed task increases salaries:

```
for each assigned employee:
    salary_cents *= 1.01   # 1% increase per completion
```

**Design choice**: Compounding salary increases mean success has a hidden cost. Long-running simulations see payroll grow substantially, creating late-game financial pressure even as task rewards scale with prestige.

### Failure Penalties

Late task completion incurs no direct financial penalty beyond the missed reward opportunity. However, the prestige loss from failure reduces future reward scaling.

### Cancel Penalties

Cancellation may incur a financial penalty depending on configuration (some presets charge a fraction of the reward).

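The 1% bump compounds geometrically. A sketch of the projection (the starting salary is hypothetical):

```python
salary_cents = 800_000  # hypothetical $8,000/mo starting salary
projections = {n: salary_cents * 1.01 ** n for n in (10, 50, 100)}
for n, cents in projections.items():
    print(f"after {n} completions: ${cents / 100:,.0f}/mo")
# 100 completions multiply the salary by 1.01**100 ≈ 2.7x
```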
## Payroll-Event Tie-Breaking

When payroll and events fall on the same timestamp:

```
Payroll is processed BEFORE events
```

**Design choice**: This ordering is critical. If a task completes on the same day as payroll:

1. Payroll deducts first (may push funds negative)
2. Task completion reward credits (may save from bankruptcy)
3. Bankruptcy check happens after both

This gives the agent the benefit of the doubt -- a task completing on payday can save the company.

## Bankruptcy

Bankruptcy triggers when `funds_cents < 0` after payroll processing:

```python
if company.funds_cents < 0:
    insert_bankruptcy_event(session, company_id, sim_time)
```

**Design choice**: Bankruptcy is checked only after payroll (not after penalties). This simplifies the model and makes payroll the primary survival constraint.

### Bankruptcy as Terminal State

Once bankruptcy fires, the simulation ends. There is no recovery mechanic.

**Why no bailout?** The benchmark tests whether the agent can sustainably manage a business. Allowing recovery would dilute this signal.

## Financial Reports

### Ledger Query (`finance ledger`)

The agent can query the full transaction history with filters:

- Category filter
- Date-range filter
- Pagination

### Monthly P&L (`report monthly`)

Aggregates transactions by month:

```
Month     Revenue    Payroll    Penalties    Net
2025-01   $50,000    $30,000    $0           $20,000
2025-02   $35,000    $30,300    $5,000       -$300
```

**Design choice**: Structured financial reporting gives the agent the data it needs to make informed decisions about task selection and resource allocation.

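The monthly aggregation can be sketched as a fold over ledger entries (stand-in entry type; the real report presumably aggregates in SQL):

```python
from collections import defaultdict, namedtuple
from datetime import datetime

Entry = namedtuple("Entry", ["amount_cents", "occurred_at"])  # stand-in for LedgerEntry

def monthly_net(entries):
    """Net cash flow per calendar month, summed from ledger-style entries."""
    totals = defaultdict(int)
    for e in entries:
        totals[e.occurred_at.strftime("%Y-%m")] += e.amount_cents
    return dict(totals)

net = monthly_net([
    Entry(5_000_000, datetime(2025, 1, 10)),   # task reward
    Entry(-3_000_000, datetime(2025, 1, 1)),   # January payroll
    Entry(-3_030_000, datetime(2025, 2, 1)),   # February payroll (after bumps)
])
# net == {"2025-01": 2_000_000, "2025-02": -3_030_000}
```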
## Runway Calculation

The `company status` command includes a runway estimate:

```
runway_months = funds_cents / monthly_payroll_cents
```

This helps the agent gauge urgency. Low runway signals that the agent needs profitable tasks quickly.

## Difficulty Scaling

Financial pressure scales with difficulty preset:

| Preset | Initial Funds | Payroll Pressure | Penalties |
|--------|---------------|------------------|-----------|
| tutorial | Very high | Low | Minimal |
| easy | High | Moderate | Low |
| medium | Moderate | Moderate | Standard |
| hard | Low | High | 1.5x |
| nightmare | Very low | Very high | 2x |

143
system_design/06_employee_model.md
Normal file

@@ -0,0 +1,143 @@

# Employee Model

**Location**: `src/yc_bench/db/models/employee.py`, `src/yc_bench/services/generate_employees.py`, `src/yc_bench/core/progress.py`

## Overview

Employees are the company's productive resources. Each has a tier, a salary, and hidden per-domain skill rates. The agent must figure out who is good at what through observation and assign them optimally.

## Design Choices

### Hidden Skill Rates (Information Asymmetry)

The agent sees:

- Employee name, tier (junior/mid/senior), salary
- Which tasks they're currently assigned to

The agent does NOT see:

- Per-domain skill rates (`rate_domain_per_hour`)
- Actual work output per hour

**Why hidden?** This is a core benchmark design decision:

1. **Tests inference ability**: The agent must infer strengths from task completion patterns
2. **Mirrors reality**: Real managers don't have exact productivity metrics for every skill dimension
3. **Creates learning opportunity**: Early task assignments serve as "probes" to discover team capabilities
4. **Rewards memory**: Agents that remember past performance can make better future assignments

### Tier System

| Tier | Typical Rate Range | Salary Range |
|------|--------------------|--------------|
| junior | Low | Low |
| mid | Medium | Medium |
| senior | High | High |

**Design choice**: Tiers provide a rough signal. Seniors are generally better, but not always in every domain. A junior might excel in one domain while a senior is mediocre there. The tier-salary correlation creates a cost-benefit trade-off.

### Per-Domain Skill Rates

Each employee has 4 skill rates (one per domain):

```python
class EmployeeSkillRate:
    domain: str                   # research, inference, data_environment, training
    rate_domain_per_hour: float   # work units produced per business hour
```

Rates are generated from configurable distributions (triangular, beta, etc.) during world seeding. Some employees are specialists (high in one domain, low in others); some are generalists.

**Design choice**: The 4-rate vector per employee creates a rich assignment optimization space. Optimal assignment requires matching employee strengths to task domain requirements.

## Throughput Splitting

When an employee works on multiple active tasks simultaneously:

```
effective_rate = base_rate / num_active_tasks
```

**Design choice**: Linear splitting (not diminishing returns or context-switching penalties) was chosen for simplicity and predictability. The agent can reason about it without hidden costs.

### Example

Employee Alice has `research_rate = 2.0/hr`:

- Assigned to 1 task: contributes 2.0 research units/hour
- Assigned to 3 tasks: contributes ~0.67 research units/hour to each

### Implication for Strategy

The agent faces a fundamental trade-off:

- **Focused assignment**: 1 employee → 1 task = fastest completion but no parallelism
- **Spread assignment**: 1 employee → N tasks = slower per task but progress on multiple fronts
- **Optimal**: Match the strategy to deadline pressure and task urgency

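Linear splitting makes per-task ETAs easy to project. A minimal sketch (function names are illustrative):

```python
def effective_rate(base_rate: float, num_active_tasks: int) -> float:
    """Linear throughput split across simultaneous active tasks."""
    return base_rate / max(num_active_tasks, 1)

def eta_hours(remaining_units: float, base_rate: float, num_active_tasks: int) -> float:
    """Business hours until the remaining work units are done at the split rate."""
    return remaining_units / effective_rate(base_rate, num_active_tasks)

# Alice (2.0 research units/hr) split across 3 tasks, 40 units remaining on one:
eta_hours(40, 2.0, 3)  # ~60 business hours
```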
## Skill Growth

On successful task completion, assigned employees get a skill boost:

```python
for each assigned employee:
    for each domain in task.requirements:
        skill_rate[domain] *= (1 + task.skill_boost_pct / 100)
```

**Design choice**: Skill growth compounds over time. Early investments in employee development pay off later through faster task completion. This creates a "training vs. exploiting" tension.

### Salary Bumps (Hidden Cost of Growth)

Each task completion also increases salaries:

```python
for each assigned employee:
    salary_cents *= 1.01   # 1% increase
```

**Design choice**: Salary bumps mean that experienced employees cost more. The agent can't infinitely scale employee productivity without also scaling costs. After many completions, payroll may become a significant burden.

## Employee Generation (`generate_employees.py`)

### Process

1. Generate 10 employees per company (configurable)
2. Assign tiers based on the configured distribution (e.g., 30% junior, 40% mid, 30% senior)
3. For each employee, generate 4 skill rates from per-tier distributions
4. Set salary based on the tier bracket

### Distribution Types

Skill rates are drawn from configurable distributions:

- **Triangular**: min/mode/max (default -- creates realistic bell-curve-like distributions)
- **Beta**: alpha/beta parameters (useful for skewed distributions)
- **Normal**: mean/std (truncated to positive values)
- **Uniform**: low/high
- **Constant**: fixed value

**Design choice**: Configurable distributions allow difficulty presets to create different workforce profiles. Tutorial mode might use tight distributions (predictable employees), while nightmare mode uses wide distributions (unpredictable).

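A sketch of per-tier rate sampling using Python's stdlib `random.triangular` (the tier parameter triples below are made-up illustrations, not the preset's values):

```python
import random

rng = random.Random(42)  # seeded for determinism

DOMAINS = ("research", "inference", "data_environment", "training")

def sample_rates(tier: str) -> dict[str, float]:
    """Sample a 4-rate skill vector from hypothetical per-tier triangular ranges."""
    low, mode, high = {
        "junior": (0.5, 1.0, 1.5),
        "mid": (1.0, 1.8, 2.5),
        "senior": (1.5, 2.5, 4.0),
    }[tier]
    # random.triangular takes (low, high, mode)
    return {d: rng.triangular(low, high, mode) for d in DOMAINS}

rates = sample_rates("senior")  # e.g. a specialist-or-generalist 4-rate vector
```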
## Employee Visibility to Agent

The `employee list` CLI command returns:

```json
{
  "employees": [
    {
      "id": "uuid",
      "name": "Alice Chen",
      "tier": "senior",
      "salary": "$8,000/mo",
      "active_tasks": 2
    }
  ]
}
```

Note: no skill rates, no per-domain breakdown, no historical performance. The agent must build this knowledge through experience.

## Strategic Considerations

1. **Discovery phase**: Early on, assign different employees to different domain tasks to learn strengths
2. **Specialization**: Once strengths are known, match employees to their best domains
3. **Load balancing**: Avoid overloading one employee (throughput-splitting penalty)
4. **Growth investment**: Assign employees to tasks in domains where they need improvement
5. **Cost awareness**: Track which employees have had many salary bumps

243
system_design/07_agent_layer.md
Normal file

@@ -0,0 +1,243 @@

# Agent Layer

**Location**: `src/yc_bench/agent/`

## Overview

The agent layer connects an LLM to the simulation via a tool-use interface. It manages the conversation loop, prompt construction, tool execution, and run-state tracking.

## Architecture

```
┌─────────────────────────┐
│       Agent Loop        │
│        (loop.py)        │
├─────────────────────────┤
│ ┌──────────┐ ┌────────┐ │
│ │  Prompt  │ │ Tools  │ │
│ │ Builder  │ │        │ │
│ └──────────┘ └────────┘ │
├─────────────────────────┤
│       LLM Runtime       │
│       (runtime/)        │
│  LiteLLM abstraction    │
├─────────────────────────┤
│ Run State / Transcript  │
│     (run_state.py)      │
└─────────────────────────┘
```

## Design Choices

### LiteLLM as LLM Abstraction (`runtime/`)

The agent uses [LiteLLM](https://github.com/BerriAI/litellm) to abstract away vendor differences:

```python
# Supports: Anthropic, OpenAI, OpenRouter, Google Gemini, etc.
response = litellm.completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=messages,
    tools=tools,
)
```

**Why LiteLLM?**

- Single interface for all major LLM providers
- Consistent tool-use format across providers
- Easy to benchmark different models on the same scenarios
- Handles auth, retries, and format conversion

### Tool-Use Interface (Not Text Parsing)

The agent interacts via structured tool calls, not text command parsing:

```json
{
  "name": "run_command",
  "arguments": {
    "command": "yc-bench task list --status active"
  }
}
```

**Why tool-use?**

- Eliminates parsing ambiguity
- Works with all modern LLMs' native tool-use
- Structured output from CLI commands (JSON) flows cleanly back
- Reduces the error rate vs. free-text command generation

### Available Tools

#### `run_command`

Executes CLI commands in a subprocess. The agent can run any `yc-bench` CLI command.

```python
def run_command(command: str) -> str:
    """Execute a yc-bench CLI command and return output."""
```

**Design choice**: Subprocess execution provides isolation. The agent can't accidentally modify simulation state outside of the defined CLI commands.

#### `python_repl` (Optional)

A persistent Python interpreter for calculations and data analysis.

```python
def python_repl(code: str) -> str:
    """Execute Python code and return output."""
```

**Design choice**: Some agents benefit from being able to compute (e.g., calculate optimal assignments, project cash flow). This tool is optional and configurable.

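A minimal sketch of how `run_command` might wrap subprocess execution (the allowlist check and timeout are assumptions, not confirmed implementation details):

```python
import shlex
import subprocess

def run_command(command: str, timeout_s: int = 60) -> str:
    """Execute a yc-bench CLI command in an isolated subprocess."""
    argv = shlex.split(command)
    # Hypothetical allowlist: only the yc-bench CLI entry point is permitted.
    if not argv or argv[0] != "yc-bench":
        return '{"error": "only yc-bench commands are allowed"}'
    result = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout_s
    )
    return result.stdout if result.returncode == 0 else result.stderr
```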
## Agent Loop (`loop.py`)

### Main Loop

```python
def run_agent_loop(runtime, session, company_id, cfg):
    while not terminal:
        # Build messages (system prompt + history)
        messages = build_messages(history, context)

        # Call LLM
        response = runtime.completion(messages, tools)

        # Process tool calls
        for tool_call in response.tool_calls:
            result = execute_tool(tool_call)
            history.append(tool_call, result)

            # Check for terminal conditions
            if is_terminal(result):
                break

        # Auto-resume if the agent hasn't advanced the simulation
        if turns_since_resume > max_turns_without_resume:
            force_resume()
```

### Design Choices in the Loop

#### History Truncation

```python
# Keep only the last N turns to fit the context window
messages = system_prompt + history[-max_history_turns:]
```

**Why truncate?** Long simulations generate hundreds of turns. Without truncation, the context would exceed any model's window. The scratchpad CLI command compensates for lost history.

#### Auto-Resume Forcing

If the agent doesn't call `yc-bench sim resume` for N turns, the loop forces one:

```python
if turns_since_resume > cfg.loop.max_turns_without_resume:
    result = execute("yc-bench sim resume")
```

**Why force?** Some models get stuck in analysis loops, repeatedly querying state without advancing. Auto-resume prevents infinite loops and ensures forward progress.

#### Turn Budget

The loop has a maximum turn count. This prevents runaway agents and bounds benchmark cost.

## Prompt Construction (`prompt.py`)

### System Prompt Structure

```
1. Role description ("You are the CEO of an AI startup...")
2. Available commands reference
3. Current company status summary
4. Strategic guidance (domain, prestige, deadlines)
5. Constraints and rules
```

**Design choice**: The system prompt provides enough context for the agent to understand its role without revealing internal mechanics (like hidden skill rates or exact formulas).

### Context Building

Each turn, the prompt may include:

- Wake events from the last `sim resume`
- Current funds and runway
- Active task count and approaching deadlines
- Prestige levels

This contextual information helps the agent make informed decisions without needing to query every turn.

## Run State (`run_state.py`)

### Transcript Recording

Every turn is recorded:

```python
{
    "turn": 42,
    "messages": [...],
    "tool_calls": [...],
    "tool_results": [...],
    "timestamp": "2025-03-15T10:30:00",
    "tokens_used": 1500
}
```

**Design choice**: Full transcripts enable:

- Post-hoc analysis of agent strategy
- Debugging agent failures
- Benchmark scoring based on decision quality
- Comparison across models

### Output Format

The final rollout is saved as JSON:

```json
{
  "model": "anthropic/claude-sonnet-4-20250514",
  "seed": 42,
  "config": "medium",
  "outcome": "horizon_end",
  "final_funds": 250000,
  "final_prestige": {"research": 7.2, ...},
  "turns": 187,
  "transcript": [...]
}
```

## Command Execution Policy (`commands/`)

### Command Allowlist

The agent can only execute `yc-bench` CLI commands. Arbitrary shell commands are blocked.

**Design choice**: Restricting to the CLI API ensures:

- No direct database manipulation
- No simulation-state bypass
- Fair comparison across models
- Deterministic state transitions

### Error Handling

Invalid commands return structured error messages:

```json
{"error": "Task not found", "task_id": "..."}
```

**Design choice**: Structured errors help the agent understand and recover from mistakes, rather than receiving opaque stack traces.

## Retry and Timeout Logic

```python
# Exponential backoff for LLM API calls
for attempt in range(max_retries):
    try:
        response = runtime.completion(messages, tools)
        break
    except RateLimitError:
        wait(2 ** attempt)
```

**Design choice**: LLM APIs are unreliable. Retry logic ensures transient failures don't corrupt benchmark runs.

173
system_design/08_cli_interface.md
Normal file

@@ -0,0 +1,173 @@

# CLI Interface

**Location**: `src/yc_bench/cli/`

## Overview

The CLI is the agent's sole interface to the simulation. Every command returns structured JSON, enabling reliable parsing by LLMs.

## Design Choices

### JSON-Only Output

All CLI commands return JSON, never free text:

```bash
$ yc-bench company status
{
  "company_name": "Nexus AI",
  "funds": "$150,000.00",
  "funds_cents": 15000000,
  "monthly_payroll": "$30,000.00",
  "runway_months": 5.0,
  "prestige": {
    "research": 3.5,
    "inference": 2.1,
    "data_environment": 1.0,
    "training": 4.2
  }
}
```

**Why JSON?**

- Unambiguous parsing by LLMs (vs. formatted tables)
- Consistent structure across all commands
- Easy to pipe into `python_repl` for analysis
- Machine-readable without regex or text parsing

### Command Group Organization

| Group | File | Purpose |
|-------|------|---------|
| `company` | `company_commands.py` | Company status, prestige overview |
| `employee` | `employee_commands.py` | Employee listing and details |
| `market` | `market_commands.py` | Browse available tasks |
| `task` | `task_commands.py` | Task lifecycle (accept/assign/dispatch/cancel/inspect/list) |
| `sim` | `sim_commands.py` | Simulation control (resume) |
| `finance` | `finance_commands.py` | Ledger queries |
| `report` | `report_commands.py` | Monthly P&L reports |
| `scratchpad` | `scratchpad_commands.py` | Persistent agent memory |

**Design choice**: Command groups mirror real business functions (operations, HR, finance, strategy). This makes the interface intuitive for LLM agents that have been trained on business concepts.

## Command Details

### Company Commands

#### `company status`

Returns current funds, payroll, runway, and prestige levels per domain.

**Design choice**: A single command gives the agent a complete financial and strategic snapshot, reducing the number of calls needed per decision cycle.

### Employee Commands

#### `employee list`

Returns all employees with tier, salary, and current active task count.

**Design choice**: Shows active task count but NOT skill rates. The agent must infer capabilities.

### Market Commands

#### `market browse [--domain X] [--min-prestige N] [--max-prestige N] [--offset O] [--limit L]`

Browse available market tasks with optional filters.

**Design choice**: Filtering and pagination prevent information overload. The agent can focus on tasks matching its current prestige level and strategic goals.

### Task Commands

#### `task accept <task_id>`

Accept a market task. Validates prestige requirements. Sets the deadline.

#### `task assign <task_id> <employee_id>`

Assign an employee to a planned/active task. Recalculates ETAs.

#### `task dispatch <task_id>`

Start work on a planned task. Changes its status to active.

#### `task cancel <task_id>`

Cancel a task. Applies the prestige penalty. Frees employees.

#### `task inspect <task_id>`

Detailed view of a single task: requirements, progress, assignments, deadline.

#### `task list [--status X]`

List company tasks with an optional status filter.

**Design choice**: The accept → assign → dispatch flow gives the agent explicit control over each phase. This mirrors real project management, where you scope, staff, and then kick off work.

### Simulation Commands

#### `sim resume`

Advance the simulation to the next event. Returns wake events.

```json
{
  "advanced_to": "2025-02-15T09:00:00",
  "wake_events": [
    {"type": "task_completed", "task_id": "...", "success": true},
    {"type": "payroll", "amount": -3000000}
  ]
}
```

**Design choice**: Resume is the only way to advance time. The agent explicitly chooses when to move forward, creating natural decision checkpoints.

### Finance Commands

#### `finance ledger [--category X] [--from DATE] [--to DATE] [--offset O] [--limit L]`

Query the immutable transaction history.

**Design choice**: Full ledger access lets sophisticated agents analyze spending patterns and project future cash flow.

### Report Commands

#### `report monthly`

Aggregated P&L by month.

**Design choice**: Monthly reports provide a higher-level financial view than raw ledger entries, useful for strategic planning.

### Scratchpad Commands

#### `scratchpad read`

Read persistent notes.

#### `scratchpad write <content>`

Overwrite the scratchpad contents.

#### `scratchpad append <content>`

Add to the existing scratchpad.

#### `scratchpad clear`

Clear the scratchpad.

**Design choice**: The scratchpad is critical for long simulations where LLM context gets truncated. The agent can store:

- Employee capability observations
- Strategic plans
- Financial projections
- Task priority lists

This compensates for context-window limitations and tests whether the agent proactively maintains external memory.

## Error Handling

All commands return structured errors:

```json
{
  "error": "Insufficient prestige in research (have 2.3, need 4.0)"
}
```

**Design choice**: Descriptive error messages help the agent understand what went wrong and adjust its strategy, rather than failing silently or with cryptic messages.

## CLI Entry Point (`__main__.py`)

The CLI uses a command-line parser (likely Click or argparse) to route commands to handler functions. Each handler:

1. Opens a database session
2. Validates inputs
3. Performs the operation
4. Returns JSON output
5. Commits or rolls back the transaction

**Design choice**: Each CLI call is a self-contained transaction. This prevents partial state updates and ensures the simulation remains consistent.

203
system_design/09_configuration.md
Normal file

@@ -0,0 +1,203 @@

# Configuration System

**Location**: `src/yc_bench/config/`

## Overview

The configuration system uses Pydantic models validated from TOML preset files. It controls every aspect of the simulation: world-generation parameters, difficulty tuning, agent behavior, and distribution specifications.

## Design Choices

### Pydantic Schema (`schema.py`)

The configuration hierarchy:

```
ExperimentConfig
├── AgentConfig      # LLM model, tools, retry settings
├── LoopConfig       # Turn budget, auto-resume threshold
├── SimConfig        # Simulation parameters
└── WorldConfig      # World generation parameters
    ├── CompanyConfig    # Initial funds, starting prestige
    ├── EmployeeConfig   # Team size, tier distribution, salary ranges
    ├── TaskConfig       # Task count, domain requirements, deadlines
    └── PrestigeConfig   # Decay rate, penalty multipliers, scaling
```

**Why Pydantic?**

- Type validation at load time (catches config errors early)
- Default values with optional overrides
- Discriminated unions for distribution specs
- Clear documentation through type annotations
- Serialization to/from TOML/JSON

### TOML Preset Files (`presets/`)

```toml
# medium.toml
[world]
initial_funds_cents = 500_000_00

[world.prestige]
decay_per_day = 0.005
penalty_fail_multiplier = 0.8
penalty_cancel_multiplier = 1.0

[world.tasks]
count = 200
deadline_qty_per_day = 11.0

[world.tasks.reward_funds]
type = "triangular"
min = 5000_00
mode = 15000_00
max = 50000_00
```

**Why TOML?** Human-readable, supports comments, expresses hierarchy naturally via sections, and is widely supported in Python. Better than JSON for config files (comments), simpler than YAML (fewer gotchas).

### Preset Hierarchy

| Preset | Focus | Key Characteristics |
|--------|-------|---------------------|
| `default.toml` | Base | All defaults; other presets override selectively |
| `tutorial.toml` | Learning | Relaxed deadlines, prestige-1 tasks only, high funds |
| `easy.toml` | Casual | Relaxed deadlines, flat prestige requirements |
| `medium.toml` | Standard | Prestige climbing, 2-domain tasks, 9-day deadlines |
| `hard.toml` | Challenge | Prestige gating active, 7-day deadlines, 1.5x cancel penalty |
| `nightmare.toml` | Extreme | Razor-thin margins, 6-day deadlines, 2x penalties |

**Design choice**: Preset-based difficulty rather than a single "difficulty slider" allows fine-grained control. Each preset can tune dozens of independent parameters.

### Config Loading (`loader.py`)

```python
def load_config(preset_name: str) -> ExperimentConfig:
    base = load_toml("default.toml")
    overlay = load_toml(f"{preset_name}.toml")
    merged = deep_merge(base, overlay)
    return ExperimentConfig(**merged)
```

**Design choice**: Config inheritance via deep merge. Presets only specify what differs from the default, keeping preset files concise and maintainable.

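A minimal sketch of a `deep_merge` helper like the one referenced above (the actual implementation is not shown here):

```python
def deep_merge(base: dict, overlay: dict) -> dict:
    """Recursively merge overlay into base; overlay values win on conflict."""
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# A preset only overrides the deadline; the task count is inherited from default.
cfg = deep_merge(
    {"world": {"tasks": {"count": 200, "deadline_qty_per_day": 9.0}}},
    {"world": {"tasks": {"deadline_qty_per_day": 7.0}}},
)
# cfg["world"]["tasks"] == {"count": 200, "deadline_qty_per_day": 7.0}
```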
## Distribution Specifications (`sampling.py`)
|
||||
|
||||
### The DistSpec System
|
||||
|
||||
Many world generation parameters use statistical distributions rather than fixed values:
|
||||
|
||||
```python
|
||||
class DistSpec(BaseModel):
|
||||
"""Discriminated union of distribution types."""
|
||||
type: Literal["triangular", "beta", "normal", "uniform", "constant"]
|
||||
# Parameters vary by type
|
||||
```
|
||||
|
||||
**Supported distributions:**
|
||||
|
||||
| Type | Parameters | Use Case |
|
||||
|------|-----------|----------|
|
||||
| `triangular` | min, mode, max | Task rewards, skill rates (natural asymmetric bell curve) |
|
||||
| `beta` | alpha, beta, scale | Prestige requirements (skewed toward low values) |
|
||||
| `normal` | mean, std | Symmetric variation around a target |
|
||||
| `uniform` | low, high | Equal probability across range |
|
||||
| `constant` | value | Fixed value (no randomness) |
|
||||
|
||||
**Why discriminated unions?** Pydantic validates the correct parameters for each distribution type at load time. Invalid combinations (e.g., triangular with alpha parameter) are caught before the simulation runs.

### Usage Example

```toml
[world.tasks.reward_funds]
type = "triangular"
min = 5000_00
mode = 15000_00
max = 50000_00

[world.employees.junior_rate]
type = "beta"
alpha = 2.0
beta = 5.0
scale = 3.0
```

## World Generation

### Seeding (`services/seed_world.py`)

```python
def seed_world_transactional(session, cfg, seed):
    rng = create_rng(seed)
    company = create_company(session, cfg.world.company)
    employees = generate_employees(session, company, cfg.world.employees, rng)
    tasks = generate_tasks(session, cfg.world.tasks, rng)
    sim_state = create_sim_state(session, company, cfg.sim, seed)
```

**Design choice**: Single-transaction world seeding ensures atomic creation. Either the entire world is created or nothing is -- no partial states.
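The atomicity comes from a commit-or-rollback session scope around the whole seeding call. The project wraps SQLAlchemy sessions; this sketch shows the same pattern with stdlib `sqlite3` (names and details are assumptions):

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def session_scope(db_path: str):
    """Commit on success, roll back on any error -- no partial worlds."""
    conn = sqlite3.connect(db_path)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()

# Usage: if any generator inside the block raises, no rows are persisted.
with session_scope(":memory:") as conn:
    conn.execute("CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO company (name) VALUES ('yc-bench')")
```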

### Employee Generation (`services/generate_employees.py`)

1. Generate N employees (default 10)
2. Assign tiers from configured distribution (e.g., 30/40/30 junior/mid/senior)
3. For each employee, sample 4 skill rates from per-tier distributions
4. Set salary based on tier range

### Task Generation (`services/generate_tasks.py`)

1. Generate M tasks (default 200+)
2. First 10 tasks are always prestige-1 (guaranteed accessible)
3. Remaining tasks have stratified prestige requirements
4. Each task gets 2-4 domain requirements sampled from distributions
5. Rewards scale with prestige and task size

**Design choice**: Stratified generation ensures:

- The agent always has starting tasks (prestige-1 guaranteed)
- Tasks span the full prestige range (progression is possible)
- No prestige "dead zones" where no tasks exist

### RNG Management (`services/rng.py`)

```python
import numpy

def create_rng(seed: int) -> numpy.random.Generator:
    return numpy.random.default_rng(seed)
```

**Design choice**: Centralized RNG with explicit seed ensures full reproducibility. Same seed → same world → same event sequence (given same agent actions).
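The reproducibility claim is directly checkable: two generators built from the same seed yield identical draws.

```python
import numpy

# Same seed, two independent generators: the streams match element-for-element.
a = numpy.random.default_rng(42).integers(0, 100, size=5)
b = numpy.random.default_rng(42).integers(0, 100, size=5)
assert (a == b).all()
```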

## Key Configuration Parameters

### Financial Tuning

| Parameter | Default | Effect |
|-----------|---------|--------|
| `initial_funds_cents` | 500,000 | Starting capital |
| `reward_prestige_scale` | 0.15 | How much prestige amplifies rewards |
| `salary_bump_pct` | 1.0 | Per-completion salary increase |

### Prestige Tuning

| Parameter | Default | Effect |
|-----------|---------|--------|
| `prestige_decay_per_day` | 0.005 | Daily prestige loss |
| `penalty_fail_multiplier` | 0.8 | Prestige cost of late completion |
| `penalty_cancel_multiplier` | 1.0 | Prestige cost of cancellation |
| `prestige_min` | 1.0 | Floor value |
| `prestige_max` | 10.0 | Ceiling value |

### Task Tuning

| Parameter | Default | Effect |
|-----------|---------|--------|
| `deadline_qty_per_day` | 11.0 | Deadline generosity |
| `num_domains_per_task` | 2-4 | Multi-domain complexity |
| `progress_milestone_pct` | 50 | When to fire halfway event |

### Agent Tuning

| Parameter | Default | Effect |
|-----------|---------|--------|
| `max_turns` | 500 | Hard turn limit |
| `max_turns_without_resume` | 5 | Auto-resume threshold |
| `history_truncation` | 50 | Turns kept in context |
232
system_design/10_runner_orchestration.md
Normal file
# Runner & Orchestration

**Location**: `src/yc_bench/runner/`

## Overview

The runner is the top-level orchestration layer that ties everything together: parsing arguments, loading configuration, initializing the database, seeding the world, starting the agent loop, and collecting results.

## Components

### Entry Point (`main.py`)

```python
def run_benchmark(args):
    # 1. Load configuration
    cfg = load_config(args.config)

    # 2. Initialize database
    engine, factory = init_db(db_path)

    # 3. Seed world
    with session_scope(factory) as session:
        seed_world_transactional(session, cfg, args.seed)

    # 4. Build agent runtime
    runtime = build_runtime(cfg.agent, args.model)

    # 5. Start dashboard (if TTY)
    dashboard = Dashboard(cfg) if is_tty() else None

    # 6. Run agent loop
    result = run_agent_loop(runtime, factory, cfg, dashboard)

    # 7. Save results
    save_rollout(result, args.output)
```

### Design Choices

#### Single-Command Invocation

```bash
uv run yc-bench run --model gemini/gemini-3-flash --seed 1 --config medium
```

**Why single command?** Benchmarks should be easy to reproduce. One command with explicit parameters (model, seed, config) fully specifies a run.

#### Database Per Run

Each run creates a fresh SQLite database:

```
db/run_seed1_medium_2025-03-15.sqlite
```

**Why per-run databases?**

- Isolation: runs can't interfere with each other
- Inspection: can analyze any run's final state after the fact
- Reproducibility: re-running with same seed produces identical database
- Parallelism: multiple runs can execute simultaneously

## Argument Parsing (`args.py`)

### Key Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| `--model` | Yes | LLM model identifier (LiteLLM format) |
| `--seed` | Yes | Random seed for world generation |
| `--config` | No | Difficulty preset (default: "medium") |
| `--output` | No | Output path for rollout JSON |
| `--no-dashboard` | No | Disable live terminal UI |
| `--max-turns` | No | Override turn limit |

**Design choice**: Required arguments are minimal (model + seed). Everything else has sensible defaults. This reduces barrier to running benchmarks while allowing full customization.
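A minimal `argparse` sketch consistent with the table above (flag names come from the table; defaults and help strings beyond it are assumptions):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="yc-bench run")
    p.add_argument("--model", required=True,
                   help="LLM model identifier (LiteLLM format)")
    p.add_argument("--seed", required=True, type=int,
                   help="Random seed for world generation")
    p.add_argument("--config", default="medium", help="Difficulty preset")
    p.add_argument("--output", default=None, help="Output path for rollout JSON")
    p.add_argument("--no-dashboard", action="store_true",
                   help="Disable live terminal UI")
    p.add_argument("--max-turns", type=int, default=None,
                   help="Override turn limit")
    return p

args = build_parser().parse_args(["--model", "openai/gpt-4o", "--seed", "42"])
assert args.config == "medium" and args.seed == 42
```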

## Dashboard (`dashboard.py`)

### Live Terminal UI

The dashboard uses [Rich](https://github.com/Textualize/rich) to display real-time simulation state:

```
┌─ YC-Bench Dashboard ──────────────────────────────┐
│ Model: claude-sonnet-4  Seed: 42  Config: medium  │
│ Turn: 87/500            Sim Time: 2025-06-15      │
├───────────────────────────────────────────────────┤
│ Funds: $125,340         Runway: 4.2 months        │
│ Prestige: R:5.2  I:3.8  D:2.1  T:6.4              │
│ Active Tasks: 3   Completed: 12   Failed: 1       │
├───────────────────────────────────────────────────┤
│ Last Action: task assign abc123 emp456            │
│ Last Event: task_completed (success)              │
└───────────────────────────────────────────────────┘
```

**Design choice**: The dashboard is for human observers, not the agent. It provides real-time visibility into benchmark runs without affecting agent behavior.

### Features

- Live fund tracking with trend indicators
- Prestige levels per domain
- Task status counters
- Recent agent actions
- Turn counter and simulation clock
- Auto-refreshes on each turn

### Conditional Activation

The dashboard only activates when running in a TTY (interactive terminal). Redirected output or CI environments get plain log output.

**Why conditional?** Batch runs (scripts/) shouldn't have terminal UI overhead. Detecting TTY ensures the right output mode automatically.
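The detection itself is one line of stdlib; a sketch of what `is_tty()` might look like (the `CI` environment check is an assumption, not confirmed from the source):

```python
import os
import sys

def is_tty() -> bool:
    """Use the live dashboard only in an interactive terminal."""
    return sys.stdout.isatty() and os.environ.get("CI") is None
```

Piping output (`yc-bench run ... > log.txt`) makes `isatty()` return False, so batch scripts fall back to plain logs with no extra flags.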

## Session Management (`session.py`)

### Run Session

Manages the lifecycle of a single benchmark run:

```python
class RunSession:
    db_path: str
    config: ExperimentConfig
    model: str
    seed: int
    start_time: datetime

    def save_rollout(self, result):
        """Save final rollout JSON to results/"""

    def cleanup(self):
        """Clean up temporary resources"""
```

**Design choice**: Session object encapsulates all run-specific state, making it easy to serialize and manage runs.

## Batch Running (`scripts/`)

### Multi-Seed Runs

Scripts for running the same model across multiple seeds:

```bash
# Run seeds 1-10 with claude-sonnet on medium difficulty
for seed in $(seq 1 10); do
  uv run yc-bench run --model anthropic/claude-sonnet-4-20250514 --seed "$seed" --config medium
done
```

### Multi-Model Comparison

Scripts for comparing models on the same seeds:

```bash
for model in "anthropic/claude-sonnet-4-20250514" "openai/gpt-4o" "google/gemini-pro"; do
  uv run yc-bench run --model "$model" --seed 42 --config medium
done
```

**Design choice**: Simple shell scripts rather than a complex orchestration framework. This keeps the benchmark tooling minimal and transparent.

## Results & Output

### Rollout JSON

Each run produces a rollout file:

```
results/
├── claude-sonnet_seed1_medium.json
├── claude-sonnet_seed2_medium.json
├── gpt-4o_seed1_medium.json
└── ...
```

### Rollout Contents

```json
{
  "metadata": {
    "model": "anthropic/claude-sonnet-4-20250514",
    "seed": 1,
    "config": "medium",
    "start_time": "2025-03-15T10:00:00",
    "end_time": "2025-03-15T10:45:00"
  },
  "outcome": "horizon_end",
  "final_state": {
    "funds_cents": 25000000,
    "prestige": {"research": 7.2, "inference": 5.1, ...},
    "tasks_completed": 24,
    "tasks_failed": 3,
    "tasks_cancelled": 1,
    "turns_used": 187
  },
  "transcript": [
    {"turn": 1, "action": "company status", "result": {...}},
    ...
  ]
}
```
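Writing such a rollout is a straightforward serialization step; a hypothetical sketch of what `save_rollout` might do (path handling and `indent` are assumptions):

```python
import json
from pathlib import Path

def save_rollout(result: dict, output_path: str) -> Path:
    """Serialize the rollout dict to pretty-printed JSON on disk."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # default=str handles datetimes and other non-JSON-native values
    path.write_text(json.dumps(result, indent=2, default=str))
    return path
```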

### Plots (`plots/`)

Visualization scripts for comparing model performance:

- Funds over time
- Prestige progression per domain
- Task completion rates
- Comparison charts across models/seeds

**Design choice**: Separate plotting from the benchmark runner. Results are stored as data (JSON); visualization is a post-processing step.

## Error Recovery

### Crash Recovery

If a run crashes (LLM timeout, OOM, etc.):

- The SQLite database persists with the last consistent state
- Rollout JSON may be partial but includes the transcript up to the crash
- Re-running with the same seed starts fresh (no resume from crash)

**Design choice**: No crash recovery by design. Benchmark runs should be atomic -- either complete or re-run. This prevents partial results from contaminating comparisons.

### Graceful Shutdown

On SIGINT (Ctrl+C):

- The current turn completes
- A partial rollout is saved
- The database is committed
- The dashboard is cleaned up

**Design choice**: Graceful shutdown preserves whatever data exists, useful for debugging long runs that need to be interrupted.
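A common way to get "current turn completes" semantics is to trap SIGINT into a flag the loop checks between turns, rather than letting the default handler raise mid-turn. This sketch is an assumption about the mechanism, not the project's actual code:

```python
import signal

class GracefulShutdown:
    """Set a flag on SIGINT so the current turn can finish before exit."""

    def __init__(self):
        self.requested = False
        signal.signal(signal.SIGINT, self._handle)

    def _handle(self, signum, frame):
        # Don't raise here; just record the request. The agent loop
        # checks the flag at turn boundaries, then saves and commits.
        self.requested = True

# Hypothetical use in the agent loop:
# shutdown = GracefulShutdown()
# while turn < max_turns and not shutdown.requested:
#     run_turn(...)
# save_rollout(partial_result, output)   # partial rollout preserved
```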