diff --git a/system_design/00_overview.md b/system_design/00_overview.md new file mode 100644 index 0000000..1d1a7d0 --- /dev/null +++ b/system_design/00_overview.md @@ -0,0 +1,98 @@ +# YC-Bench: System Overview + +## What is YC-Bench? + +YC-Bench is a **long-horizon deterministic benchmark for LLM agents**. It simulates an AI startup CEO managing a company over 1-3 years through a CLI-based interface against a SQLite-backed discrete-event simulation engine. The benchmark tests sustained decision-making over hundreds of turns through compounding financial, prestige, and deadline pressures. + +## Core Premise + +An LLM agent is dropped into the role of CEO of a small AI startup. It must: + +- Browse and accept tasks from a marketplace +- Assign employees to tasks across 4 technical domains +- Manage cash flow (payroll, rewards, penalties) +- Build prestige in each domain to unlock higher-tier tasks +- Survive until the simulation horizon ends without going bankrupt + +## Key Metrics (~4,975 lines of Python) + +| Dimension | Details | +|-----------|---------| +| Employees | 10 (hidden per-domain skill rates) | +| Market Tasks | 200+ (configurable) | +| Domains | 4: research, inference, data_environment, training | +| Prestige Range | 1.0 - 10.0 per domain | +| Difficulty Presets | tutorial, easy, medium, hard, nightmare | + +## High-Level Architecture + +``` +┌─────────────────────────────────────────────────────┐ +│ Runner / CLI │ +│ (argument parsing, dashboard, session management) │ +├─────────────────────────────────────────────────────┤ +│ Agent Layer │ +│ (LLM runtime, agent loop, tools, prompt building) │ +├─────────────────────────────────────────────────────┤ +│ CLI Command Interface │ +│ (company, employee, market, task, sim, finance, │ +│ report, scratchpad) │ +├─────────────────────────────────────────────────────┤ +│ Simulation Engine (core/) │ +│ (event processing, ETA solving, progress tracking, │ +│ business time, prestige decay) │ +├─────────────────────────────────────────────────────┤ +│ Data Layer (db/) │ +│ (SQLAlchemy ORM models, session management) │ +├─────────────────────────────────────────────────────┤ +│ Configuration & World Generation │ +│ (Pydantic schemas, TOML presets, seeding, RNG) │ +└─────────────────────────────────────────────────────┘ +``` + +## Directory Map + +``` +~/yc_bench_fixed/ +├── src/yc_bench/ +│ ├── __main__.py # CLI entry point +│ ├── agent/ # Agent runtime and loop +│ ├── cli/ # Agent-facing CLI commands +│ ├── core/ # Simulation engine +│ ├── db/ # ORM models & session +│ ├── config/ # Pydantic schemas + TOML presets +│ ├── services/ # World generation & RNG +│ └── runner/ # Benchmark orchestration +├── scripts/ # Batch running scripts +├── db/ # SQLite databases (runtime) +├── results/ # Output JSON rollouts +├── plots/ # Result visualizations +├── pyproject.toml # Package definition (uv-based) +└── uv.lock # Lock file +``` + +## Execution Flow + +1. User runs: `uv run yc-bench run --model --seed 1 --config medium` +2. Runner loads config, initializes DB, seeds world, starts agent loop +3. Agent receives system prompt with company context and available CLI tools +4. Each turn: agent calls CLI commands via `run_command` tool, optionally `python_repl` +5. Agent calls `yc-bench sim resume` to advance simulation time +6. Simulation processes events (completions, payroll, milestones) and returns wake events +7. Loop continues until bankruptcy or horizon end +8. 
Output: rollout JSON transcript + SQLite game state + +## Design Documents + +| File | Topic | +|------|-------| +| [01_simulation_engine.md](01_simulation_engine.md) | Core simulation engine and event processing | +| [02_data_models.md](02_data_models.md) | Database schema and ORM design | +| [03_task_system.md](03_task_system.md) | Task lifecycle, ETA, and progress | +| [04_prestige_system.md](04_prestige_system.md) | Prestige mechanics, decay, and gating | +| [05_financial_model.md](05_financial_model.md) | Funds, payroll, ledger, and bankruptcy | +| [06_employee_model.md](06_employee_model.md) | Employee skills, throughput, and growth | +| [07_agent_layer.md](07_agent_layer.md) | LLM runtime, agent loop, and tools | +| [08_cli_interface.md](08_cli_interface.md) | CLI command groups and JSON output | +| [09_configuration.md](09_configuration.md) | Config schema, presets, and world generation | +| [10_runner_orchestration.md](10_runner_orchestration.md) | Benchmark runner, dashboard, and session | diff --git a/system_design/01_simulation_engine.md b/system_design/01_simulation_engine.md new file mode 100644 index 0000000..836a130 --- /dev/null +++ b/system_design/01_simulation_engine.md @@ -0,0 +1,147 @@ +# Simulation Engine + +**Location**: `src/yc_bench/core/` + +## Design Choice: Discrete-Event Simulation + +YC-Bench uses a **discrete-event simulation (DES)** model rather than a tick-based approach. This was chosen because: + +1. **Determinism**: Events are processed in a fixed, reproducible order given the same seed +2. **Efficiency**: Time jumps between events rather than iterating every hour/day +3. **Clarity**: Each state change corresponds to a meaningful event, making the simulation auditable + +## Core Loop (`engine.py`) + +The `advance_time()` function is the heart of the simulation: + +``` +advance_time(session, company_id, cfg) → AdvanceResult +``` + +### Algorithm + +1. **Flush progress** on all active tasks (convert elapsed business hours into completed work) +2. **Apply prestige decay** for elapsed days +3. **Process payroll** if crossing a month boundary (first business day) +4. **Fetch next unconsumed event** ordered by `(scheduled_at, priority)` +5. **Dispatch to handler** based on event type +6. **Recalculate ETAs** for affected tasks +7. **Update sim_time** to the event's timestamp +8. **Return wake events** to the agent + +### Why "Resume" Rather Than Auto-Advance? + +The agent explicitly calls `yc-bench sim resume` to advance time. This design: + +- Gives the agent control over pacing (plan before advancing) +- Creates a natural decision checkpoint between simulation steps +- Allows multiple CLI queries before committing to advancing +- If the agent stalls (N turns without resume), the loop forces one automatically + +## Event System (`events.py`) + +### Event Types (Priority Order) + +| Priority | Event Type | Trigger | +|----------|-----------|---------| +| 1 | `task_completed` | Task reaches 100% in all domain requirements | +| 2 | `bankruptcy` | Funds drop below zero after payroll | +| 3 | `task_half` | Task reaches 50% progress milestone | +| 4 | `horizon_end` | Simulation time limit reached | + +### Design Choice: Fixed Priority Ordering + +Events at the same timestamp are processed in strict priority order. 
This ensures: + +- Task completions (and their rewards) are processed before bankruptcy checks +- A task finishing on the same day as payroll can save the company from bankruptcy +- Deterministic behavior regardless of insertion order + +### Event Identity (Deterministic UUIDs) + +Event IDs use `uuid5` based on payload + timestamp + dedupe_key. This means: + +- Same world state produces identical event IDs +- Deduplication is automatic (re-inserting same event is a no-op) +- Full reproducibility across runs with same seed + +## Event Handlers (`handlers/`) + +### `task_complete.py` +- Finalizes all domain progress to 100% +- Success check: `sim_time <= deadline` +- On success: add reward funds, add prestige per domain, boost employee skill rates, apply 1% salary bump +- On failure (late): apply prestige penalty per domain (configurable multiplier) + +### `task_half.py` +- Marks progress milestone reached +- Informational event for agent awareness (no state changes beyond flag) + +### `bankruptcy.py` +- Triggered when `funds_cents < 0` after payroll +- Terminates the simulation with bankruptcy outcome + +### `horizon_end.py` +- Triggered at configured simulation end date +- Terminates the simulation with final scoring + +## Progress Tracking (`progress.py`) + +### Effective Rate Calculation + +``` +effective_rate = base_rate_per_hour / num_active_tasks_for_this_employee +``` + +**Design choice**: Throughput splitting creates a resource allocation puzzle. An employee assigned to 3 tasks works at 1/3 speed on each. The agent must balance parallelism vs. focus. + +### Progress Flush + +When `advance_time()` runs, it calculates work done since the last flush: + +``` +work = effective_rate × business_hours_elapsed +completed_qty += work (capped at required_qty) +``` + +## Business Time (`business_time.py`) + +### Design Choice: Business Hours Only + +Work only happens during business hours (weekdays, configurable hours per day). This adds: + +- Realistic scheduling constraints +- Weekend gaps that affect deadline calculations +- A reason for the agent to think about calendar timing + +## ETA Solver (`eta.py`) + +### Completion Time + +``` +solve_task_completion_time(): + For each domain d: + remaining[d] = required_qty[d] - completed_qty[d] + rate[d] = sum(effective_rate for assigned employees with skill in d) + time[d] = remaining[d] / rate[d] + completion_time = max(time[d]) across all domains +``` + +### Design Choice: Multi-Domain Bottleneck + +A task completes when ALL domains finish. The slowest domain determines completion time. This creates interesting assignment puzzles where the agent must identify and address bottlenecks. + +### Halfway Time + +Used for progress milestone events. Calculated as the weighted midpoint across domains. + +## Prestige Decay + +``` +apply_prestige_decay(session, company_id, days_elapsed, cfg): + for each domain: + prestige -= decay_per_day × days_elapsed + prestige = max(prestige, prestige_min) # floor at 1.0 +``` + +**Design choice**: Decay prevents "set and forget" strategies. The agent must continuously work in domains to maintain access to high-tier tasks. Neglected domains revert to baseline. 
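+As a standalone illustration of the decay arithmetic, the sketch below replays the formula above over a multi-day gap. The constants mirror the documented defaults; the function name is illustrative, not the engine's actual API.
+
+```python
+# Illustrative defaults from this document; real values come from the config presets.
+DECAY_PER_DAY = 0.005
+PRESTIGE_MIN = 1.0
+
+def decay_prestige(prestige: float, days_elapsed: float) -> float:
+    """Linear daily decay with a hard floor, as described above."""
+    return max(prestige - DECAY_PER_DAY * days_elapsed, PRESTIGE_MIN)
+
+print(decay_prestige(4.2, 90))  # 3.75 -- a neglected domain drifts down
+print(decay_prestige(1.1, 90))  # 1.0  -- clamped at the floor
+```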
diff --git a/system_design/02_data_models.md b/system_design/02_data_models.md new file mode 100644 index 0000000..51e7847 --- /dev/null +++ b/system_design/02_data_models.md @@ -0,0 +1,190 @@ +# Data Models & Database Design + +**Location**: `src/yc_bench/db/` + +## Design Choice: SQLAlchemy ORM with SQLite + +The benchmark uses SQLAlchemy's declarative ORM over SQLite for several reasons: + +1. **Single-file persistence**: SQLite stores the entire game state in one file, making runs portable and inspectable +2. **Transactional safety**: ACID guarantees prevent partial state updates +3. **Query flexibility**: SQL allows complex queries for financial reports, task filtering, etc. +4. **Dual-backend support**: The same ORM works with PostgreSQL via `DATABASE_URL` environment variable for production/scaling scenarios + +## Schema Overview + +``` +┌──────────────┐ ┌───────────────────┐ +│ Company │────<│ CompanyPrestige │ (1 per domain × company) +└──────┬───────┘ └───────────────────┘ + │ + ├────<┌──────────────┐ ┌──────────────────┐ + │ │ Employee │────<│ EmployeeSkillRate │ (1 per domain × employee) + │ └──────┬───────┘ └──────────────────┘ + │ │ + │ │ ┌────────────────┐ + │ └───<│ TaskAssignment │ (employee ↔ task junction) + │ └────────┬───────┘ + │ │ + ├────<┌──────────┐────────┘ + │ │ Task │────<┌─────────────────┐ + │ └──────────┘ │ TaskRequirement │ (1 per domain × task) + │ └─────────────────┘ + │ + ├────<┌──────────────┐ + │ │ SimEvent │ (discrete events queue) + │ └──────────────┘ + │ + ├────<┌──────────────┐ + │ │ LedgerEntry │ (financial transactions) + │ └──────────────┘ + │ + ├────<┌──────────────┐ + │ │ SimState │ (simulation clock & counters) + │ └──────────────┘ + │ + └────<┌──────────────┐ + │ Scratchpad │ (agent persistent memory) + └──────────────┘ +``` + +## Model Details + +### Company (`models/company.py`) + +| Column | Type | Notes | +|--------|------|-------| +| `id` | UUID (PK) | Auto-generated | +| `name` | String | Company name | +| `funds_cents` | BigInteger | Financial balance in cents | + +**Design choice**: Funds stored in cents (integer) to avoid floating-point rounding errors in financial calculations. BigInteger supports very large/negative values. + +### CompanyPrestige (`models/company.py`) + +| Column | Type | Notes | +|--------|------|-------| +| `company_id` | UUID (FK) | References Company | +| `domain` | String | research / inference / data_environment / training | +| `prestige_level` | Float | Range [1.0, 10.0] | + +**Design choice**: Prestige is tracked per-domain rather than as a single score. This forces specialization trade-offs and creates a 4-dimensional progression space. + +### Employee (`models/employee.py`) + +| Column | Type | Notes | +|--------|------|-------| +| `id` | UUID (PK) | Auto-generated | +| `company_id` | UUID (FK) | References Company | +| `name` | String | Employee name | +| `tier` | String | junior / mid / senior | +| `work_hours_per_day` | Float | Hours available per business day | +| `salary_cents` | BigInteger | Monthly salary in cents | + +### EmployeeSkillRate (`models/employee.py`) + +| Column | Type | Notes | +|--------|------|-------| +| `employee_id` | UUID (FK) | References Employee | +| `domain` | String | One of 4 domains | +| `rate_domain_per_hour` | Float | Work units produced per hour | + +**Design choice**: Skill rates are **hidden from the agent**. The agent sees tier and salary but not per-domain effectiveness. 
This creates an information asymmetry puzzle -- the agent must infer employee strengths from task outcomes. + +### Task (`models/task.py`) + +| Column | Type | Notes | +|--------|------|-------| +| `id` | UUID (PK) | Auto-generated | +| `company_id` | UUID (FK, nullable) | NULL = market task, set on acceptance | +| `status` | Enum | market → planned → active → completed_success / completed_fail / cancelled | +| `title` | String | Task description | +| `required_prestige` | Float | Minimum prestige needed in ALL task domains | +| `reward_funds_cents` | BigInteger | Payment on successful completion | +| `reward_prestige_delta` | Float | Prestige gained per domain on success | +| `skill_boost_pct` | Float | Employee skill rate increase on success | +| `accepted_at` | DateTime (nullable) | When task was accepted from market | +| `deadline` | DateTime (nullable) | Calculated at acceptance | +| `completed_at` | DateTime (nullable) | When task finished | +| `success` | Boolean (nullable) | True = on-time, False = late | +| `progress_milestone_pct` | Float | Tracks progress milestones (e.g., 50%) | + +**Design choice**: `company_id` being nullable elegantly distinguishes market tasks (available for browsing) from accepted tasks (owned by the company). + +### TaskRequirement (`models/task.py`) + +| Column | Type | Notes | +|--------|------|-------| +| `task_id` | UUID (FK) | References Task | +| `domain` | String | Which domain this requirement covers | +| `required_qty` | Float | Total work units needed | +| `completed_qty` | Float | Work units completed so far | + +**Design choice**: Multi-domain requirements make tasks a multi-dimensional optimization problem. A task might need work in 2-4 domains simultaneously. + +### TaskAssignment (`models/task.py`) + +| Column | Type | Notes | +|--------|------|-------| +| `task_id` | UUID (FK) | References Task | +| `employee_id` | UUID (FK) | References Employee | +| `assigned_at` | DateTime | When assigned | + +**Design choice**: Many-to-many junction table. An employee can work on multiple tasks (throughput splits), and a task can have multiple employees (parallel progress). + +### SimEvent (`models/event.py`) + +| Column | Type | Notes | +|--------|------|-------| +| `id` | UUID (PK) | Deterministic (uuid5) | +| `company_id` | UUID (FK) | References Company | +| `event_type` | String | task_completed / bankruptcy / task_half / horizon_end | +| `scheduled_at` | DateTime | When event triggers | +| `payload` | JSON | Event-specific data | +| `dedupe_key` | String | Prevents duplicate events | +| `consumed` | Boolean | True after processing | + +### LedgerEntry (`models/ledger.py`) + +| Column | Type | Notes | +|--------|------|-------| +| `id` | UUID (PK) | Auto-generated | +| `company_id` | UUID (FK) | References Company | +| `occurred_at` | DateTime | Transaction timestamp | +| `category` | Enum | MONTHLY_PAYROLL / TASK_REWARD / TASK_FAIL_PENALTY / TASK_CANCEL_PENALTY | +| `amount_cents` | BigInteger | Signed amount (negative = cost) | +| `ref_type` | String (nullable) | Reference entity type | +| `ref_id` | UUID (nullable) | Reference entity ID | + +**Design choice**: Immutable append-only ledger provides a complete financial audit trail. No entries are ever deleted or modified. 
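+One concrete payoff of the append-only design is point-in-time balance reconstruction: replaying the ledger up to any timestamp yields the balance at that moment. A minimal sketch, with invented entries and an invented opening balance:
+
+```python
+from datetime import datetime
+
+# Hypothetical ledger rows as (occurred_at, amount_cents); negative = cost.
+entries = [
+    (datetime(2025, 1, 1), -3_000_000),   # MONTHLY_PAYROLL
+    (datetime(2025, 1, 20), 5_000_000),   # TASK_REWARD
+    (datetime(2025, 2, 3), -3_030_000),   # MONTHLY_PAYROLL after salary bumps
+]
+
+def balance_at(as_of: datetime, opening_cents: int = 50_000_000) -> int:
+    """Reconstruct the balance at any point in time by replaying the ledger."""
+    return opening_cents + sum(amount for ts, amount in entries if ts <= as_of)
+
+print(balance_at(datetime(2025, 1, 31)))  # 52000000 cents = $520,000.00
+```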
+ +### SimState (`models/sim_state.py`) + +| Column | Type | Notes | +|--------|------|-------| +| `company_id` | UUID (FK, PK) | References Company | +| `sim_time` | DateTime | Current simulation clock | +| `run_seed` | Integer | RNG seed for reproducibility | +| `horizon_end` | DateTime | When simulation ends | +| `replenish_counter` | Integer | Tracks market task replenishment | + +### Scratchpad (`models/scratchpad.py`) + +| Column | Type | Notes | +|--------|------|-------| +| `company_id` | UUID (FK) | References Company | +| `content` | Text | Free-form agent notes | + +**Design choice**: Scratchpad survives LLM context truncation, giving the agent persistent memory across the full simulation. + +## Session Management (`session.py`) + +```python +session_scope(factory) → context manager +``` + +- Creates a scoped session with automatic commit/rollback +- Supports both SQLite (default) and PostgreSQL (via `DATABASE_URL`) +- `init_db()` creates all tables from ORM metadata + +**Design choice**: Context manager pattern ensures every database operation is properly transacted, preventing partial state updates that would corrupt the simulation. diff --git a/system_design/03_task_system.md b/system_design/03_task_system.md new file mode 100644 index 0000000..95032be --- /dev/null +++ b/system_design/03_task_system.md @@ -0,0 +1,144 @@ +# Task System + +**Location**: `src/yc_bench/cli/task_commands.py`, `src/yc_bench/core/eta.py`, `src/yc_bench/core/progress.py` + +## Task Lifecycle + +``` +market ──accept──> planned ──dispatch──> active ──complete──> completed_success + │ │ completed_fail + │ │ + └──cancel──> cancelled <──cancel──┘ +``` + +### States + +| Status | Meaning | +|--------|---------| +| `market` | Available for browsing, not yet accepted | +| `planned` | Accepted by company, employees can be assigned | +| `active` | Dispatched, work is progressing | +| `completed_success` | Finished on time | +| `completed_fail` | Finished late (past deadline) | +| `cancelled` | Abandoned by agent | + +## Design Choices + +### Two-Phase Activation (Accept → Dispatch) + +Tasks go through `planned` before `active`. This separation: + +1. **Allows pre-assignment**: Agent can assign employees before starting the clock +2. **Deadline starts at accept**: Creates urgency -- planning time counts against the deadline +3. **Forces commitment**: Accepting a task reserves it but the agent must still dispatch + +### Deadline Calculation + +``` +deadline = accepted_at + max(required_qty[d] for all domains d) / deadline_qty_per_day +``` + +**Design choice**: Deadline is proportional to the largest single-domain requirement, not the sum. This means multi-domain tasks don't get proportionally more time -- they require parallel work. + +### Prestige Gating at Accept Time + +```python +def task_accept(task_id): + for domain in task.requirements: + if company_prestige[domain] < task.required_prestige: + reject("Insufficient prestige in {domain}") +``` + +**Design choice**: Prestige check is per-domain. A task requiring prestige 3.0 with requirements in `research` and `inference` needs prestige >= 3.0 in BOTH domains. This prevents gaming by maxing one domain. + +### Cancel Penalties + +Cancelling an active task incurs: +- Prestige penalty: `reward_prestige_delta × cancel_multiplier` (configurable per difficulty) +- No financial penalty (just lost opportunity) + +**Design choice**: Cancel penalties prevent the strategy of accepting everything and dropping what's inconvenient. 
Higher difficulties increase the cancel multiplier. + +## Employee Assignment + +### Assignment Rules + +- Employees can only be assigned to `planned` or `active` tasks +- An employee can work on multiple tasks simultaneously (throughput splits) +- Multiple employees can work on the same task (parallel progress) + +### Throughput Splitting + +``` +effective_rate = base_rate_per_hour / num_active_tasks +``` + +**Design choice**: Linear throughput splitting creates a fundamental trade-off: +- **Focus**: 1 employee on 1 task = full speed +- **Parallel**: 1 employee on 3 tasks = 1/3 speed each +- The agent must decide between fast completion of few tasks vs. slow progress on many + +## Progress Tracking (`progress.py`) + +### How Work Gets Done + +Progress is calculated lazily during `advance_time()`: + +```python +for each active task: + for each assigned employee: + for each domain in task requirements: + work = employee.skill_rate[domain] / num_active_tasks × business_hours + requirement.completed_qty += work + requirement.completed_qty = min(completed_qty, required_qty) +``` + +### Multi-Domain Completion + +A task is complete when ALL domain requirements reach `completed_qty >= required_qty`. The slowest domain is the bottleneck. + +**Design choice**: This creates interesting optimization puzzles. If a task needs 100 units of research and 50 units of training, the agent should allocate more research-skilled employees to balance completion times. + +## ETA Solver (`eta.py`) + +### Completion Time Calculation + +```python +def solve_task_completion_time(task, assignments, sim_time): + for each domain d: + remaining = required_qty[d] - completed_qty[d] + rate = sum(effective_rate[emp][d] for emp in assignments) + if rate == 0: + return infinity # no one can work on this domain + hours_needed[d] = remaining / rate + + max_hours = max(hours_needed.values()) + return sim_time + max_hours (in business hours) +``` + +### Halfway Time Calculation + +Used for milestone events. Finds the time when weighted average across domains reaches 50%. + +### When ETAs Are Recalculated + +- Task dispatched (new active task) +- Employee assigned/unassigned +- Task completed (frees employee throughput for other tasks) +- Task cancelled (same) + +**Design choice**: Dynamic ETA recalculation ensures events are always accurate. When an employee is reassigned, all affected tasks get new completion projections. + +## Market Task Generation + +See [09_configuration.md](09_configuration.md) for details on how market tasks are generated with stratified prestige distribution and randomized requirements. + +### Browsing and Filtering + +The `market browse` command supports: +- Domain filter +- Prestige range filter +- Reward range filter +- Pagination (offset/limit) + +All output is JSON for agent consumption. diff --git a/system_design/04_prestige_system.md b/system_design/04_prestige_system.md new file mode 100644 index 0000000..32ea160 --- /dev/null +++ b/system_design/04_prestige_system.md @@ -0,0 +1,123 @@ +# Prestige System + +**Location**: `src/yc_bench/db/models/company.py` (CompanyPrestige), `src/yc_bench/core/engine.py` (decay), `src/yc_bench/core/handlers/task_complete.py` (rewards/penalties) + +## Overview + +Prestige is YC-Bench's core progression mechanic. It controls access to higher-tier tasks (which offer better rewards) and decays over time, forcing continuous engagement. 
+ +## Design Choices + +### Per-Domain Prestige (4 Independent Tracks) + +``` +research: ████████░░ (8.0) +inference: ██████░░░░ (6.0) +data_environment: ███░░░░░░░ (3.0) +training: █████░░░░░ (5.0) +``` + +**Why 4 domains?** This creates a 4-dimensional strategic space: +- The agent can't max all domains simultaneously (decay + limited employees) +- Specialization unlocks high-tier tasks in 1-2 domains +- Diversification provides resilience but slower progression +- Multi-domain tasks require balanced prestige across their domains + +### Prestige Range: [1.0, 10.0] + +| Level | Meaning | +|-------|---------| +| 1.0 | Minimum (starting/decayed) | +| 3.0-4.0 | Mid-tier tasks accessible | +| 7.0-8.0 | High-tier tasks accessible | +| 10.0 | Maximum (hard cap) | + +**Design choice**: The 1-10 range is intuitive and provides enough granularity for meaningful gating tiers without over-complicating the system. + +## Prestige Gain + +On successful task completion (on-time): + +``` +for each domain in task.requirements: + company_prestige[domain] += task.reward_prestige_delta + company_prestige[domain] = min(prestige, 10.0) # cap +``` + +**Design choice**: Prestige gain is per-domain and tied to the task's requirements. Completing a research+inference task only boosts those two domains, not training or data_environment. + +### Prestige Scaling of Rewards + +``` +actual_reward = base_reward × (1 + reward_prestige_scale × (prestige - 1)) +``` + +Higher prestige in a domain means better financial returns from tasks in that domain. This creates a virtuous cycle: more prestige → more money → more capacity → more prestige. + +## Prestige Loss + +### Decay (Daily) + +``` +prestige -= decay_per_day × days_elapsed +prestige = max(prestige, 1.0) # floor +``` + +Default decay rate: -0.005/day. This is slow enough to not punish short gaps but fast enough that inactive domains eventually return to baseline. + +**Design choice**: Continuous decay prevents "build once, exploit forever" strategies. The agent must continuously complete tasks in a domain to maintain access. + +### Failure Penalty + +On late task completion: + +``` +for each domain in task.requirements: + company_prestige[domain] -= task.reward_prestige_delta × fail_multiplier + company_prestige[domain] = max(prestige, 1.0) +``` + +Default `fail_multiplier`: 0.8. Late completion costs almost as much prestige as success would have gained. + +### Cancel Penalty + +On task cancellation: + +``` +for each domain in task.requirements: + company_prestige[domain] -= task.reward_prestige_delta × cancel_multiplier + company_prestige[domain] = max(prestige, 1.0) +``` + +Cancel multipliers vary by difficulty (higher on hard/nightmare). + +## Prestige Gating + +Tasks have a `required_prestige` field. At task acceptance: + +```python +for domain in task.requirements: + if company_prestige[domain] < task.required_prestige: + reject() # must meet prestige in ALL task domains +``` + +**Design choice**: Per-domain gating means a task with `required_prestige=5.0` and requirements in research + training needs prestige >= 5.0 in BOTH research AND training. This prevents gaming. + +### Stratified Market Tasks + +The first 10 market tasks are always prestige-1 (accessible immediately). Higher prestige tasks are introduced with stratified distribution. 
This ensures: + +- The agent always has something to work on initially +- Progression is visible (new tasks unlock as prestige grows) +- No dead-end states where the agent can't accept any task + +## Strategic Implications + +The prestige system creates several key strategic tensions: + +1. **Specialize vs. Diversify**: Focus on 1-2 domains for deep access, or spread across all 4? +2. **Risk vs. Reward**: High-prestige tasks pay more but failure costs more prestige +3. **Maintenance vs. Growth**: Should the agent keep working in mastered domains (maintenance) or push new ones (growth)? +4. **Accept vs. Defer**: Taking a task you might fail risks prestige loss; waiting risks decay + +These tensions make the benchmark more than just "do tasks fast" -- it tests genuine strategic reasoning. diff --git a/system_design/05_financial_model.md b/system_design/05_financial_model.md new file mode 100644 index 0000000..9bba2b1 --- /dev/null +++ b/system_design/05_financial_model.md @@ -0,0 +1,162 @@ +# Financial Model + +**Location**: `src/yc_bench/db/models/ledger.py`, `src/yc_bench/cli/finance_commands.py`, `src/yc_bench/cli/report_commands.py`, `src/yc_bench/core/handlers/` + +## Overview + +The financial model simulates a startup's cash flow: revenue from completed tasks, costs from employee payroll, and penalties for failures. Running out of money triggers bankruptcy and ends the simulation. + +## Design Choices + +### Cents-Based Integer Arithmetic + +All financial values are stored as `BigInteger` in cents: + +``` +$1,000.00 = 100_000 cents +``` + +**Why cents?** Floating-point arithmetic introduces rounding errors that compound over hundreds of transactions. Integer cents guarantee exact financial accounting -- critical for a deterministic benchmark. + +### Immutable Append-Only Ledger + +Every financial transaction creates a `LedgerEntry` that is never modified or deleted: + +```python +class LedgerEntry: + category: MONTHLY_PAYROLL | TASK_REWARD | TASK_FAIL_PENALTY | TASK_CANCEL_PENALTY + amount_cents: int # negative for costs, positive for revenue + occurred_at: datetime + ref_type: str # optional reference to source entity + ref_id: UUID # optional reference ID +``` + +**Why immutable?** An append-only ledger provides: +- Complete audit trail for debugging +- Ability to reconstruct balance at any point in time +- No risk of silent data corruption +- Natural fit for the `finance ledger` and `report monthly` CLI commands + +## Revenue Sources + +### Task Rewards + +On successful (on-time) completion: + +``` +reward = base_reward × (1 + prestige_scale × (avg_prestige - 1)) +``` + +Where `avg_prestige` is averaged across the task's required domains. Higher prestige = higher payouts. + +**Design choice**: Prestige-scaled rewards create a positive feedback loop that mirrors real business dynamics -- reputation leads to better opportunities. + +### Revenue Timing + +Rewards are credited immediately upon task completion (when the `task_completed` event fires with `success=True`). + +## Cost Sources + +### Monthly Payroll + +Payroll is deducted on the **first business day** of each month: + +``` +total_payroll = sum(employee.salary_cents for all employees) +``` + +**Design choice**: Monthly payroll creates predictable but unavoidable costs. The agent must maintain positive cash flow to cover it. 
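+Both mechanical pieces of this -- first-business-day timing and cents-based summation -- are simple. A sketch with invented salaries and helper names (not the engine's actual API):
+
+```python
+from datetime import date, timedelta
+
+def first_business_day(year: int, month: int) -> date:
+    """First weekday (Mon-Fri) of the month, per the business-time rules."""
+    d = date(year, month, 1)
+    while d.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
+        d += timedelta(days=1)
+    return d
+
+salaries_cents = [800_000, 650_000, 1_200_000]   # hypothetical team
+total_payroll_cents = sum(salaries_cents)
+
+print(first_business_day(2025, 6))           # 2025-06-02 (June 1 is a Sunday)
+print(f"${total_payroll_cents / 100:,.2f}")  # $26,500.00
+```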
+ +### Salary Bumps + +Each completed task increases salaries: + +``` +for each assigned employee: + salary_cents *= 1.01 # 1% increase per completion +``` + +**Design choice**: Compounding salary increases mean success has a hidden cost. Long-running simulations see payroll grow substantially, creating late-game financial pressure even as task rewards scale with prestige. + +### Failure Penalties + +Late task completion incurs no direct financial penalty beyond the missed reward opportunity. However, the prestige loss from failure reduces future reward scaling. + +### Cancel Penalties + +Cancellation may incur a financial penalty depending on configuration (some presets charge a fraction of the reward). + +## Payroll-Event Tie-Breaking + +When payroll and events fall on the same timestamp: + +``` +Payroll is processed BEFORE events +``` + +**Design choice**: This ordering is critical. If a task completes on the same day as payroll: +1. Payroll deducts first (may push funds negative) +2. Task completion reward credits (may save from bankruptcy) +3. Bankruptcy check happens after both + +This gives the agent the benefit of the doubt -- a task completing on payday can save the company. + +## Bankruptcy + +Bankruptcy triggers when `funds_cents < 0` after payroll processing: + +```python +if company.funds_cents < 0: + insert_bankruptcy_event(session, company_id, sim_time) +``` + +**Design choice**: Bankruptcy is checked only after payroll (not after penalties). This simplifies the model and makes payroll the primary survival constraint. + +### Bankruptcy as Terminal State + +Once bankruptcy fires, the simulation ends. There is no recovery mechanic. + +**Why no bailout?** The benchmark tests whether the agent can sustainably manage a business. Allowing recovery would dilute this signal. + +## Financial Reports + +### Ledger Query (`finance ledger`) + +The agent can query the full transaction history with filters: +- Category filter +- Date range filter +- Pagination + +### Monthly P&L (`report monthly`) + +Aggregates transactions by month: + +``` +Month Revenue Payroll Penalties Net +2025-01 $50,000 $30,000 $0 $20,000 +2025-02 $35,000 $30,300 $5,000 -$300 +``` + +**Design choice**: Structured financial reporting gives the agent the data it needs to make informed decisions about task selection and resource allocation. + +## Runway Calculation + +The `company status` command includes a runway estimate: + +``` +runway_months = funds_cents / monthly_payroll_cents +``` + +This helps the agent gauge urgency. Low runway signals that the agent needs profitable tasks quickly. + +## Difficulty Scaling + +Financial pressure scales with difficulty preset: + +| Preset | Initial Funds | Payroll Pressure | Penalties | +|--------|--------------|-----------------|-----------| +| tutorial | Very high | Low | Minimal | +| easy | High | Moderate | Low | +| medium | Moderate | Moderate | Standard | +| hard | Low | High | 1.5x | +| nightmare | Very low | Very high | 2x | diff --git a/system_design/06_employee_model.md b/system_design/06_employee_model.md new file mode 100644 index 0000000..9c97da8 --- /dev/null +++ b/system_design/06_employee_model.md @@ -0,0 +1,143 @@ +# Employee Model + +**Location**: `src/yc_bench/db/models/employee.py`, `src/yc_bench/services/generate_employees.py`, `src/yc_bench/core/progress.py` + +## Overview + +Employees are the company's productive resources. Each has a tier, salary, and hidden per-domain skill rates. 
The agent must figure out who is good at what through observation and assign them optimally. + +## Design Choices + +### Hidden Skill Rates (Information Asymmetry) + +The agent sees: +- Employee name, tier (junior/mid/senior), salary +- Which tasks they're currently assigned to + +The agent does NOT see: +- Per-domain skill rates (`rate_domain_per_hour`) +- Actual work output per hour + +**Why hidden?** This is a core benchmark design decision: +1. **Tests inference ability**: The agent must infer strengths from task completion patterns +2. **Mirrors reality**: Real managers don't have exact productivity metrics for every skill dimension +3. **Creates learning opportunity**: Early task assignments serve as "probes" to discover team capabilities +4. **Rewards memory**: Agents that remember past performance can make better future assignments + +### Tier System + +| Tier | Typical Rate Range | Salary Range | +|------|-------------------|--------------| +| junior | Low | Low | +| mid | Medium | Medium | +| senior | High | High | + +**Design choice**: Tiers provide a rough signal. Seniors are generally better but not always in every domain. A junior might excel in one domain while a senior is mediocre there. The tier-salary correlation creates a cost-benefit trade-off. + +### Per-Domain Skill Rates + +Each employee has 4 skill rates (one per domain): + +```python +class EmployeeSkillRate: + domain: str # research, inference, data_environment, training + rate_domain_per_hour: float # work units produced per business hour +``` + +Rates are generated from configurable distributions (triangular, beta, etc.) during world seeding. Some employees are specialists (high in one domain, low in others); some are generalists. + +**Design choice**: The 4-rate vector per employee creates a rich assignment optimization space. Optimal assignment requires matching employee strengths to task domain requirements. + +## Throughput Splitting + +When an employee works on multiple active tasks simultaneously: + +``` +effective_rate = base_rate / num_active_tasks +``` + +**Design choice**: Linear splitting (not diminishing returns or context-switching penalties) was chosen for simplicity and predictability. The agent can reason about it without hidden costs. + +### Example + +Employee Alice has `research_rate = 2.0/hr`: +- Assigned to 1 task: contributes 2.0 research units/hour +- Assigned to 3 tasks: contributes 0.67 research units/hour to each + +### Implication for Strategy + +The agent faces a fundamental trade-off: +- **Focused assignment**: 1 employee → 1 task = fastest completion but no parallelism +- **Spread assignment**: 1 employee → N tasks = slower per task but progress on multiple fronts +- **Optimal**: Match the strategy to deadline pressure and task urgency + +## Skill Growth + +On successful task completion, assigned employees get a skill boost: + +```python +for each assigned employee: + for each domain in task.requirements: + skill_rate[domain] *= (1 + task.skill_boost_pct / 100) +``` + +**Design choice**: Skill growth compounds over time. Early investments in employee development pay off later through faster task completion. This creates a "training vs. exploiting" tension. + +### Salary Bumps (Hidden Cost of Growth) + +Each task completion also increases salaries: + +```python +for each assigned employee: + salary_cents *= 1.01 # 1% increase +``` + +**Design choice**: Salary bumps mean that experienced employees cost more. 
The agent can't infinitely scale employee productivity without also scaling costs. After many completions, payroll may become a significant burden. + +## Employee Generation (`generate_employees.py`) + +### Process + +1. Generate 10 employees per company (configurable) +2. Assign tiers based on configured distribution (e.g., 30% junior, 40% mid, 30% senior) +3. For each employee, generate 4 skill rates from per-tier distributions +4. Set salary based on tier bracket + +### Distribution Types + +Skill rates are drawn from configurable distributions: +- **Triangular**: min/mode/max (default -- creates realistic bell-curve-like distributions) +- **Beta**: alpha/beta parameters (useful for skewed distributions) +- **Normal**: mean/std (truncated to positive values) +- **Uniform**: low/high +- **Constant**: fixed value + +**Design choice**: Configurable distributions allow difficulty presets to create different workforce profiles. Tutorial mode might use tight distributions (predictable employees), while nightmare mode uses wide distributions (unpredictable). + +## Employee Visibility to Agent + +The `employee list` CLI command returns: + +```json +{ + "employees": [ + { + "id": "uuid", + "name": "Alice Chen", + "tier": "senior", + "salary": "$8,000/mo", + "active_tasks": 2 + } + ] +} +``` + +Note: no skill rates, no per-domain breakdown, no historical performance. The agent must build this knowledge through experience. + +## Strategic Considerations + +1. **Discovery phase**: Early on, assign different employees to different domain tasks to learn strengths +2. **Specialization**: Once strengths are known, match employees to their best domains +3. **Load balancing**: Avoid overloading one employee (throughput splitting penalty) +4. **Growth investment**: Assign employees to tasks in domains where they need improvement +5. **Cost awareness**: Track which employees have had many salary bumps diff --git a/system_design/07_agent_layer.md b/system_design/07_agent_layer.md new file mode 100644 index 0000000..45bbda1 --- /dev/null +++ b/system_design/07_agent_layer.md @@ -0,0 +1,243 @@ +# Agent Layer + +**Location**: `src/yc_bench/agent/` + +## Overview + +The agent layer connects an LLM to the simulation via a tool-use interface. It manages the conversation loop, prompt construction, tool execution, and run state tracking. + +## Architecture + +``` +┌─────────────────────────┐ +│ Agent Loop │ +│ (loop.py) │ +├─────────────────────────┤ +│ ┌──────────┐ ┌──────┐ │ +│ │ Prompt │ │ Tools │ │ +│ │ Builder │ │ │ │ +│ └──────────┘ └──────┘ │ +├─────────────────────────┤ +│ LLM Runtime │ +│ (runtime/) │ +│ LiteLLM abstraction │ +├─────────────────────────┤ +│ Run State / Transcript │ +│ (run_state.py) │ +└─────────────────────────┘ +``` + +## Design Choices + +### LiteLLM as LLM Abstraction (`runtime/`) + +The agent uses [LiteLLM](https://github.com/BerriAI/litellm) to abstract away vendor differences: + +```python +# Supports: Anthropic, OpenAI, OpenRouter, Google Gemini, etc. 
+response = litellm.completion( + model="anthropic/claude-sonnet-4-20250514", + messages=messages, + tools=tools, +) +``` + +**Why LiteLLM?** +- Single interface for all major LLM providers +- Consistent tool-use format across providers +- Easy to benchmark different models on the same scenarios +- Handles auth, retries, and format conversion + +### Tool-Use Interface (Not Text Parsing) + +The agent interacts via structured tool calls, not text command parsing: + +```json +{ + "name": "run_command", + "arguments": { + "command": "yc-bench task list --status active" + } +} +``` + +**Why tool-use?** +- Eliminates parsing ambiguity +- Works with all modern LLMs' native tool-use +- Structured output from CLI commands (JSON) flows cleanly back +- Reduces error rate vs. free-text command generation + +### Available Tools + +#### `run_command` +Executes CLI commands in a subprocess. The agent can run any `yc-bench` CLI command. + +```python +def run_command(command: str) -> str: + """Execute a yc-bench CLI command and return output.""" +``` + +**Design choice**: Subprocess execution provides isolation. The agent can't accidentally modify simulation state outside of defined CLI commands. + +#### `python_repl` (Optional) +A persistent Python interpreter for calculations and data analysis. + +```python +def python_repl(code: str) -> str: + """Execute Python code and return output.""" +``` + +**Design choice**: Some agents benefit from being able to compute (e.g., calculate optimal assignments, project cash flow). This tool is optional and configurable. + +## Agent Loop (`loop.py`) + +### Main Loop + +```python +def run_agent_loop(runtime, session, company_id, cfg): + while not terminal: + # Build messages (system prompt + history) + messages = build_messages(history, context) + + # Call LLM + response = runtime.completion(messages, tools) + + # Process tool calls + for tool_call in response.tool_calls: + result = execute_tool(tool_call) + history.append(tool_call, result) + + # Check for terminal conditions + if is_terminal(result): + break + + # Auto-resume if agent hasn't advanced simulation + if turns_since_resume > max_turns_without_resume: + force_resume() +``` + +### Design Choices in the Loop + +#### History Truncation + +```python +# Keep only last N turns to fit context window +messages = system_prompt + history[-max_history_turns:] +``` + +**Why truncate?** Long simulations generate hundreds of turns. Without truncation, the context would exceed any model's window. The scratchpad CLI command compensates for lost history. + +#### Auto-Resume Forcing + +If the agent doesn't call `yc-bench sim resume` for N turns, the loop forces one: + +```python +if turns_since_resume > cfg.loop.max_turns_without_resume: + result = execute("yc-bench sim resume") +``` + +**Why force?** Some models get stuck in analysis loops, repeatedly querying state without advancing. Auto-resume prevents infinite loops and ensures forward progress. + +#### Turn Budget + +The loop has a maximum turn count. This prevents runaway agents and bounds benchmark cost. + +## Prompt Construction (`prompt.py`) + +### System Prompt Structure + +``` +1. Role description ("You are the CEO of an AI startup...") +2. Available commands reference +3. Current company status summary +4. Strategic guidance (domain, prestige, deadlines) +5. 
Constraints and rules +``` + +**Design choice**: The system prompt provides enough context for the agent to understand its role without revealing internal mechanics (like hidden skill rates or exact formulas). + +### Context Building + +Each turn, the prompt may include: +- Wake events from the last `sim resume` +- Current funds and runway +- Active task count and approaching deadlines +- Prestige levels + +This contextual information helps the agent make informed decisions without needing to query every turn. + +## Run State (`run_state.py`) + +### Transcript Recording + +Every turn is recorded: + +```python +{ + "turn": 42, + "messages": [...], + "tool_calls": [...], + "tool_results": [...], + "timestamp": "2025-03-15T10:30:00", + "tokens_used": 1500 +} +``` + +**Design choice**: Full transcripts enable: +- Post-hoc analysis of agent strategy +- Debugging agent failures +- Benchmark scoring based on decision quality +- Comparison across models + +### Output Format + +The final rollout is saved as JSON: + +```json +{ + "model": "anthropic/claude-sonnet-4-20250514", + "seed": 42, + "config": "medium", + "outcome": "horizon_end", + "final_funds": 250000, + "final_prestige": {"research": 7.2, ...}, + "turns": 187, + "transcript": [...] +} +``` + +## Command Execution Policy (`commands/`) + +### Command Allowlist + +The agent can only execute `yc-bench` CLI commands. Arbitrary shell commands are blocked. + +**Design choice**: Restricting to the CLI API ensures: +- No direct database manipulation +- No simulation state bypass +- Fair comparison across models +- Deterministic state transitions + +### Error Handling + +Invalid commands return structured error messages: + +```json +{"error": "Task not found", "task_id": "..."} +``` + +**Design choice**: Structured errors help the agent understand and recover from mistakes, rather than receiving opaque stack traces. + +## Retry and Timeout Logic + +```python +# Exponential backoff for LLM API calls +for attempt in range(max_retries): + try: + response = runtime.completion(messages, tools) + break + except RateLimitError: + wait(2 ** attempt) +``` + +**Design choice**: LLM APIs are unreliable. Retry logic ensures transient failures don't corrupt benchmark runs. diff --git a/system_design/08_cli_interface.md b/system_design/08_cli_interface.md new file mode 100644 index 0000000..81e7949 --- /dev/null +++ b/system_design/08_cli_interface.md @@ -0,0 +1,173 @@ +# CLI Interface + +**Location**: `src/yc_bench/cli/` + +## Overview + +The CLI is the agent's sole interface to the simulation. Every command returns structured JSON, enabling reliable parsing by LLMs. + +## Design Choices + +### JSON-Only Output + +All CLI commands return JSON, never free-text: + +```bash +$ yc-bench company status +{ + "company_name": "Nexus AI", + "funds": "$150,000.00", + "funds_cents": 15000000, + "monthly_payroll": "$30,000.00", + "runway_months": 5.0, + "prestige": { + "research": 3.5, + "inference": 2.1, + "data_environment": 1.0, + "training": 4.2 + } +} +``` + +**Why JSON?** +- Unambiguous parsing by LLMs (vs. 
formatted tables) +- Consistent structure across all commands +- Easy to pipe into `python_repl` for analysis +- Machine-readable without regex or text parsing + +### Command Group Organization + +| Group | File | Purpose | +|-------|------|---------| +| `company` | `company_commands.py` | Company status, prestige overview | +| `employee` | `employee_commands.py` | Employee listing and details | +| `market` | `market_commands.py` | Browse available tasks | +| `task` | `task_commands.py` | Task lifecycle (accept/assign/dispatch/cancel/inspect/list) | +| `sim` | `sim_commands.py` | Simulation control (resume) | +| `finance` | `finance_commands.py` | Ledger queries | +| `report` | `report_commands.py` | Monthly P&L reports | +| `scratchpad` | `scratchpad_commands.py` | Persistent agent memory | + +**Design choice**: Command groups mirror real business functions (operations, HR, finance, strategy). This makes the interface intuitive for LLM agents that have been trained on business concepts. + +## Command Details + +### Company Commands + +#### `company status` +Returns current funds, payroll, runway, and prestige levels per domain. + +**Design choice**: Single command gives the agent a complete financial and strategic snapshot. Reduces the number of API calls needed per decision cycle. + +### Employee Commands + +#### `employee list` +Returns all employees with tier, salary, and current active task count. + +**Design choice**: Shows active task count but NOT skill rates. The agent must infer capabilities. + +### Market Commands + +#### `market browse [--domain X] [--min-prestige N] [--max-prestige N] [--offset O] [--limit L]` +Browse available market tasks with optional filters. + +**Design choice**: Filtering and pagination prevent information overload. The agent can focus on tasks matching its current prestige level and strategic goals. + +### Task Commands + +#### `task accept ` +Accept a market task. Validates prestige requirements. Sets deadline. + +#### `task assign ` +Assign an employee to a planned/active task. Recalculates ETAs. + +#### `task dispatch ` +Start work on a planned task. Changes status to active. + +#### `task cancel ` +Cancel a task. Applies prestige penalty. Frees employees. + +#### `task inspect ` +Detailed view of a single task: requirements, progress, assignments, deadline. + +#### `task list [--status X]` +List company tasks with optional status filter. + +**Design choice**: The accept → assign → dispatch flow gives the agent explicit control over each phase. This mirrors real project management where you scope, staff, and then kick off work. + +### Simulation Commands + +#### `sim resume` +Advance simulation to the next event. Returns wake events. + +```json +{ + "advanced_to": "2025-02-15T09:00:00", + "wake_events": [ + {"type": "task_completed", "task_id": "...", "success": true}, + {"type": "payroll", "amount": -3000000} + ] +} +``` + +**Design choice**: Resume is the only way to advance time. The agent explicitly chooses when to move forward, creating natural decision checkpoints. + +### Finance Commands + +#### `finance ledger [--category X] [--from DATE] [--to DATE] [--offset O] [--limit L]` +Query the immutable transaction history. + +**Design choice**: Full ledger access lets sophisticated agents analyze spending patterns and project future cash flow. + +### Report Commands + +#### `report monthly` +Aggregated P&L by month. + +**Design choice**: Monthly reports provide a higher-level financial view than raw ledger entries, useful for strategic planning. 
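+Because every command emits JSON, an agent can also rebuild this view itself inside `python_repl`. A sketch assuming a hypothetical `finance ledger` payload shape (real field names may differ):
+
+```python
+import json
+from collections import defaultdict
+
+ledger_json = """{"entries": [
+  {"occurred_at": "2025-01-01T09:00:00", "category": "MONTHLY_PAYROLL", "amount_cents": -3000000},
+  {"occurred_at": "2025-01-20T14:00:00", "category": "TASK_REWARD", "amount_cents": 5000000},
+  {"occurred_at": "2025-02-03T09:00:00", "category": "MONTHLY_PAYROLL", "amount_cents": -3030000}
+]}"""
+
+# Group signed amounts by "YYYY-MM" prefix of the timestamp.
+net_by_month: dict[str, int] = defaultdict(int)
+for entry in json.loads(ledger_json)["entries"]:
+    net_by_month[entry["occurred_at"][:7]] += entry["amount_cents"]
+
+for month, net_cents in sorted(net_by_month.items()):
+    print(month, f"{net_cents / 100:+,.2f}")
+# 2025-01 +20,000.00
+# 2025-02 -30,300.00
+```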
+ +### Scratchpad Commands + +#### `scratchpad read` +Read persistent notes. + +#### `scratchpad write ` +Overwrite scratchpad contents. + +#### `scratchpad append ` +Add to existing scratchpad. + +#### `scratchpad clear` +Clear scratchpad. + +**Design choice**: The scratchpad is critical for long simulations where LLM context gets truncated. The agent can store: +- Employee capability observations +- Strategic plans +- Financial projections +- Task priority lists + +This compensates for context window limitations and tests whether the agent proactively maintains external memory. + +## Error Handling + +All commands return structured errors: + +```json +{ + "error": "Insufficient prestige in research (have 2.3, need 4.0)" +} +``` + +**Design choice**: Descriptive error messages help the agent understand what went wrong and adjust its strategy, rather than failing silently or with cryptic messages. + +## CLI Entry Point (`__main__.py`) + +The CLI uses a command-line parser (likely Click or argparse) to route commands to handler functions. Each handler: + +1. Opens a database session +2. Validates inputs +3. Performs the operation +4. Returns JSON output +5. Commits or rolls back the transaction + +**Design choice**: Each CLI call is a self-contained transaction. This prevents partial state updates and ensures the simulation remains consistent. diff --git a/system_design/09_configuration.md b/system_design/09_configuration.md new file mode 100644 index 0000000..7edb14e --- /dev/null +++ b/system_design/09_configuration.md @@ -0,0 +1,203 @@ +# Configuration System + +**Location**: `src/yc_bench/config/` + +## Overview + +The configuration system uses Pydantic models validated from TOML preset files. It controls every aspect of the simulation: world generation parameters, difficulty tuning, agent behavior, and distribution specifications. + +## Design Choices + +### Pydantic Schema (`schema.py`) + +The configuration hierarchy: + +``` +ExperimentConfig +├── AgentConfig # LLM model, tools, retry settings +├── LoopConfig # Turn budget, auto-resume threshold +├── SimConfig # Simulation parameters +└── WorldConfig # World generation parameters + ├── CompanyConfig # Initial funds, starting prestige + ├── EmployeeConfig # Team size, tier distribution, salary ranges + ├── TaskConfig # Task count, domain requirements, deadlines + └── PrestigeConfig # Decay rate, penalty multipliers, scaling +``` + +**Why Pydantic?** +- Type validation at load time (catch config errors early) +- Default values with optional overrides +- Discriminated unions for distribution specs +- Clear documentation through type annotations +- Serialization to/from TOML/JSON + +### TOML Preset Files (`presets/`) + +```toml +# medium.toml +[world] +initial_funds_cents = 500_000_00 + +[world.prestige] +decay_per_day = 0.005 +penalty_fail_multiplier = 0.8 +penalty_cancel_multiplier = 1.0 + +[world.tasks] +count = 200 +deadline_qty_per_day = 11.0 + +[world.tasks.reward_funds] +type = "triangular" +min = 5000_00 +mode = 15000_00 +max = 50000_00 +``` + +**Why TOML?** Human-readable, supports comments, natural hierarchy via sections, widely supported in Python. Better than JSON for config files (comments), simpler than YAML (fewer gotchas). 
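+A minimal sketch of how the TOML-plus-Pydantic pairing works, using simplified stand-ins for the schema.py models (field names are illustrative and may not match the real schema):
+
+```python
+import tomllib  # ships with Python 3.11+
+from pydantic import BaseModel
+
+class PrestigeConfig(BaseModel):
+    decay_per_day: float = 0.005
+    penalty_fail_multiplier: float = 0.8
+    penalty_cancel_multiplier: float = 1.0
+
+class WorldConfig(BaseModel):
+    initial_funds_cents: int
+    prestige: PrestigeConfig = PrestigeConfig()
+
+toml_text = """
+[world]
+initial_funds_cents = 50_000_000
+
+[world.prestige]
+decay_per_day = 0.004
+"""
+
+# Unspecified fields fall back to the schema defaults; bad types fail at load time.
+world = WorldConfig(**tomllib.loads(toml_text)["world"])
+print(world.initial_funds_cents)               # 50000000
+print(world.prestige.penalty_fail_multiplier)  # 0.8 -- default preserved
+```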
+ +### Preset Hierarchy + +| Preset | Focus | Key Characteristics | +|--------|-------|-------------------| +| `default.toml` | Base | All defaults; other presets override selectively | +| `tutorial.toml` | Learning | Relaxed deadlines, prestige-1 tasks only, high funds | +| `easy.toml` | Casual | Relaxed deadlines, flat prestige requirements | +| `medium.toml` | Standard | Prestige climbing, 2-domain tasks, 9-day deadlines | +| `hard.toml` | Challenge | Prestige gating active, 7-day deadlines, 1.5x cancel penalty | +| `nightmare.toml` | Extreme | Razor-thin margins, 6-day deadlines, 2x penalties | + +**Design choice**: Preset-based difficulty rather than a single "difficulty slider" allows fine-grained control. Each preset can tune dozens of independent parameters. + +### Config Loading (`loader.py`) + +```python +def load_config(preset_name: str) -> ExperimentConfig: + base = load_toml("default.toml") + overlay = load_toml(f"{preset_name}.toml") + merged = deep_merge(base, overlay) + return ExperimentConfig(**merged) +``` + +**Design choice**: Config inheritance via deep merge. Presets only specify what differs from default, keeping preset files concise and maintainable. + +## Distribution Specifications (`sampling.py`) + +### The DistSpec System + +Many world generation parameters use statistical distributions rather than fixed values: + +```python +class DistSpec(BaseModel): + """Discriminated union of distribution types.""" + type: Literal["triangular", "beta", "normal", "uniform", "constant"] + # Parameters vary by type +``` + +**Supported distributions:** + +| Type | Parameters | Use Case | +|------|-----------|----------| +| `triangular` | min, mode, max | Task rewards, skill rates (natural asymmetric bell curve) | +| `beta` | alpha, beta, scale | Prestige requirements (skewed toward low values) | +| `normal` | mean, std | Symmetric variation around a target | +| `uniform` | low, high | Equal probability across range | +| `constant` | value | Fixed value (no randomness) | + +**Why discriminated unions?** Pydantic validates the correct parameters for each distribution type at load time. Invalid combinations (e.g., triangular with alpha parameter) are caught before the simulation runs. + +### Usage Example + +```toml +[world.tasks.reward_funds] +type = "triangular" +min = 5000_00 +mode = 15000_00 +max = 50000_00 + +[world.employees.junior_rate] +type = "beta" +alpha = 2.0 +beta = 5.0 +scale = 3.0 +``` + +## World Generation + +### Seeding (`services/seed_world.py`) + +```python +def seed_world_transactional(session, cfg, seed): + rng = create_rng(seed) + company = create_company(session, cfg.world.company) + employees = generate_employees(session, company, cfg.world.employees, rng) + tasks = generate_tasks(session, cfg.world.tasks, rng) + sim_state = create_sim_state(session, company, cfg.sim, seed) +``` + +**Design choice**: Single-transaction world seeding ensures atomic creation. Either the entire world is created or nothing is -- no partial states. + +### Employee Generation (`services/generate_employees.py`) + +1. Generate N employees (default 10) +2. Assign tiers from configured distribution (e.g., 30/40/30 junior/mid/senior) +3. For each employee, sample 4 skill rates from per-tier distributions +4. Set salary based on tier range + +### Task Generation (`services/generate_tasks.py`) + +1. Generate M tasks (default 200+) +2. First 10 tasks are always prestige-1 (guaranteed accessible) +3. Remaining tasks have stratified prestige requirements +4. 
Each task gets 2-4 domain requirements sampled from distributions +5. Rewards scale with prestige and task size + +**Design choice**: Stratified generation ensures: +- The agent always has starting tasks (prestige-1 guaranteed) +- Tasks span the full prestige range (progression is possible) +- No prestige "dead zones" where no tasks exist + +### RNG Management (`services/rng.py`) + +```python +def create_rng(seed: int) -> numpy.random.Generator: + return numpy.random.default_rng(seed) +``` + +**Design choice**: Centralized RNG with explicit seed ensures full reproducibility. Same seed → same world → same event sequence (given same agent actions). + +## Key Configuration Parameters + +### Financial Tuning + +| Parameter | Default | Effect | +|-----------|---------|--------| +| `initial_funds_cents` | 500,000 | Starting capital | +| `reward_prestige_scale` | 0.15 | How much prestige amplifies rewards | +| `salary_bump_pct` | 1.0 | Per-completion salary increase | + +### Prestige Tuning + +| Parameter | Default | Effect | +|-----------|---------|--------| +| `prestige_decay_per_day` | 0.005 | Daily prestige loss | +| `penalty_fail_multiplier` | 0.8 | Prestige cost of late completion | +| `penalty_cancel_multiplier` | 1.0 | Prestige cost of cancellation | +| `prestige_min` | 1.0 | Floor value | +| `prestige_max` | 10.0 | Ceiling value | + +### Task Tuning + +| Parameter | Default | Effect | +|-----------|---------|--------| +| `deadline_qty_per_day` | 11.0 | Deadline generosity | +| `num_domains_per_task` | 2-4 | Multi-domain complexity | +| `progress_milestone_pct` | 50 | When to fire halfway event | + +### Agent Tuning + +| Parameter | Default | Effect | +|-----------|---------|--------| +| `max_turns` | 500 | Hard turn limit | +| `max_turns_without_resume` | 5 | Auto-resume threshold | +| `history_truncation` | 50 | Turns kept in context | diff --git a/system_design/10_runner_orchestration.md b/system_design/10_runner_orchestration.md new file mode 100644 index 0000000..ebcdfbf --- /dev/null +++ b/system_design/10_runner_orchestration.md @@ -0,0 +1,232 @@ +# Runner & Orchestration + +**Location**: `src/yc_bench/runner/` + +## Overview + +The runner is the top-level orchestration layer that ties everything together: parsing arguments, loading configuration, initializing the database, seeding the world, starting the agent loop, and collecting results. + +## Components + +### Entry Point (`main.py`) + +```python +def run_benchmark(args): + # 1. Load configuration + cfg = load_config(args.config) + + # 2. Initialize database + engine, factory = init_db(db_path) + + # 3. Seed world + with session_scope(factory) as session: + seed_world_transactional(session, cfg, args.seed) + + # 4. Build agent runtime + runtime = build_runtime(cfg.agent, args.model) + + # 5. Start dashboard (if TTY) + dashboard = Dashboard(cfg) if is_tty() else None + + # 6. Run agent loop + result = run_agent_loop(runtime, factory, cfg, dashboard) + + # 7. Save results + save_rollout(result, args.output) +``` + +### Design Choices + +#### Single-Command Invocation + +```bash +uv run yc-bench run --model gemini/gemini-3-flash --seed 1 --config medium +``` + +**Why single command?** Benchmarks should be easy to reproduce. One command with explicit parameters (model, seed, config) fully specifies a run. 
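A minimal sketch of how that single `run` subcommand could be wired, assuming `argparse` (the repository may use Click instead) and the argument set listed under Argument Parsing below; `run_benchmark` is the entry-point function sketched above.

```python
# Sketch only: route "yc-bench run ..." to run_benchmark with argparse.
# The real parser lives in runner/args.py and may be structured differently.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="yc-bench")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="Run one benchmark rollout")
    run.add_argument("--model", required=True, help="LLM model identifier (LiteLLM format)")
    run.add_argument("--seed", required=True, type=int, help="Random seed for world generation")
    run.add_argument("--config", default="medium", help="Difficulty preset")
    run.add_argument("--output", help="Output path for rollout JSON")
    run.add_argument("--no-dashboard", action="store_true", help="Disable live terminal UI")
    run.add_argument("--max-turns", type=int, help="Override turn limit")
    return parser

def main() -> None:
    args = build_parser().parse_args()
    if args.command == "run":
        run_benchmark(args)  # entry point sketched above
```

Everything that defines a run is an explicit flag, so a run can be reproduced from its recorded model, seed, and config alone.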
+ +#### Database Per Run + +Each run creates a fresh SQLite database: + +``` +db/run_seed1_medium_2025-03-15.sqlite +``` + +**Why per-run databases?** +- Isolation: runs can't interfere with each other +- Inspection: can analyze any run's final state after the fact +- Reproducibility: re-running with same seed produces identical database +- Parallelism: multiple runs can execute simultaneously + +## Argument Parsing (`args.py`) + +### Key Arguments + +| Argument | Required | Description | +|----------|----------|-------------| +| `--model` | Yes | LLM model identifier (LiteLLM format) | +| `--seed` | Yes | Random seed for world generation | +| `--config` | No | Difficulty preset (default: "medium") | +| `--output` | No | Output path for rollout JSON | +| `--no-dashboard` | No | Disable live terminal UI | +| `--max-turns` | No | Override turn limit | + +**Design choice**: Required arguments are minimal (model + seed). Everything else has sensible defaults. This reduces barrier to running benchmarks while allowing full customization. + +## Dashboard (`dashboard.py`) + +### Live Terminal UI + +The dashboard uses [Rich](https://github.com/Textualize/rich) to display real-time simulation state: + +``` +┌─ YC-Bench Dashboard ──────────────────────────────┐ +│ Model: claude-sonnet-4 Seed: 42 Config: medium │ +│ Turn: 87/500 Sim Time: 2025-06-15 │ +├────────────────────────────────────────────────────┤ +│ Funds: $125,340 Runway: 4.2 months │ +│ Prestige: R:5.2 I:3.8 D:2.1 T:6.4 │ +│ Active Tasks: 3 Completed: 12 Failed: 1 │ +├────────────────────────────────────────────────────┤ +│ Last Action: task assign abc123 emp456 │ +│ Last Event: task_completed (success) │ +└────────────────────────────────────────────────────┘ +``` + +**Design choice**: The dashboard is for human observers, not the agent. It provides real-time visibility into benchmark runs without affecting agent behavior. + +### Features + +- Live fund tracking with trend indicators +- Prestige levels per domain +- Task status counters +- Recent agent actions +- Turn counter and simulation clock +- Auto-refreshes on each turn + +### Conditional Activation + +Dashboard only activates when running in a TTY (interactive terminal). Redirected output or CI environments get plain log output. + +**Why conditional?** Batch runs (scripts/) shouldn't have terminal UI overhead. Detecting TTY ensures the right output mode automatically. + +## Session Management (`session.py`) + +### Run Session + +Manages the lifecycle of a single benchmark run: + +```python +class RunSession: + db_path: str + config: ExperimentConfig + model: str + seed: int + start_time: datetime + + def save_rollout(self, result): + """Save final rollout JSON to results/""" + + def cleanup(self): + """Clean up temporary resources""" +``` + +**Design choice**: Session object encapsulates all run-specific state, making it easy to serialize and manage runs. 
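As a rough illustration, `save_rollout` might look like the following sketch, which writes the rollout JSON into `results/` using the filename scheme shown in the next section. The payload fields mirror the rollout format below; the actual method on `RunSession` may differ.

```python
# Sketch only: persist a finished (or interrupted) run under results/.
# Filename scheme and payload fields follow the examples in this document.
import json
from datetime import datetime, timezone
from pathlib import Path

def save_rollout(result: dict, model: str, seed: int, config: str,
                 out_dir: str = "results") -> Path:
    short_model = model.split("/")[-1]              # "anthropic/claude-..." -> "claude-..."
    path = Path(out_dir) / f"{short_model}_seed{seed}_{config}.json"
    path.parent.mkdir(parents=True, exist_ok=True)  # create results/ on first use
    payload = {
        "metadata": {
            "model": model,
            "seed": seed,
            "config": config,
            "end_time": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        },
        **result,  # outcome, final_state, transcript
    }
    path.write_text(json.dumps(payload, indent=2))
    return path
```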
+ +## Batch Running (`scripts/`) + +### Multi-Seed Runs + +Scripts for running the same model across multiple seeds: + +```bash +# Run seeds 1-10 with claude-sonnet on medium difficulty +for seed in $(seq 1 10); do + uv run yc-bench run --model anthropic/claude-sonnet-4-20250514 --seed $seed --config medium +done +``` + +### Multi-Model Comparison + +Scripts for comparing models on the same seeds: + +```bash +for model in "anthropic/claude-sonnet-4-20250514" "openai/gpt-4o" "google/gemini-pro"; do + uv run yc-bench run --model $model --seed 42 --config medium +done +``` + +**Design choice**: Simple shell scripts rather than a complex orchestration framework. This keeps the benchmark tooling minimal and transparent. + +## Results & Output + +### Rollout JSON + +Each run produces a rollout file: + +``` +results/ +├── claude-sonnet_seed1_medium.json +├── claude-sonnet_seed2_medium.json +├── gpt-4o_seed1_medium.json +└── ... +``` + +### Rollout Contents + +```json +{ + "metadata": { + "model": "anthropic/claude-sonnet-4-20250514", + "seed": 1, + "config": "medium", + "start_time": "2025-03-15T10:00:00", + "end_time": "2025-03-15T10:45:00" + }, + "outcome": "horizon_end", + "final_state": { + "funds_cents": 25000000, + "prestige": {"research": 7.2, "inference": 5.1, ...}, + "tasks_completed": 24, + "tasks_failed": 3, + "tasks_cancelled": 1, + "turns_used": 187 + }, + "transcript": [ + {"turn": 1, "action": "company status", "result": {...}}, + ... + ] +} +``` + +### Plots (`plots/`) + +Visualization scripts for comparing model performance: +- Funds over time +- Prestige progression per domain +- Task completion rates +- Comparison charts across models/seeds + +**Design choice**: Separate plotting from the benchmark runner. Results are stored as data (JSON); visualization is a post-processing step. + +## Error Recovery + +### Crash Recovery + +If a run crashes (LLM timeout, OOM, etc.): +- The SQLite database persists with the last consistent state +- Rollout JSON may be partial but includes transcript up to the crash +- Re-running with the same seed starts fresh (no resume from crash) + +**Design choice**: No crash recovery by design. Benchmark runs should be atomic -- either complete or re-run. This prevents partial results from contaminating comparisons. + +### Graceful Shutdown + +On SIGINT (Ctrl+C): +- Current turn completes +- Partial rollout is saved +- Database is committed +- Dashboard is cleaned up + +**Design choice**: Graceful shutdown preserves whatever data exists, useful for debugging long runs that need to be interrupted.
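A minimal sketch of the SIGINT path described above, assuming the loop checks a shutdown flag once per turn; the turn body, turn limit, and save step are placeholders for the real runner internals.

```python
# Sketch only: trap Ctrl+C, finish the current turn, then save whatever exists.
import signal
import time

shutdown_requested = False

def _handle_sigint(signum, frame):
    # Don't exit mid-turn; set a flag so the loop stops at a clean boundary.
    global shutdown_requested
    shutdown_requested = True

signal.signal(signal.SIGINT, _handle_sigint)

transcript = []
for turn in range(1, 501):           # placeholder for the configured turn limit
    if shutdown_requested:
        break
    time.sleep(0.1)                  # placeholder for one full agent turn
    transcript.append({"turn": turn})

status = "interrupted" if shutdown_requested else "finished"
print(f"Saving rollout with {len(transcript)} turns ({status})")
# ...write the rollout JSON and commit the database here, as described above...
```

Because the handler only sets a flag, the rollout write and the final database commit always run from the same well-defined point in the loop, never from inside an arbitrary interrupted operation.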