Add system design documentation for yc-bench

Comprehensive documentation covering all major subsystems:
simulation engine, data models, task system, prestige, finances,
employees, agent layer, CLI interface, configuration, and runner.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AnandK27 2026-03-08 13:42:41 -07:00
parent b1cd7ebfb2
commit ecd3d9e415
11 changed files with 1858 additions and 0 deletions


@@ -0,0 +1,98 @@
# YC-Bench: System Overview
## What is YC-Bench?
YC-Bench is a **long-horizon deterministic benchmark for LLM agents**. It simulates an AI startup CEO managing a company over 1-3 years through a CLI-based interface against a SQLite-backed discrete-event simulation engine. The benchmark tests sustained decision-making over hundreds of turns through compounding financial, prestige, and deadline pressures.
## Core Premise
An LLM agent is dropped into the role of CEO of a small AI startup. It must:
- Browse and accept tasks from a marketplace
- Assign employees to tasks across 4 technical domains
- Manage cash flow (payroll, rewards, penalties)
- Build prestige in each domain to unlock higher-tier tasks
- Survive until the simulation horizon ends without going bankrupt
## Key Dimensions (codebase: ~4,975 lines of Python)
| Dimension | Details |
|-----------|---------|
| Employees | 10 (hidden per-domain skill rates) |
| Market Tasks | 200+ (configurable) |
| Domains | 4: research, inference, data_environment, training |
| Prestige Range | 1.0 - 10.0 per domain |
| Difficulty Presets | tutorial, easy, medium, hard, nightmare |
## High-Level Architecture
```
┌─────────────────────────────────────────────────────┐
│                    Runner / CLI                     │
│  (argument parsing, dashboard, session management)  │
├─────────────────────────────────────────────────────┤
│                     Agent Layer                     │
│  (LLM runtime, agent loop, tools, prompt building)  │
├─────────────────────────────────────────────────────┤
│                CLI Command Interface                │
│  (company, employee, market, task, sim, finance,    │
│   report, scratchpad)                               │
├─────────────────────────────────────────────────────┤
│              Simulation Engine (core/)              │
│  (event processing, ETA solving, progress tracking, │
│   business time, prestige decay)                    │
├─────────────────────────────────────────────────────┤
│                  Data Layer (db/)                   │
│  (SQLAlchemy ORM models, session management)        │
├─────────────────────────────────────────────────────┤
│          Configuration & World Generation           │
│  (Pydantic schemas, TOML presets, seeding, RNG)     │
└─────────────────────────────────────────────────────┘
```
## Directory Map
```
~/yc_bench_fixed/
├── src/yc_bench/
│   ├── __main__.py       # CLI entry point
│   ├── agent/            # Agent runtime and loop
│   ├── cli/              # Agent-facing CLI commands
│   ├── core/             # Simulation engine
│   ├── db/               # ORM models & session
│   ├── config/           # Pydantic schemas + TOML presets
│   ├── services/         # World generation & RNG
│   └── runner/           # Benchmark orchestration
├── scripts/              # Batch running scripts
├── db/                   # SQLite databases (runtime)
├── results/              # Output JSON rollouts
├── plots/                # Result visualizations
├── pyproject.toml        # Package definition (uv-based)
└── uv.lock               # Lock file
```
## Execution Flow
1. User runs: `uv run yc-bench run --model <model> --seed 1 --config medium`
2. Runner loads config, initializes DB, seeds world, starts agent loop
3. Agent receives system prompt with company context and available CLI tools
4. Each turn: agent calls CLI commands via `run_command` tool, optionally `python_repl`
5. Agent calls `yc-bench sim resume` to advance simulation time
6. Simulation processes events (completions, payroll, milestones) and returns wake events
7. Loop continues until bankruptcy or horizon end
8. Output: rollout JSON transcript + SQLite game state
## Design Documents
| File | Topic |
|------|-------|
| [01_simulation_engine.md](01_simulation_engine.md) | Core simulation engine and event processing |
| [02_data_models.md](02_data_models.md) | Database schema and ORM design |
| [03_task_system.md](03_task_system.md) | Task lifecycle, ETA, and progress |
| [04_prestige_system.md](04_prestige_system.md) | Prestige mechanics, decay, and gating |
| [05_financial_model.md](05_financial_model.md) | Funds, payroll, ledger, and bankruptcy |
| [06_employee_model.md](06_employee_model.md) | Employee skills, throughput, and growth |
| [07_agent_layer.md](07_agent_layer.md) | LLM runtime, agent loop, and tools |
| [08_cli_interface.md](08_cli_interface.md) | CLI command groups and JSON output |
| [09_configuration.md](09_configuration.md) | Config schema, presets, and world generation |
| [10_runner_orchestration.md](10_runner_orchestration.md) | Benchmark runner, dashboard, and session |


@@ -0,0 +1,147 @@
# Simulation Engine
**Location**: `src/yc_bench/core/`
## Design Choice: Discrete-Event Simulation
YC-Bench uses a **discrete-event simulation (DES)** model rather than a tick-based approach. This was chosen because:
1. **Determinism**: Events are processed in a fixed, reproducible order given the same seed
2. **Efficiency**: Time jumps between events rather than iterating every hour/day
3. **Clarity**: Each state change corresponds to a meaningful event, making the simulation auditable
## Core Loop (`engine.py`)
The `advance_time()` function is the heart of the simulation:
```
advance_time(session, company_id, cfg) → AdvanceResult
```
### Algorithm
1. **Flush progress** on all active tasks (convert elapsed business hours into completed work)
2. **Apply prestige decay** for elapsed days
3. **Process payroll** if crossing a month boundary (first business day)
4. **Fetch next unconsumed event** ordered by `(scheduled_at, priority)`
5. **Dispatch to handler** based on event type
6. **Recalculate ETAs** for affected tasks
7. **Update sim_time** to the event's timestamp
8. **Return wake events** to the agent
### Why "Resume" Rather Than Auto-Advance?
The agent explicitly calls `yc-bench sim resume` to advance time. This design:
- Gives the agent control over pacing (plan before advancing)
- Creates a natural decision checkpoint between simulation steps
- Allows multiple CLI queries before committing to advancing
- Includes a safety net: if the agent stalls (N consecutive turns without calling resume), the loop forces an advance automatically
## Event System (`events.py`)
### Event Types (Priority Order)
| Priority | Event Type | Trigger |
|----------|-----------|---------|
| 1 | `task_completed` | Task reaches 100% in all domain requirements |
| 2 | `bankruptcy` | Funds drop below zero after payroll |
| 3 | `task_half` | Task reaches 50% progress milestone |
| 4 | `horizon_end` | Simulation time limit reached |
### Design Choice: Fixed Priority Ordering
Events at the same timestamp are processed in strict priority order. This ensures:
- Task completions (and their rewards) are processed before bankruptcy checks
- A task finishing on the same day as payroll can save the company from bankruptcy
- Deterministic behavior regardless of insertion order
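In miniature, the queue's ordering is just a two-key sort (illustrative tuples, not the real schema):

```python
# Same-timestamp events resolve by priority: lower number wins.
events = [
    ("2025-06-01T09:00", 4, "horizon_end"),
    ("2025-06-01T09:00", 1, "task_completed"),
    ("2025-06-01T09:00", 2, "bankruptcy"),
]
events.sort(key=lambda e: (e[0], e[1]))  # (scheduled_at, priority)

assert [kind for _, _, kind in events] == [
    "task_completed", "bankruptcy", "horizon_end"
]
```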
### Event Identity (Deterministic UUIDs)
Event IDs use `uuid5` based on payload + timestamp + dedupe_key. This means:
- Same world state produces identical event IDs
- Deduplication is automatic (re-inserting same event is a no-op)
- Full reproducibility across runs with same seed
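A sketch of the idea (the namespace value and key format here are illustrative, not the actual yc-bench implementation):

```python
import uuid

# Fixed application namespace (illustrative value, not from yc-bench)
EVENT_NS = uuid.UUID("00000000-0000-0000-0000-000000000000")

def event_id(event_type: str, scheduled_at: str, dedupe_key: str) -> uuid.UUID:
    # uuid5 hashes namespace + name deterministically (SHA-1 based),
    # so the same payload always maps to the same event ID.
    return uuid.uuid5(EVENT_NS, f"{event_type}|{scheduled_at}|{dedupe_key}")
```

Because re-inserting with the same key yields an identical primary key, a unique constraint turns a duplicate insert into a natural no-op.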
## Event Handlers (`handlers/`)
### `task_complete.py`
- Finalizes all domain progress to 100%
- Success check: `sim_time <= deadline`
- On success: add reward funds, add prestige per domain, boost employee skill rates, apply 1% salary bump
- On failure (late): apply prestige penalty per domain (configurable multiplier)
### `task_half.py`
- Marks progress milestone reached
- Informational event for agent awareness (no state changes beyond flag)
### `bankruptcy.py`
- Triggered when `funds_cents < 0` after payroll
- Terminates the simulation with bankruptcy outcome
### `horizon_end.py`
- Triggered at configured simulation end date
- Terminates the simulation with final scoring
## Progress Tracking (`progress.py`)
### Effective Rate Calculation
```
effective_rate = base_rate_per_hour / num_active_tasks_for_this_employee
```
**Design choice**: Throughput splitting creates a resource allocation puzzle. An employee assigned to 3 tasks works at 1/3 speed on each. The agent must balance parallelism vs. focus.
### Progress Flush
When `advance_time()` runs, it calculates work done since the last flush:
```
work = effective_rate × business_hours_elapsed
completed_qty += work (capped at required_qty)
```
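A minimal sketch of that flush rule (function name and signature assumed for illustration):

```python
def flush_progress(required_qty: float, completed_qty: float,
                   base_rate_per_hour: float, business_hours: float,
                   num_active_tasks: int) -> float:
    """Return the new completed quantity after one flush interval."""
    effective_rate = base_rate_per_hour / num_active_tasks  # throughput split
    work = effective_rate * business_hours
    return min(required_qty, completed_qty + work)  # cap at required
```

For example, an employee producing 2.0 units/hour split across two tasks contributes 1.0 unit/hour to each, so 10 business hours advance a 100-unit requirement by 10 units.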
## Business Time (`business_time.py`)
### Design Choice: Business Hours Only
Work only happens during business hours (weekdays, configurable hours per day). This adds:
- Realistic scheduling constraints
- Weekend gaps that affect deadline calculations
- A reason for the agent to think about calendar timing
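A whole-day sketch of the idea (the real implementation presumably handles partial days and configurable hours; this toy version counts full business days only):

```python
from datetime import date, timedelta

def business_hours_between(start: date, end: date,
                           hours_per_day: float = 8.0) -> float:
    # Count weekday hours in [start, end] inclusive; weekends contribute 0.
    total, day = 0.0, start
    while day <= end:
        if day.weekday() < 5:  # Monday=0 .. Friday=4
            total += hours_per_day
        day += timedelta(days=1)
    return total
```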
## ETA Solver (`eta.py`)
### Completion Time
```
solve_task_completion_time():
    for each domain d:
        remaining[d] = required_qty[d] - completed_qty[d]
        rate[d] = sum(effective_rate for assigned employees with skill in d)
        time[d] = remaining[d] / rate[d]
    completion_time = max(time[d]) across all domains
```
### Design Choice: Multi-Domain Bottleneck
A task completes when ALL domains finish. The slowest domain determines completion time. This creates interesting assignment puzzles where the agent must identify and address bottlenecks.
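The bottleneck rule in runnable form (a dict-based toy version, not the actual solver):

```python
import math

def completion_hours(remaining: dict, rates: dict) -> float:
    """Hours until every domain finishes; the slowest domain dominates."""
    worst = 0.0
    for domain, rem in remaining.items():
        if rem <= 0:
            continue  # domain already done
        rate = rates.get(domain, 0.0)
        if rate == 0:
            return math.inf  # no assigned employee can work this domain
        worst = max(worst, rem / rate)
    return worst
```

With 100 research units at 2.0/hour and 50 training units at 2.5/hour, research is the bottleneck at 50 hours even though training finishes in 20.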
### Halfway Time
Used for progress milestone events. Calculated as the weighted midpoint across domains.
## Prestige Decay
```
apply_prestige_decay(session, company_id, days_elapsed, cfg):
    for each domain:
        prestige -= decay_per_day × days_elapsed
        prestige = max(prestige, prestige_min)  # floor at 1.0
```
**Design choice**: Decay prevents "set and forget" strategies. The agent must continuously work in domains to maintain access to high-tier tasks. Neglected domains revert to baseline.
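The decay rule is a one-liner; the sketch below uses the 0.005/day rate cited in the prestige design doc as an assumed default:

```python
def decay_prestige(prestige: float, days_elapsed: float,
                   decay_per_day: float = 0.005, floor: float = 1.0) -> float:
    # Linear decay with a hard floor at the minimum prestige level.
    return max(floor, prestige - decay_per_day * days_elapsed)
```

At this rate, 100 idle days cost half a prestige point (8.0 drops to 7.5), and a year of neglect settles a low-prestige domain back at the 1.0 floor.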


@@ -0,0 +1,190 @@
# Data Models & Database Design
**Location**: `src/yc_bench/db/`
## Design Choice: SQLAlchemy ORM with SQLite
The benchmark uses SQLAlchemy's declarative ORM over SQLite for several reasons:
1. **Single-file persistence**: SQLite stores the entire game state in one file, making runs portable and inspectable
2. **Transactional safety**: ACID guarantees prevent partial state updates
3. **Query flexibility**: SQL allows complex queries for financial reports, task filtering, etc.
4. **Dual-backend support**: The same ORM works with PostgreSQL via `DATABASE_URL` environment variable for production/scaling scenarios
## Schema Overview
```
┌──────────────┐     ┌───────────────────┐
│   Company    │────<│  CompanyPrestige  │  (1 per domain × company)
└──────┬───────┘     └───────────────────┘
       │
       ├────<┌──────────────┐     ┌───────────────────┐
       │     │   Employee   │────<│ EmployeeSkillRate │  (1 per domain × employee)
       │     └──────┬───────┘     └───────────────────┘
       │            │
       │            │        ┌────────────────┐
       │            └───────<│ TaskAssignment │  (employee ↔ task junction)
       │                     └────────┬───────┘
       │                              │
       ├────<┌──────────┐─────────────┘
       │     │   Task   │────<┌─────────────────┐
       │     └──────────┘     │ TaskRequirement │  (1 per domain × task)
       │                      └─────────────────┘
       ├────<┌──────────────┐
       │     │   SimEvent   │  (discrete events queue)
       │     └──────────────┘
       ├────<┌──────────────┐
       │     │ LedgerEntry  │  (financial transactions)
       │     └──────────────┘
       ├────<┌──────────────┐
       │     │   SimState   │  (simulation clock & counters)
       │     └──────────────┘
       └────<┌──────────────┐
             │  Scratchpad  │  (agent persistent memory)
             └──────────────┘
```
## Model Details
### Company (`models/company.py`)
| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID (PK) | Auto-generated |
| `name` | String | Company name |
| `funds_cents` | BigInteger | Financial balance in cents |
**Design choice**: Funds stored in cents (integer) to avoid floating-point rounding errors in financial calculations. BigInteger supports very large/negative values.
### CompanyPrestige (`models/company.py`)
| Column | Type | Notes |
|--------|------|-------|
| `company_id` | UUID (FK) | References Company |
| `domain` | String | research / inference / data_environment / training |
| `prestige_level` | Float | Range [1.0, 10.0] |
**Design choice**: Prestige is tracked per-domain rather than as a single score. This forces specialization trade-offs and creates a 4-dimensional progression space.
### Employee (`models/employee.py`)
| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID (PK) | Auto-generated |
| `company_id` | UUID (FK) | References Company |
| `name` | String | Employee name |
| `tier` | String | junior / mid / senior |
| `work_hours_per_day` | Float | Hours available per business day |
| `salary_cents` | BigInteger | Monthly salary in cents |
### EmployeeSkillRate (`models/employee.py`)
| Column | Type | Notes |
|--------|------|-------|
| `employee_id` | UUID (FK) | References Employee |
| `domain` | String | One of 4 domains |
| `rate_domain_per_hour` | Float | Work units produced per hour |
**Design choice**: Skill rates are **hidden from the agent**. The agent sees tier and salary but not per-domain effectiveness. This creates an information asymmetry puzzle -- the agent must infer employee strengths from task outcomes.
### Task (`models/task.py`)
| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID (PK) | Auto-generated |
| `company_id` | UUID (FK, nullable) | NULL = market task, set on acceptance |
| `status` | Enum | market → planned → active → completed_success / completed_fail / cancelled |
| `title` | String | Task description |
| `required_prestige` | Float | Minimum prestige needed in ALL task domains |
| `reward_funds_cents` | BigInteger | Payment on successful completion |
| `reward_prestige_delta` | Float | Prestige gained per domain on success |
| `skill_boost_pct` | Float | Employee skill rate increase on success |
| `accepted_at` | DateTime (nullable) | When task was accepted from market |
| `deadline` | DateTime (nullable) | Calculated at acceptance |
| `completed_at` | DateTime (nullable) | When task finished |
| `success` | Boolean (nullable) | True = on-time, False = late |
| `progress_milestone_pct` | Float | Tracks progress milestones (e.g., 50%) |
**Design choice**: `company_id` being nullable elegantly distinguishes market tasks (available for browsing) from accepted tasks (owned by the company).
### TaskRequirement (`models/task.py`)
| Column | Type | Notes |
|--------|------|-------|
| `task_id` | UUID (FK) | References Task |
| `domain` | String | Which domain this requirement covers |
| `required_qty` | Float | Total work units needed |
| `completed_qty` | Float | Work units completed so far |
**Design choice**: Multi-domain requirements make tasks a multi-dimensional optimization problem. A task might need work in 2-4 domains simultaneously.
### TaskAssignment (`models/task.py`)
| Column | Type | Notes |
|--------|------|-------|
| `task_id` | UUID (FK) | References Task |
| `employee_id` | UUID (FK) | References Employee |
| `assigned_at` | DateTime | When assigned |
**Design choice**: Many-to-many junction table. An employee can work on multiple tasks (throughput splits), and a task can have multiple employees (parallel progress).
### SimEvent (`models/event.py`)
| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID (PK) | Deterministic (uuid5) |
| `company_id` | UUID (FK) | References Company |
| `event_type` | String | task_completed / bankruptcy / task_half / horizon_end |
| `scheduled_at` | DateTime | When event triggers |
| `payload` | JSON | Event-specific data |
| `dedupe_key` | String | Prevents duplicate events |
| `consumed` | Boolean | True after processing |
### LedgerEntry (`models/ledger.py`)
| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID (PK) | Auto-generated |
| `company_id` | UUID (FK) | References Company |
| `occurred_at` | DateTime | Transaction timestamp |
| `category` | Enum | MONTHLY_PAYROLL / TASK_REWARD / TASK_FAIL_PENALTY / TASK_CANCEL_PENALTY |
| `amount_cents` | BigInteger | Signed amount (negative = cost) |
| `ref_type` | String (nullable) | Reference entity type |
| `ref_id` | UUID (nullable) | Reference entity ID |
**Design choice**: Immutable append-only ledger provides a complete financial audit trail. No entries are ever deleted or modified.
### SimState (`models/sim_state.py`)
| Column | Type | Notes |
|--------|------|-------|
| `company_id` | UUID (FK, PK) | References Company |
| `sim_time` | DateTime | Current simulation clock |
| `run_seed` | Integer | RNG seed for reproducibility |
| `horizon_end` | DateTime | When simulation ends |
| `replenish_counter` | Integer | Tracks market task replenishment |
### Scratchpad (`models/scratchpad.py`)
| Column | Type | Notes |
|--------|------|-------|
| `company_id` | UUID (FK) | References Company |
| `content` | Text | Free-form agent notes |
**Design choice**: Scratchpad survives LLM context truncation, giving the agent persistent memory across the full simulation.
## Session Management (`session.py`)
```python
session_scope(factory) → context manager
```
- Creates a scoped session with automatic commit/rollback
- Supports both SQLite (default) and PostgreSQL (via `DATABASE_URL`)
- `init_db()` creates all tables from ORM metadata
**Design choice**: Context manager pattern ensures every database operation is properly transacted, preventing partial state updates that would corrupt the simulation.
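The commit/rollback pattern can be sketched without SQLAlchemy (the stand-in session class below is purely illustrative):

```python
from contextlib import contextmanager

class StubSession:
    """Illustrative stand-in for a SQLAlchemy session."""
    def __init__(self):
        self.committed = self.rolled_back = self.closed = False
    def commit(self):   self.committed = True
    def rollback(self): self.rolled_back = True
    def close(self):    self.closed = True

@contextmanager
def session_scope(factory):
    session = factory()
    try:
        yield session
        session.commit()    # commit only if the block succeeded
    except Exception:
        session.rollback()  # undo partial writes on any error
        raise
    finally:
        session.close()     # always release the connection
```

Any exception raised inside the `with` block rolls the transaction back, so the database never holds a half-applied state change.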


@@ -0,0 +1,144 @@
# Task System
**Location**: `src/yc_bench/cli/task_commands.py`, `src/yc_bench/core/eta.py`, `src/yc_bench/core/progress.py`
## Task Lifecycle
```
market ──accept──> planned ──dispatch──> active ──complete──> completed_success
                      │                    │                  completed_fail
                      │ cancel             │ cancel
                      └──────────┬─────────┘
                                 ▼
                             cancelled
```
### States
| Status | Meaning |
|--------|---------|
| `market` | Available for browsing, not yet accepted |
| `planned` | Accepted by company, employees can be assigned |
| `active` | Dispatched, work is progressing |
| `completed_success` | Finished on time |
| `completed_fail` | Finished late (past deadline) |
| `cancelled` | Abandoned by agent |
## Design Choices
### Two-Phase Activation (Accept → Dispatch)
Tasks go through `planned` before `active`. This separation:
1. **Allows pre-assignment**: Agent can assign employees before starting the clock
2. **Deadline starts at accept**: Creates urgency -- planning time counts against the deadline
3. **Forces commitment**: Accepting a task reserves it but the agent must still dispatch
### Deadline Calculation
```
deadline = accepted_at + max(required_qty[d] for all domains d) / deadline_qty_per_day
```
**Design choice**: Deadline is proportional to the largest single-domain requirement, not the sum. This means multi-domain tasks don't get proportionally more time -- they require parallel work.
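A calendar-day sketch of the formula (signature and rounding behavior assumed for illustration):

```python
from datetime import datetime, timedelta

def compute_deadline(accepted_at: datetime, required_qty: dict,
                     deadline_qty_per_day: float) -> datetime:
    # The largest single-domain requirement sets the clock, not the sum.
    days = max(required_qty.values()) / deadline_qty_per_day
    return accepted_at + timedelta(days=days)
```

A task needing 120 research units and 40 training units at 10 units/day gets 12 days, the same as a pure 120-unit research task.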
### Prestige Gating at Accept Time
```python
def task_accept(task_id):
    for domain in task.requirements:
        if company_prestige[domain] < task.required_prestige:
            reject(f"Insufficient prestige in {domain}")
```
**Design choice**: Prestige check is per-domain. A task requiring prestige 3.0 with requirements in `research` and `inference` needs prestige >= 3.0 in BOTH domains. This prevents gaming by maxing one domain.
### Cancel Penalties
Cancelling an active task incurs:
- Prestige penalty: `reward_prestige_delta × cancel_multiplier` (configurable per difficulty)
- No financial penalty (just lost opportunity)
**Design choice**: Cancel penalties prevent the strategy of accepting everything and dropping what's inconvenient. Higher difficulties increase the cancel multiplier.
## Employee Assignment
### Assignment Rules
- Employees can only be assigned to `planned` or `active` tasks
- An employee can work on multiple tasks simultaneously (throughput splits)
- Multiple employees can work on the same task (parallel progress)
### Throughput Splitting
```
effective_rate = base_rate_per_hour / num_active_tasks
```
**Design choice**: Linear throughput splitting creates a fundamental trade-off:
- **Focus**: 1 employee on 1 task = full speed
- **Parallel**: 1 employee on 3 tasks = 1/3 speed each
- The agent must decide between fast completion of few tasks vs. slow progress on many
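A toy calculation of the trade-off (numbers invented for illustration):

```python
def hours_for(qty: float, base_rate: float, num_tasks: int) -> float:
    # Linear split: an employee on N tasks works at base_rate / N on each.
    return qty / (base_rate / num_tasks)

# One employee at 2.0 units/h facing two 40-unit tasks:
first_done_seq = hours_for(40, 2.0, 1)                  # focus: first task at 20h
both_done_seq = first_done_seq + hours_for(40, 2.0, 1)  # second at 40h
both_done_par = hours_for(40, 2.0, 2)                   # parallel: both at 40h
```

Total elapsed time is identical either way, but focusing delivers the first reward (and beats its deadline) 20 hours sooner; parallelism only pays off when deadlines or milestones reward simultaneous progress.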
## Progress Tracking (`progress.py`)
### How Work Gets Done
Progress is calculated lazily during `advance_time()`:
```python
for each active task:
    for each assigned employee:
        for each domain in task requirements:
            work = employee.skill_rate[domain] / num_active_tasks × business_hours
            requirement.completed_qty += work
            requirement.completed_qty = min(completed_qty, required_qty)
```
### Multi-Domain Completion
A task is complete when ALL domain requirements reach `completed_qty >= required_qty`. The slowest domain is the bottleneck.
**Design choice**: This creates interesting optimization puzzles. If a task needs 100 units of research and 50 units of training, the agent should allocate more research-skilled employees to balance completion times.
## ETA Solver (`eta.py`)
### Completion Time Calculation
```python
def solve_task_completion_time(task, assignments, sim_time):
    for each domain d:
        remaining = required_qty[d] - completed_qty[d]
        rate = sum(effective_rate[emp][d] for emp in assignments)
        if rate == 0:
            return infinity  # no one can work on this domain
        hours_needed[d] = remaining / rate
    max_hours = max(hours_needed.values())
    return sim_time + max_hours  # advanced in business hours
```
### Halfway Time Calculation
Used for milestone events. Finds the time when weighted average across domains reaches 50%.
### When ETAs Are Recalculated
- Task dispatched (new active task)
- Employee assigned/unassigned
- Task completed (frees employee throughput for other tasks)
- Task cancelled (same)
**Design choice**: Dynamic ETA recalculation ensures events are always accurate. When an employee is reassigned, all affected tasks get new completion projections.
## Market Task Generation
See [09_configuration.md](09_configuration.md) for details on how market tasks are generated with stratified prestige distribution and randomized requirements.
### Browsing and Filtering
The `market browse` command supports:
- Domain filter
- Prestige range filter
- Reward range filter
- Pagination (offset/limit)
All output is JSON for agent consumption.


@@ -0,0 +1,123 @@
# Prestige System
**Location**: `src/yc_bench/db/models/company.py` (CompanyPrestige), `src/yc_bench/core/engine.py` (decay), `src/yc_bench/core/handlers/task_complete.py` (rewards/penalties)
## Overview
Prestige is YC-Bench's core progression mechanic. It controls access to higher-tier tasks (which offer better rewards) and decays over time, forcing continuous engagement.
## Design Choices
### Per-Domain Prestige (4 Independent Tracks)
```
research:          ████████░░  (8.0)
inference:         ██████░░░░  (6.0)
data_environment:  ███░░░░░░░  (3.0)
training:          █████░░░░░  (5.0)
```
**Why 4 domains?** This creates a 4-dimensional strategic space:
- The agent can't max all domains simultaneously (decay + limited employees)
- Specialization unlocks high-tier tasks in 1-2 domains
- Diversification provides resilience but slower progression
- Multi-domain tasks require balanced prestige across their domains
### Prestige Range: [1.0, 10.0]
| Level | Meaning |
|-------|---------|
| 1.0 | Minimum (starting/decayed) |
| 3.0-4.0 | Mid-tier tasks accessible |
| 7.0-8.0 | High-tier tasks accessible |
| 10.0 | Maximum (hard cap) |
**Design choice**: The 1-10 range is intuitive and provides enough granularity for meaningful gating tiers without over-complicating the system.
## Prestige Gain
On successful task completion (on-time):
```
for each domain in task.requirements:
    company_prestige[domain] += task.reward_prestige_delta
    company_prestige[domain] = min(prestige, 10.0)  # cap
```
**Design choice**: Prestige gain is per-domain and tied to the task's requirements. Completing a research+inference task only boosts those two domains, not training or data_environment.
### Prestige Scaling of Rewards
```
actual_reward = base_reward × (1 + reward_prestige_scale × (avg_prestige - 1))
```
where `avg_prestige` is the company's prestige averaged across the task's required domains. Higher prestige in those domains means better financial returns from their tasks. This creates a virtuous cycle: more prestige → more money → more capacity → more prestige.
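In sketch form (the 0.1 scale factor is an assumed example value, not a documented default):

```python
def scaled_reward_cents(base_reward_cents: int, avg_prestige: float,
                        reward_prestige_scale: float = 0.1) -> int:
    # Round back to integer cents to keep accounting exact.
    return round(base_reward_cents * (1 + reward_prestige_scale * (avg_prestige - 1)))
```

With a 0.1 scale, a $1,000 task pays $1,000 at prestige 1.0 and $1,500 at prestige 6.0.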
## Prestige Loss
### Decay (Daily)
```
prestige -= decay_per_day × days_elapsed
prestige = max(prestige, 1.0) # floor
```
Default decay rate: 0.005 prestige per day. This is slow enough not to punish short gaps but fast enough that inactive domains eventually return to baseline.
**Design choice**: Continuous decay prevents "build once, exploit forever" strategies. The agent must continuously complete tasks in a domain to maintain access.
### Failure Penalty
On late task completion:
```
for each domain in task.requirements:
    company_prestige[domain] -= task.reward_prestige_delta × fail_multiplier
    company_prestige[domain] = max(prestige, 1.0)
```
Default `fail_multiplier`: 0.8. Late completion costs almost as much prestige as success would have gained.
### Cancel Penalty
On task cancellation:
```
for each domain in task.requirements:
    company_prestige[domain] -= task.reward_prestige_delta × cancel_multiplier
    company_prestige[domain] = max(prestige, 1.0)
```
Cancel multipliers vary by difficulty (higher on hard/nightmare).
## Prestige Gating
Tasks have a `required_prestige` field. At task acceptance:
```python
for domain in task.requirements:
    if company_prestige[domain] < task.required_prestige:
        reject()  # must meet prestige in ALL task domains
```
**Design choice**: Per-domain gating means a task with `required_prestige=5.0` and requirements in research + training needs prestige >= 5.0 in BOTH research AND training. This prevents gaming.
### Stratified Market Tasks
The first 10 market tasks are always prestige-1 (accessible immediately). Higher prestige tasks are introduced with stratified distribution. This ensures:
- The agent always has something to work on initially
- Progression is visible (new tasks unlock as prestige grows)
- No dead-end states where the agent can't accept any task
## Strategic Implications
The prestige system creates several key strategic tensions:
1. **Specialize vs. Diversify**: Focus on 1-2 domains for deep access, or spread across all 4?
2. **Risk vs. Reward**: High-prestige tasks pay more but failure costs more prestige
3. **Maintenance vs. Growth**: Should the agent keep working in mastered domains (maintenance) or push new ones (growth)?
4. **Accept vs. Defer**: Taking a task you might fail risks prestige loss; waiting risks decay
These tensions make the benchmark more than just "do tasks fast" -- it tests genuine strategic reasoning.


@@ -0,0 +1,162 @@
# Financial Model
**Location**: `src/yc_bench/db/models/ledger.py`, `src/yc_bench/cli/finance_commands.py`, `src/yc_bench/cli/report_commands.py`, `src/yc_bench/core/handlers/`
## Overview
The financial model simulates a startup's cash flow: revenue from completed tasks, costs from employee payroll, and penalties for failures. Running out of money triggers bankruptcy and ends the simulation.
## Design Choices
### Cents-Based Integer Arithmetic
All financial values are stored as `BigInteger` in cents:
```
$1,000.00 = 100_000 cents
```
**Why cents?** Floating-point arithmetic introduces rounding errors that compound over hundreds of transactions. Integer cents guarantee exact financial accounting -- critical for a deterministic benchmark.
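The failure mode is easy to demonstrate:

```python
# Summing $0.10 a thousand times in floats drifts away from $100.00...
dollars = 0.0
for _ in range(1000):
    dollars += 0.10
assert dollars != 100.0  # accumulated binary rounding error

# ...while integer cents land exactly.
cents = 0
for _ in range(1000):
    cents += 10
assert cents == 10_000 and cents / 100 == 100.0
```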
### Immutable Append-Only Ledger
Every financial transaction creates a `LedgerEntry` that is never modified or deleted:
```python
class LedgerEntry:
    category: MONTHLY_PAYROLL | TASK_REWARD | TASK_FAIL_PENALTY | TASK_CANCEL_PENALTY
    amount_cents: int     # negative for costs, positive for revenue
    occurred_at: datetime
    ref_type: str         # optional reference to source entity
    ref_id: UUID          # optional reference ID
```
**Why immutable?** An append-only ledger provides:
- Complete audit trail for debugging
- Ability to reconstruct balance at any point in time
- No risk of silent data corruption
- Natural fit for the `finance ledger` and `report monthly` CLI commands
## Revenue Sources
### Task Rewards
On successful (on-time) completion:
```
reward = base_reward × (1 + prestige_scale × (avg_prestige - 1))
```
Where `avg_prestige` is averaged across the task's required domains. Higher prestige = higher payouts.
**Design choice**: Prestige-scaled rewards create a positive feedback loop that mirrors real business dynamics -- reputation leads to better opportunities.
### Revenue Timing
Rewards are credited immediately upon task completion (when the `task_completed` event fires with `success=True`).
## Cost Sources
### Monthly Payroll
Payroll is deducted on the **first business day** of each month:
```
total_payroll = sum(employee.salary_cents for all employees)
```
**Design choice**: Monthly payroll creates predictable but unavoidable costs. The agent must maintain positive cash flow to cover it.
### Salary Bumps
Each completed task increases salaries:
```
for each assigned employee:
    salary_cents *= 1.01  # 1% increase per completion
```
**Design choice**: Compounding salary increases mean success has a hidden cost. Long-running simulations see payroll grow substantially, creating late-game financial pressure even as task rewards scale with prestige.
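A quick compounding check (rounding to whole cents after each bump is an assumption about the implementation):

```python
def salary_after(initial_cents: int, completions: int, bump: float = 1.01) -> int:
    cents = initial_cents
    for _ in range(completions):
        cents = round(cents * bump)  # 1% raise per completed task
    return cents

# A $10,000/month salary grows by roughly 64% after 50 completed tasks.
```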
### Failure Penalties
Late task completion incurs no direct financial penalty beyond the missed reward opportunity. However, the prestige loss from failure reduces future reward scaling.
### Cancel Penalties
Cancellation may incur a financial penalty depending on configuration (some presets charge a fraction of the reward).
## Payroll-Event Tie-Breaking
When payroll and events fall on the same timestamp:
```
Payroll is processed BEFORE events
```
**Design choice**: This ordering is critical. If a task completes on the same day as payroll:
1. Payroll deducts first (may push funds negative)
2. Task completion reward credits (may save from bankruptcy)
3. Bankruptcy check happens after both
This gives the agent the benefit of the doubt -- a task completing on payday can save the company.
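The same-day settlement order can be sketched as a single function (names and signature invented for illustration):

```python
def settle_day(funds_cents: int, payroll_cents: int, rewards_cents: int):
    # Payroll first, then same-day task rewards, then the bankruptcy check.
    funds_cents -= payroll_cents
    funds_cents += rewards_cents
    return funds_cents, funds_cents < 0  # (new balance, bankrupt?)
```

Funds of $20k, payroll of $30k, and a $15k reward landing the same day settle at $5k with no bankruptcy; without the reward, the company dies.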
## Bankruptcy
Bankruptcy triggers when `funds_cents < 0` after payroll processing:
```python
if company.funds_cents < 0:
    insert_bankruptcy_event(session, company_id, sim_time)
```
**Design choice**: Bankruptcy is checked only after payroll (not after penalties). This simplifies the model and makes payroll the primary survival constraint.
### Bankruptcy as Terminal State
Once bankruptcy fires, the simulation ends. There is no recovery mechanic.
**Why no bailout?** The benchmark tests whether the agent can sustainably manage a business. Allowing recovery would dilute this signal.
## Financial Reports
### Ledger Query (`finance ledger`)
The agent can query the full transaction history with filters:
- Category filter
- Date range filter
- Pagination
### Monthly P&L (`report monthly`)
Aggregates transactions by month:
```
Month      Revenue    Payroll    Penalties   Net
2025-01    $50,000    $30,000    $0          $20,000
2025-02    $35,000    $30,300    $5,000      -$300
```
**Design choice**: Structured financial reporting gives the agent the data it needs to make informed decisions about task selection and resource allocation.
## Runway Calculation
The `company status` command includes a runway estimate:
```
runway_months = funds_cents / monthly_payroll_cents
```
This helps the agent gauge urgency. Low runway signals that the agent needs profitable tasks quickly.
## Difficulty Scaling
Financial pressure scales with difficulty preset:
| Preset | Initial Funds | Payroll Pressure | Penalties |
|--------|--------------|-----------------|-----------|
| tutorial | Very high | Low | Minimal |
| easy | High | Moderate | Low |
| medium | Moderate | Moderate | Standard |
| hard | Low | High | 1.5x |
| nightmare | Very low | Very high | 2x |


@@ -0,0 +1,143 @@
# Employee Model
**Location**: `src/yc_bench/db/models/employee.py`, `src/yc_bench/services/generate_employees.py`, `src/yc_bench/core/progress.py`
## Overview
Employees are the company's productive resources. Each has a tier, salary, and hidden per-domain skill rates. The agent must figure out who is good at what through observation and assign them optimally.
## Design Choices
### Hidden Skill Rates (Information Asymmetry)
The agent sees:
- Employee name, tier (junior/mid/senior), salary
- Which tasks they're currently assigned to
The agent does NOT see:
- Per-domain skill rates (`rate_domain_per_hour`)
- Actual work output per hour
**Why hidden?** This is a core benchmark design decision:
1. **Tests inference ability**: The agent must infer strengths from task completion patterns
2. **Mirrors reality**: Real managers don't have exact productivity metrics for every skill dimension
3. **Creates learning opportunity**: Early task assignments serve as "probes" to discover team capabilities
4. **Rewards memory**: Agents that remember past performance can make better future assignments
### Tier System
| Tier | Typical Rate Range | Salary Range |
|------|-------------------|--------------|
| junior | Low | Low |
| mid | Medium | Medium |
| senior | High | High |
**Design choice**: Tiers provide a rough signal. Seniors are generally better but not always in every domain. A junior might excel in one domain while a senior is mediocre there. The tier-salary correlation creates a cost-benefit trade-off.
### Per-Domain Skill Rates
Each employee has 4 skill rates (one per domain):
```python
class EmployeeSkillRate:
    domain: str                  # research, inference, data_environment, training
    rate_domain_per_hour: float  # work units produced per business hour
```
Rates are generated from configurable distributions (triangular, beta, etc.) during world seeding. Some employees are specialists (high in one domain, low in others); some are generalists.
**Design choice**: The 4-rate vector per employee creates a rich assignment optimization space. Optimal assignment requires matching employee strengths to task domain requirements.
## Throughput Splitting
When an employee works on multiple active tasks simultaneously:
```
effective_rate = base_rate / num_active_tasks
```
**Design choice**: Linear splitting (not diminishing returns or context-switching penalties) was chosen for simplicity and predictability. The agent can reason about it without hidden costs.
### Example
Employee Alice has `research_rate = 2.0/hr`:
- Assigned to 1 task: contributes 2.0 research units/hour
- Assigned to 3 tasks: contributes 0.67 research units/hour to each
### Implication for Strategy
The agent faces a fundamental trade-off:
- **Focused assignment**: 1 employee → 1 task = fastest completion but no parallelism
- **Spread assignment**: 1 employee → N tasks = slower per task but progress on multiple fronts
- **Optimal**: Match the strategy to deadline pressure and task urgency
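The splitting rule above can be sketched as a small helper (name hypothetical):

```python
def effective_rate(base_rate: float, num_active_tasks: int) -> float:
    """Linear throughput splitting: the base rate is divided evenly
    across active assignments, with no context-switching penalty."""
    if num_active_tasks <= 0:
        return 0.0
    return base_rate / num_active_tasks

# Alice's research rate of 2.0/hr spread across 3 tasks
split = effective_rate(2.0, 3)  # ~0.67 research units/hour per task
```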
## Skill Growth
On successful task completion, assigned employees get a skill boost:
```python
for each assigned employee:
    for each domain in task.requirements:
        skill_rate[domain] *= (1 + task.skill_boost_pct / 100)
```
**Design choice**: Skill growth compounds over time. Early investments in employee development pay off later through faster task completion. This creates a "training vs. exploiting" tension.
### Salary Bumps (Hidden Cost of Growth)
Each task completion also increases salaries:
```python
for each assigned employee:
    salary_cents *= 1.01  # 1% increase
```
**Design choice**: Salary bumps mean that experienced employees cost more. The agent can't infinitely scale employee productivity without also scaling costs. After many completions, payroll may become a significant burden.
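Both effects compound per completion. A toy projection under the stated 1% salary bump and a hypothetical 2% per-task skill boost:

```python
def project_employee(rate: float, salary_cents: int, completions: int,
                     skill_boost_pct: float = 2.0, salary_bump_pct: float = 1.0):
    """Project one domain rate and the salary over repeated successful
    completions. The 1% salary bump matches the rule above; the 2%
    skill boost is an assumed value for illustration."""
    for _ in range(completions):
        rate *= 1 + skill_boost_pct / 100
        salary_cents = int(salary_cents * (1 + salary_bump_pct / 100))
    return rate, salary_cents

rate, salary = project_employee(2.0, 800_000, completions=20)
# After 20 completions: rate up ~49%, salary up ~22% -- growth outpaces
# cost here, but only under the assumed boost percentage
```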
## Employee Generation (`generate_employees.py`)
### Process
1. Generate 10 employees per company (configurable)
2. Assign tiers based on configured distribution (e.g., 30% junior, 40% mid, 30% senior)
3. For each employee, generate 4 skill rates from per-tier distributions
4. Set salary based on tier bracket
### Distribution Types
Skill rates are drawn from configurable distributions:
- **Triangular**: min/mode/max (default -- creates realistic bell-curve-like distributions)
- **Beta**: alpha/beta parameters (useful for skewed distributions)
- **Normal**: mean/std (truncated to positive values)
- **Uniform**: low/high
- **Constant**: fixed value
**Design choice**: Configurable distributions allow difficulty presets to create different workforce profiles. Tutorial mode might use tight distributions (predictable employees), while nightmare mode uses wide distributions (unpredictable).
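A sampling dispatcher over such specs can be sketched with the stdlib (the real generator draws from numpy and also supports beta, normal, and constant; the spec-dict shape here is assumed):

```python
import random

def sample_rate(spec: dict, rng: random.Random) -> float:
    """Sample one skill rate from a distribution spec.

    Illustrative dispatcher: stdlib only, covering three of the
    documented types; parameter names follow the tables above.
    """
    if spec["type"] == "triangular":
        return rng.triangular(spec["min"], spec["max"], spec["mode"])
    if spec["type"] == "uniform":
        return rng.uniform(spec["low"], spec["high"])
    if spec["type"] == "constant":
        return spec["value"]
    raise ValueError(f"unsupported distribution: {spec['type']}")

rng = random.Random(42)  # seeded, so generation is reproducible
r = sample_rate({"type": "triangular", "min": 0.5, "mode": 1.5, "max": 3.0}, rng)
```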
## Employee Visibility to Agent
The `employee list` CLI command returns:
```json
{
  "employees": [
    {
      "id": "uuid",
      "name": "Alice Chen",
      "tier": "senior",
      "salary": "$8,000/mo",
      "active_tasks": 2
    }
  ]
}
```
Note: no skill rates, no per-domain breakdown, no historical performance. The agent must build this knowledge through experience.
## Strategic Considerations
1. **Discovery phase**: Early on, assign different employees to different domain tasks to learn strengths
2. **Specialization**: Once strengths are known, match employees to their best domains
3. **Load balancing**: Avoid overloading one employee (throughput splitting penalty)
4. **Growth investment**: Assign employees to tasks in domains where they need improvement
5. **Cost awareness**: Track which employees have had many salary bumps

# Agent Layer
**Location**: `src/yc_bench/agent/`
## Overview
The agent layer connects an LLM to the simulation via a tool-use interface. It manages the conversation loop, prompt construction, tool execution, and run state tracking.
## Architecture
```
┌─────────────────────────┐
│       Agent Loop        │
│        (loop.py)        │
├─────────────────────────┤
│ ┌──────────┐ ┌────────┐ │
│ │  Prompt  │ │ Tools  │ │
│ │ Builder  │ │        │ │
│ └──────────┘ └────────┘ │
├─────────────────────────┤
│       LLM Runtime       │
│       (runtime/)        │
│   LiteLLM abstraction   │
├─────────────────────────┤
│ Run State / Transcript  │
│     (run_state.py)      │
└─────────────────────────┘
```
## Design Choices
### LiteLLM as LLM Abstraction (`runtime/`)
The agent uses [LiteLLM](https://github.com/BerriAI/litellm) to abstract away vendor differences:
```python
# Supports: Anthropic, OpenAI, OpenRouter, Google Gemini, etc.
response = litellm.completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=messages,
    tools=tools,
)
```
**Why LiteLLM?**
- Single interface for all major LLM providers
- Consistent tool-use format across providers
- Easy to benchmark different models on the same scenarios
- Handles auth, retries, and format conversion
### Tool-Use Interface (Not Text Parsing)
The agent interacts via structured tool calls, not text command parsing:
```json
{
  "name": "run_command",
  "arguments": {
    "command": "yc-bench task list --status active"
  }
}
```
**Why tool-use?**
- Eliminates parsing ambiguity
- Works with all modern LLMs' native tool-use
- Structured output from CLI commands (JSON) flows cleanly back
- Reduces error rate vs. free-text command generation
### Available Tools
#### `run_command`
Executes CLI commands in a subprocess. The agent can run any `yc-bench` CLI command.
```python
def run_command(command: str) -> str:
    """Execute a yc-bench CLI command and return output."""
```
**Design choice**: Subprocess execution provides isolation. The agent can't accidentally modify simulation state outside of defined CLI commands.
#### `python_repl` (Optional)
A persistent Python interpreter for calculations and data analysis.
```python
def python_repl(code: str) -> str:
    """Execute Python code and return output."""
```
**Design choice**: Some agents benefit from being able to compute (e.g., calculate optimal assignments, project cash flow). This tool is optional and configurable.
## Agent Loop (`loop.py`)
### Main Loop
```python
def run_agent_loop(runtime, session, company_id, cfg):
    while not terminal:
        # Build messages (system prompt + history)
        messages = build_messages(history, context)

        # Call LLM
        response = runtime.completion(messages, tools)

        # Process tool calls
        for tool_call in response.tool_calls:
            result = execute_tool(tool_call)
            history.append(tool_call, result)

            # Check for terminal conditions
            if is_terminal(result):
                break

        # Auto-resume if agent hasn't advanced simulation
        if turns_since_resume > max_turns_without_resume:
            force_resume()
```
### Design Choices in the Loop
#### History Truncation
```python
# Keep only last N turns to fit context window
messages = system_prompt + history[-max_history_turns:]
```
**Why truncate?** Long simulations generate hundreds of turns. Without truncation, the context would exceed any model's window. The scratchpad CLI command compensates for lost history.
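The truncation rule can be sketched as follows; here a "turn" is a single message dict, which simplifies the real bookkeeping:

```python
def build_messages(system_prompt: str, history: list, max_history_turns: int = 50) -> list:
    """Keep the system prompt plus only the most recent turns.

    Simplified sketch: the system prompt is always retained, older
    history falls off the front of the window.
    """
    return [{"role": "system", "content": system_prompt}] + history[-max_history_turns:]

history = [{"role": "user", "content": f"turn {i}"} for i in range(200)]
msgs = build_messages("You are the CEO of an AI startup...", history)
# 200 history turns reduced to 1 system message + the 50 most recent
```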
#### Auto-Resume Forcing
If the agent doesn't call `yc-bench sim resume` for N turns, the loop forces one:
```python
if turns_since_resume > cfg.loop.max_turns_without_resume:
    result = execute("yc-bench sim resume")
```
**Why force?** Some models get stuck in analysis loops, repeatedly querying state without advancing. Auto-resume prevents infinite loops and ensures forward progress.
#### Turn Budget
The loop has a maximum turn count. This prevents runaway agents and bounds benchmark cost.
## Prompt Construction (`prompt.py`)
### System Prompt Structure
```
1. Role description ("You are the CEO of an AI startup...")
2. Available commands reference
3. Current company status summary
4. Strategic guidance (domain, prestige, deadlines)
5. Constraints and rules
```
**Design choice**: The system prompt provides enough context for the agent to understand its role without revealing internal mechanics (like hidden skill rates or exact formulas).
### Context Building
Each turn, the prompt may include:
- Wake events from the last `sim resume`
- Current funds and runway
- Active task count and approaching deadlines
- Prestige levels
This contextual information helps the agent make informed decisions without needing to query every turn.
## Run State (`run_state.py`)
### Transcript Recording
Every turn is recorded:
```python
{
    "turn": 42,
    "messages": [...],
    "tool_calls": [...],
    "tool_results": [...],
    "timestamp": "2025-03-15T10:30:00",
    "tokens_used": 1500
}
```
**Design choice**: Full transcripts enable:
- Post-hoc analysis of agent strategy
- Debugging agent failures
- Benchmark scoring based on decision quality
- Comparison across models
### Output Format
The final rollout is saved as JSON:
```json
{
  "model": "anthropic/claude-sonnet-4-20250514",
  "seed": 42,
  "config": "medium",
  "outcome": "horizon_end",
  "final_funds": 250000,
  "final_prestige": {"research": 7.2, ...},
  "turns": 187,
  "transcript": [...]
}
```
## Command Execution Policy (`commands/`)
### Command Allowlist
The agent can only execute `yc-bench` CLI commands. Arbitrary shell commands are blocked.
**Design choice**: Restricting to the CLI API ensures:
- No direct database manipulation
- No simulation state bypass
- Fair comparison across models
- Deterministic state transitions
### Error Handling
Invalid commands return structured error messages:
```json
{"error": "Task not found", "task_id": "..."}
```
**Design choice**: Structured errors help the agent understand and recover from mistakes, rather than receiving opaque stack traces.
## Retry and Timeout Logic
```python
import time

# Exponential backoff for LLM API calls
for attempt in range(max_retries):
    try:
        response = runtime.completion(messages, tools)
        break
    except RateLimitError:
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
else:
    raise RuntimeError("LLM call failed after max_retries attempts")
```
**Design choice**: LLM APIs are unreliable. Retry logic ensures transient failures don't corrupt benchmark runs.

# CLI Interface
**Location**: `src/yc_bench/cli/`
## Overview
The CLI is the agent's sole interface to the simulation. Every command returns structured JSON, enabling reliable parsing by LLMs.
## Design Choices
### JSON-Only Output
All CLI commands return JSON, never free-text:
```bash
$ yc-bench company status
{
  "company_name": "Nexus AI",
  "funds": "$150,000.00",
  "funds_cents": 15000000,
  "monthly_payroll": "$30,000.00",
  "runway_months": 5.0,
  "prestige": {
    "research": 3.5,
    "inference": 2.1,
    "data_environment": 1.0,
    "training": 4.2
  }
}
```
**Why JSON?**
- Unambiguous parsing by LLMs (vs. formatted tables)
- Consistent structure across all commands
- Easy to pipe into `python_repl` for analysis
- Machine-readable without regex or text parsing
### Command Group Organization
| Group | File | Purpose |
|-------|------|---------|
| `company` | `company_commands.py` | Company status, prestige overview |
| `employee` | `employee_commands.py` | Employee listing and details |
| `market` | `market_commands.py` | Browse available tasks |
| `task` | `task_commands.py` | Task lifecycle (accept/assign/dispatch/cancel/inspect/list) |
| `sim` | `sim_commands.py` | Simulation control (resume) |
| `finance` | `finance_commands.py` | Ledger queries |
| `report` | `report_commands.py` | Monthly P&L reports |
| `scratchpad` | `scratchpad_commands.py` | Persistent agent memory |
**Design choice**: Command groups mirror real business functions (operations, HR, finance, strategy). This makes the interface intuitive for LLM agents that have been trained on business concepts.
## Command Details
### Company Commands
#### `company status`
Returns current funds, payroll, runway, and prestige levels per domain.
**Design choice**: Single command gives the agent a complete financial and strategic snapshot. Reduces the number of API calls needed per decision cycle.
### Employee Commands
#### `employee list`
Returns all employees with tier, salary, and current active task count.
**Design choice**: Shows active task count but NOT skill rates. The agent must infer capabilities.
### Market Commands
#### `market browse [--domain X] [--min-prestige N] [--max-prestige N] [--offset O] [--limit L]`
Browse available market tasks with optional filters.
**Design choice**: Filtering and pagination prevent information overload. The agent can focus on tasks matching its current prestige level and strategic goals.
### Task Commands
#### `task accept <task_id>`
Accept a market task. Validates prestige requirements. Sets deadline.
#### `task assign <task_id> <employee_id>`
Assign an employee to a planned/active task. Recalculates ETAs.
#### `task dispatch <task_id>`
Start work on a planned task. Changes status to active.
#### `task cancel <task_id>`
Cancel a task. Applies prestige penalty. Frees employees.
#### `task inspect <task_id>`
Detailed view of a single task: requirements, progress, assignments, deadline.
#### `task list [--status X]`
List company tasks with optional status filter.
**Design choice**: The accept → assign → dispatch flow gives the agent explicit control over each phase. This mirrors real project management where you scope, staff, and then kick off work.
### Simulation Commands
#### `sim resume`
Advance simulation to the next event. Returns wake events.
```json
{
  "advanced_to": "2025-02-15T09:00:00",
  "wake_events": [
    {"type": "task_completed", "task_id": "...", "success": true},
    {"type": "payroll", "amount": -3000000}
  ]
}
```
**Design choice**: Resume is the only way to advance time. The agent explicitly chooses when to move forward, creating natural decision checkpoints.
### Finance Commands
#### `finance ledger [--category X] [--from DATE] [--to DATE] [--offset O] [--limit L]`
Query the immutable transaction history.
**Design choice**: Full ledger access lets sophisticated agents analyze spending patterns and project future cash flow.
### Report Commands
#### `report monthly`
Aggregated P&L by month.
**Design choice**: Monthly reports provide a higher-level financial view than raw ledger entries, useful for strategic planning.
### Scratchpad Commands
#### `scratchpad read`
Read persistent notes.
#### `scratchpad write <content>`
Overwrite scratchpad contents.
#### `scratchpad append <content>`
Add to existing scratchpad.
#### `scratchpad clear`
Clear scratchpad.
**Design choice**: The scratchpad is critical for long simulations where LLM context gets truncated. The agent can store:
- Employee capability observations
- Strategic plans
- Financial projections
- Task priority lists
This compensates for context window limitations and tests whether the agent proactively maintains external memory.
## Error Handling
All commands return structured errors:
```json
{
  "error": "Insufficient prestige in research (have 2.3, need 4.0)"
}
```
**Design choice**: Descriptive error messages help the agent understand what went wrong and adjust its strategy, rather than failing silently or with cryptic messages.
## CLI Entry Point (`__main__.py`)
The CLI uses a command-line parser (likely Click or argparse) to route commands to handler functions. Each handler:
1. Opens a database session
2. Validates inputs
3. Performs the operation
4. Returns JSON output
5. Commits or rolls back the transaction
**Design choice**: Each CLI call is a self-contained transaction. This prevents partial state updates and ensures the simulation remains consistent.
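The transaction-per-call pattern can be sketched as a context manager. The SQLAlchemy-style session/factory names are assumptions, not the project's actual API:

```python
from contextlib import contextmanager

@contextmanager
def session_scope(session_factory):
    """One transaction per CLI call: commit on success, roll back on
    any error, always close. Sketch of the handler pattern described
    above, using an assumed SQLAlchemy-style session factory.
    """
    session = session_factory()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()
```

Each command handler wraps steps 1-5 in one such scope, so a validation failure mid-operation leaves no partial state behind.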

# Configuration System
**Location**: `src/yc_bench/config/`
## Overview
The configuration system uses Pydantic models validated from TOML preset files. It controls every aspect of the simulation: world generation parameters, difficulty tuning, agent behavior, and distribution specifications.
## Design Choices
### Pydantic Schema (`schema.py`)
The configuration hierarchy:
```
ExperimentConfig
├── AgentConfig       # LLM model, tools, retry settings
├── LoopConfig        # Turn budget, auto-resume threshold
├── SimConfig         # Simulation parameters
└── WorldConfig       # World generation parameters
    ├── CompanyConfig   # Initial funds, starting prestige
    ├── EmployeeConfig  # Team size, tier distribution, salary ranges
    ├── TaskConfig      # Task count, domain requirements, deadlines
    └── PrestigeConfig  # Decay rate, penalty multipliers, scaling
```
**Why Pydantic?**
- Type validation at load time (catch config errors early)
- Default values with optional overrides
- Discriminated unions for distribution specs
- Clear documentation through type annotations
- Serialization to/from TOML/JSON
### TOML Preset Files (`presets/`)
```toml
# medium.toml
[world]
initial_funds_cents = 500_000_00
[world.prestige]
decay_per_day = 0.005
penalty_fail_multiplier = 0.8
penalty_cancel_multiplier = 1.0
[world.tasks]
count = 200
deadline_qty_per_day = 11.0
[world.tasks.reward_funds]
type = "triangular"
min = 5000_00
mode = 15000_00
max = 50000_00
```
**Why TOML?** Human-readable, supports comments, natural hierarchy via sections, widely supported in Python. Better than JSON for config files (comments), simpler than YAML (fewer gotchas).
### Preset Hierarchy
| Preset | Focus | Key Characteristics |
|--------|-------|-------------------|
| `default.toml` | Base | All defaults; other presets override selectively |
| `tutorial.toml` | Learning | Relaxed deadlines, prestige-1 tasks only, high funds |
| `easy.toml` | Casual | Relaxed deadlines, flat prestige requirements |
| `medium.toml` | Standard | Prestige climbing, 2-domain tasks, 9-day deadlines |
| `hard.toml` | Challenge | Prestige gating active, 7-day deadlines, 1.5x cancel penalty |
| `nightmare.toml` | Extreme | Razor-thin margins, 6-day deadlines, 2x penalties |
**Design choice**: Preset-based difficulty rather than a single "difficulty slider" allows fine-grained control. Each preset can tune dozens of independent parameters.
### Config Loading (`loader.py`)
```python
def load_config(preset_name: str) -> ExperimentConfig:
    base = load_toml("default.toml")
    overlay = load_toml(f"{preset_name}.toml")
    merged = deep_merge(base, overlay)
    return ExperimentConfig(**merged)
```
**Design choice**: Config inheritance via deep merge. Presets only specify what differs from default, keeping preset files concise and maintainable.
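A minimal sketch of such a recursive merge (nested dicts merged key-by-key, scalars and lists replaced by the overlay):

```python
def deep_merge(base: dict, overlay: dict) -> dict:
    """Recursively merge overlay into base; overlay wins on conflicts.

    Sketch of the preset-inheritance merge described above.
    """
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"world": {"tasks": {"count": 200}, "prestige": {"decay_per_day": 0.005}}}
overlay = {"world": {"tasks": {"count": 300}}}
cfg = deep_merge(base, overlay)
# tasks.count is overridden; prestige settings are inherited from base
```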
## Distribution Specifications (`sampling.py`)
### The DistSpec System
Many world generation parameters use statistical distributions rather than fixed values:
```python
class DistSpec(BaseModel):
    """Discriminated union of distribution types."""
    type: Literal["triangular", "beta", "normal", "uniform", "constant"]
    # Parameters vary by type
```
**Supported distributions:**
| Type | Parameters | Use Case |
|------|-----------|----------|
| `triangular` | min, mode, max | Task rewards, skill rates (natural asymmetric bell curve) |
| `beta` | alpha, beta, scale | Prestige requirements (skewed toward low values) |
| `normal` | mean, std | Symmetric variation around a target |
| `uniform` | low, high | Equal probability across range |
| `constant` | value | Fixed value (no randomness) |
**Why discriminated unions?** Pydantic validates the correct parameters for each distribution type at load time. Invalid combinations (e.g., triangular with alpha parameter) are caught before the simulation runs.
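The same idea can be shown in a stdlib sketch (the real schema uses Pydantic discriminated unions; the class and registry names here are illustrative): each concrete spec class validates its own parameter set, and a dispatcher routes on the `type` field.

```python
from dataclasses import dataclass

@dataclass
class TriangularSpec:
    min: float
    mode: float
    max: float

    def __post_init__(self):
        if not self.min <= self.mode <= self.max:
            raise ValueError("triangular requires min <= mode <= max")

# The real schema registers all five types; one suffices to show dispatch.
SPEC_TYPES = {"triangular": TriangularSpec}

def parse_spec(raw: dict):
    """Dispatch on the 'type' discriminator, then let the concrete spec
    class validate its own parameters (wrong ones fail immediately)."""
    kind = raw.pop("type")
    if kind not in SPEC_TYPES:
        raise ValueError(f"unknown distribution type: {kind}")
    return SPEC_TYPES[kind](**raw)

spec = parse_spec({"type": "triangular", "min": 5000_00, "mode": 15000_00, "max": 50000_00})
```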
### Usage Example
```toml
[world.tasks.reward_funds]
type = "triangular"
min = 5000_00
mode = 15000_00
max = 50000_00
[world.employees.junior_rate]
type = "beta"
alpha = 2.0
beta = 5.0
scale = 3.0
```
## World Generation
### Seeding (`services/seed_world.py`)
```python
def seed_world_transactional(session, cfg, seed):
    rng = create_rng(seed)
    company = create_company(session, cfg.world.company)
    employees = generate_employees(session, company, cfg.world.employees, rng)
    tasks = generate_tasks(session, cfg.world.tasks, rng)
    sim_state = create_sim_state(session, company, cfg.sim, seed)
```
**Design choice**: Single-transaction world seeding ensures atomic creation. Either the entire world is created or nothing is -- no partial states.
### Employee Generation (`services/generate_employees.py`)
1. Generate N employees (default 10)
2. Assign tiers from configured distribution (e.g., 30/40/30 junior/mid/senior)
3. For each employee, sample 4 skill rates from per-tier distributions
4. Set salary based on tier range
### Task Generation (`services/generate_tasks.py`)
1. Generate M tasks (default 200+)
2. First 10 tasks are always prestige-1 (guaranteed accessible)
3. Remaining tasks have stratified prestige requirements
4. Each task gets 2-4 domain requirements sampled from distributions
5. Rewards scale with prestige and task size
**Design choice**: Stratified generation ensures:
- The agent always has starting tasks (prestige-1 guaranteed)
- Tasks span the full prestige range (progression is possible)
- No prestige "dead zones" where no tasks exist
### RNG Management (`services/rng.py`)
```python
def create_rng(seed: int) -> numpy.random.Generator:
    return numpy.random.default_rng(seed)
```
**Design choice**: Centralized RNG with explicit seed ensures full reproducibility. Same seed → same world → same event sequence (given same agent actions).
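The reproducibility property is easy to demonstrate (shown here with the stdlib `random` module as a stand-in for `numpy.random.default_rng`):

```python
import random

def create_rng(seed: int) -> random.Random:
    """Stdlib stand-in for numpy.random.default_rng: an explicitly
    seeded generator with no hidden global state."""
    return random.Random(seed)

rng_a = create_rng(42)
rng_b = create_rng(42)
draws_a = [rng_a.random() for _ in range(5)]
draws_b = [rng_b.random() for _ in range(5)]
# identical seeds yield bit-for-bit identical draw sequences
```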
## Key Configuration Parameters
### Financial Tuning
| Parameter | Default | Effect |
|-----------|---------|--------|
| `initial_funds_cents` | 50,000,000 ($500,000) | Starting capital |
| `reward_prestige_scale` | 0.15 | How much prestige amplifies rewards |
| `salary_bump_pct` | 1.0 | Per-completion salary increase |
### Prestige Tuning
| Parameter | Default | Effect |
|-----------|---------|--------|
| `prestige_decay_per_day` | 0.005 | Daily prestige loss |
| `penalty_fail_multiplier` | 0.8 | Prestige cost of late completion |
| `penalty_cancel_multiplier` | 1.0 | Prestige cost of cancellation |
| `prestige_min` | 1.0 | Floor value |
| `prestige_max` | 10.0 | Ceiling value |
### Task Tuning
| Parameter | Default | Effect |
|-----------|---------|--------|
| `deadline_qty_per_day` | 11.0 | Deadline generosity |
| `num_domains_per_task` | 2-4 | Multi-domain complexity |
| `progress_milestone_pct` | 50 | When to fire halfway event |
### Agent Tuning
| Parameter | Default | Effect |
|-----------|---------|--------|
| `max_turns` | 500 | Hard turn limit |
| `max_turns_without_resume` | 5 | Auto-resume threshold |
| `history_truncation` | 50 | Turns kept in context |

# Runner & Orchestration
**Location**: `src/yc_bench/runner/`
## Overview
The runner is the top-level orchestration layer that ties everything together: parsing arguments, loading configuration, initializing the database, seeding the world, starting the agent loop, and collecting results.
## Components
### Entry Point (`main.py`)
```python
def run_benchmark(args):
    # 1. Load configuration
    cfg = load_config(args.config)

    # 2. Initialize database
    engine, factory = init_db(db_path)

    # 3. Seed world
    with session_scope(factory) as session:
        seed_world_transactional(session, cfg, args.seed)

    # 4. Build agent runtime
    runtime = build_runtime(cfg.agent, args.model)

    # 5. Start dashboard (if TTY)
    dashboard = Dashboard(cfg) if is_tty() else None

    # 6. Run agent loop
    result = run_agent_loop(runtime, factory, cfg, dashboard)

    # 7. Save results
    save_rollout(result, args.output)
```
### Design Choices
#### Single-Command Invocation
```bash
uv run yc-bench run --model gemini/gemini-3-flash --seed 1 --config medium
```
**Why single command?** Benchmarks should be easy to reproduce. One command with explicit parameters (model, seed, config) fully specifies a run.
#### Database Per Run
Each run creates a fresh SQLite database:
```
db/run_seed1_medium_2025-03-15.sqlite
```
**Why per-run databases?**
- Isolation: runs can't interfere with each other
- Inspection: can analyze any run's final state after the fact
- Reproducibility: re-running with same seed produces identical database
- Parallelism: multiple runs can execute simultaneously
## Argument Parsing (`args.py`)
### Key Arguments
| Argument | Required | Description |
|----------|----------|-------------|
| `--model` | Yes | LLM model identifier (LiteLLM format) |
| `--seed` | Yes | Random seed for world generation |
| `--config` | No | Difficulty preset (default: "medium") |
| `--output` | No | Output path for rollout JSON |
| `--no-dashboard` | No | Disable live terminal UI |
| `--max-turns` | No | Override turn limit |
**Design choice**: Required arguments are minimal (model + seed). Everything else has sensible defaults. This reduces barrier to running benchmarks while allowing full customization.
## Dashboard (`dashboard.py`)
### Live Terminal UI
The dashboard uses [Rich](https://github.com/Textualize/rich) to display real-time simulation state:
```
┌─ YC-Bench Dashboard ───────────────────────────────┐
│ Model: claude-sonnet-4   Seed: 42   Config: medium │
│ Turn: 87/500             Sim Time: 2025-06-15      │
├────────────────────────────────────────────────────┤
│ Funds: $125,340          Runway: 4.2 months        │
│ Prestige: R:5.2  I:3.8  D:2.1  T:6.4               │
│ Active Tasks: 3   Completed: 12   Failed: 1        │
├────────────────────────────────────────────────────┤
│ Last Action: task assign abc123 emp456             │
│ Last Event:  task_completed (success)              │
└────────────────────────────────────────────────────┘
```
**Design choice**: The dashboard is for human observers, not the agent. It provides real-time visibility into benchmark runs without affecting agent behavior.
### Features
- Live fund tracking with trend indicators
- Prestige levels per domain
- Task status counters
- Recent agent actions
- Turn counter and simulation clock
- Auto-refreshes on each turn
### Conditional Activation
Dashboard only activates when running in a TTY (interactive terminal). Redirected output or CI environments get plain log output.
**Why conditional?** Batch runs (scripts/) shouldn't have terminal UI overhead. Detecting TTY ensures the right output mode automatically.
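The detection itself is a one-liner; a hypothetical helper combining it with the `--no-dashboard` flag:

```python
import sys

def use_dashboard(no_dashboard_flag: bool = False) -> bool:
    """Enable the live dashboard only for interactive terminals.

    Hypothetical helper: the flag disables it explicitly, and
    redirected output or CI runs fail the isatty() check, falling
    through to plain log output.
    """
    return not no_dashboard_flag and sys.stdout.isatty()
```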
## Session Management (`session.py`)
### Run Session
Manages the lifecycle of a single benchmark run:
```python
class RunSession:
    db_path: str
    config: ExperimentConfig
    model: str
    seed: int
    start_time: datetime

    def save_rollout(self, result):
        """Save final rollout JSON to results/"""

    def cleanup(self):
        """Clean up temporary resources"""
```
**Design choice**: Session object encapsulates all run-specific state, making it easy to serialize and manage runs.
## Batch Running (`scripts/`)
### Multi-Seed Runs
Scripts for running the same model across multiple seeds:
```bash
# Run seeds 1-10 with claude-sonnet on medium difficulty
for seed in $(seq 1 10); do
    uv run yc-bench run --model anthropic/claude-sonnet-4-20250514 --seed $seed --config medium
done
```
### Multi-Model Comparison
Scripts for comparing models on the same seeds:
```bash
for model in "anthropic/claude-sonnet-4-20250514" "openai/gpt-4o" "google/gemini-pro"; do
    uv run yc-bench run --model $model --seed 42 --config medium
done
```
**Design choice**: Simple shell scripts rather than a complex orchestration framework. This keeps the benchmark tooling minimal and transparent.
## Results & Output
### Rollout JSON
Each run produces a rollout file:
```
results/
├── claude-sonnet_seed1_medium.json
├── claude-sonnet_seed2_medium.json
├── gpt-4o_seed1_medium.json
└── ...
```
### Rollout Contents
```json
{
  "metadata": {
    "model": "anthropic/claude-sonnet-4-20250514",
    "seed": 1,
    "config": "medium",
    "start_time": "2025-03-15T10:00:00",
    "end_time": "2025-03-15T10:45:00"
  },
  "outcome": "horizon_end",
  "final_state": {
    "funds_cents": 25000000,
    "prestige": {"research": 7.2, "inference": 5.1, ...},
    "tasks_completed": 24,
    "tasks_failed": 3,
    "tasks_cancelled": 1,
    "turns_used": 187
  },
  "transcript": [
    {"turn": 1, "action": "company status", "result": {...}},
    ...
  ]
}
```
### Plots (`plots/`)
Visualization scripts for comparing model performance:
- Funds over time
- Prestige progression per domain
- Task completion rates
- Comparison charts across models/seeds
**Design choice**: Separate plotting from the benchmark runner. Results are stored as data (JSON); visualization is a post-processing step.
## Error Recovery
### Crash Recovery
If a run crashes (LLM timeout, OOM, etc.):
- The SQLite database persists with the last consistent state
- Rollout JSON may be partial but includes transcript up to the crash
- Re-running with the same seed starts fresh (no resume from crash)
**Design choice**: No crash recovery by design. Benchmark runs should be atomic -- either complete or re-run. This prevents partial results from contaminating comparisons.
### Graceful Shutdown
On SIGINT (Ctrl+C):
- Current turn completes
- Partial rollout is saved
- Database is committed
- Dashboard is cleaned up
**Design choice**: Graceful shutdown preserves whatever data exists, useful for debugging long runs that need to be interrupted.
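One way to implement this sequence is to convert SIGINT into a flag checked between turns; a sketch with assumed names:

```python
import signal

class GracefulShutdown:
    """Convert SIGINT into a flag polled between turns.

    Sketch of the shutdown sequence described above: the handler only
    sets a flag, so the current turn completes before the loop saves
    a partial rollout and exits.
    """
    def __init__(self):
        self.requested = False

    def install(self):
        # Register the handler (must run on the main thread)
        signal.signal(signal.SIGINT, self._handle)

    def _handle(self, signum, frame):
        self.requested = True  # don't raise: let the current turn finish

# Inside the agent loop (names hypothetical):
#     shutdown = GracefulShutdown(); shutdown.install()
#     while not terminal:
#         run_turn()
#         if shutdown.requested:
#             save_partial_rollout()
#             break
```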