mirror of
https://github.com/collinear-ai/yc-bench.git
synced 2026-04-19 12:58:03 +00:00
Add system design documentation for yc-bench
Comprehensive documentation covering all major subsystems: simulation engine, data models, task system, prestige, finances, employees, agent layer, CLI interface, configuration, and runner. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in: parent `b1cd7ebfb2`, commit `ecd3d9e415`
11 changed files with 1858 additions and 0 deletions
system_design/00_overview.md (new file, +98 lines)
# YC-Bench: System Overview

## What is YC-Bench?

YC-Bench is a **long-horizon deterministic benchmark for LLM agents**. It simulates an AI startup CEO managing a company over 1-3 years through a CLI-based interface against a SQLite-backed discrete-event simulation engine. The benchmark tests sustained decision-making over hundreds of turns under compounding financial, prestige, and deadline pressures.

## Core Premise

An LLM agent is dropped into the role of CEO of a small AI startup. It must:

- Browse and accept tasks from a marketplace
- Assign employees to tasks across 4 technical domains
- Manage cash flow (payroll, rewards, penalties)
- Build prestige in each domain to unlock higher-tier tasks
- Survive until the simulation horizon ends without going bankrupt
## Key Metrics

| Dimension | Details |
|-----------|---------|
| Codebase | ~4,975 lines of Python |
| Employees | 10 (hidden per-domain skill rates) |
| Market Tasks | 200+ (configurable) |
| Domains | 4: research, inference, data_environment, training |
| Prestige Range | 1.0 - 10.0 per domain |
| Difficulty Presets | tutorial, easy, medium, hard, nightmare |
## High-Level Architecture

```
┌─────────────────────────────────────────────────────┐
│                    Runner / CLI                     │
│  (argument parsing, dashboard, session management)  │
├─────────────────────────────────────────────────────┤
│                     Agent Layer                     │
│  (LLM runtime, agent loop, tools, prompt building)  │
├─────────────────────────────────────────────────────┤
│                CLI Command Interface                │
│  (company, employee, market, task, sim, finance,    │
│   report, scratchpad)                               │
├─────────────────────────────────────────────────────┤
│              Simulation Engine (core/)              │
│  (event processing, ETA solving, progress tracking, │
│   business time, prestige decay)                    │
├─────────────────────────────────────────────────────┤
│                  Data Layer (db/)                   │
│  (SQLAlchemy ORM models, session management)        │
├─────────────────────────────────────────────────────┤
│           Configuration & World Generation          │
│  (Pydantic schemas, TOML presets, seeding, RNG)     │
└─────────────────────────────────────────────────────┘
```
## Directory Map

```
~/yc_bench_fixed/
├── src/yc_bench/
│   ├── __main__.py        # CLI entry point
│   ├── agent/             # Agent runtime and loop
│   ├── cli/               # Agent-facing CLI commands
│   ├── core/              # Simulation engine
│   ├── db/                # ORM models & session
│   ├── config/            # Pydantic schemas + TOML presets
│   ├── services/          # World generation & RNG
│   └── runner/            # Benchmark orchestration
├── scripts/               # Batch running scripts
├── db/                    # SQLite databases (runtime)
├── results/               # Output JSON rollouts
├── plots/                 # Result visualizations
├── pyproject.toml         # Package definition (uv-based)
└── uv.lock                # Lock file
```
## Execution Flow

1. User runs: `uv run yc-bench run --model <model> --seed 1 --config medium`
2. Runner loads config, initializes DB, seeds world, starts agent loop
3. Agent receives system prompt with company context and available CLI tools
4. Each turn: agent calls CLI commands via `run_command` tool, optionally `python_repl`
5. Agent calls `yc-bench sim resume` to advance simulation time
6. Simulation processes events (completions, payroll, milestones) and returns wake events
7. Loop continues until bankruptcy or horizon end
8. Output: rollout JSON transcript + SQLite game state
## Design Documents

| File | Topic |
|------|-------|
| [01_simulation_engine.md](01_simulation_engine.md) | Core simulation engine and event processing |
| [02_data_models.md](02_data_models.md) | Database schema and ORM design |
| [03_task_system.md](03_task_system.md) | Task lifecycle, ETA, and progress |
| [04_prestige_system.md](04_prestige_system.md) | Prestige mechanics, decay, and gating |
| [05_financial_model.md](05_financial_model.md) | Funds, payroll, ledger, and bankruptcy |
| [06_employee_model.md](06_employee_model.md) | Employee skills, throughput, and growth |
| [07_agent_layer.md](07_agent_layer.md) | LLM runtime, agent loop, and tools |
| [08_cli_interface.md](08_cli_interface.md) | CLI command groups and JSON output |
| [09_configuration.md](09_configuration.md) | Config schema, presets, and world generation |
| [10_runner_orchestration.md](10_runner_orchestration.md) | Benchmark runner, dashboard, and session |
system_design/01_simulation_engine.md (new file, +147 lines)
# Simulation Engine

**Location**: `src/yc_bench/core/`

## Design Choice: Discrete-Event Simulation

YC-Bench uses a **discrete-event simulation (DES)** model rather than a tick-based approach. This was chosen because:

1. **Determinism**: Events are processed in a fixed, reproducible order given the same seed
2. **Efficiency**: Time jumps between events rather than iterating every hour/day
3. **Clarity**: Each state change corresponds to a meaningful event, making the simulation auditable
## Core Loop (`engine.py`)

The `advance_time()` function is the heart of the simulation:

```
advance_time(session, company_id, cfg) → AdvanceResult
```

### Algorithm

1. **Flush progress** on all active tasks (convert elapsed business hours into completed work)
2. **Apply prestige decay** for elapsed days
3. **Process payroll** if crossing a month boundary (first business day)
4. **Fetch next unconsumed event** ordered by `(scheduled_at, priority)`
5. **Dispatch to handler** based on event type
6. **Recalculate ETAs** for affected tasks
7. **Update sim_time** to the event's timestamp
8. **Return wake events** to the agent
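The event-fetch and clock-jump steps above can be sketched with a priority heap. This is a toy illustration, not the project's actual `engine.py`; the `Event` class and simplified `advance_time` signature are hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from datetime import datetime

@dataclass(order=True)
class Event:
    # Heap ordering follows the documented key: (scheduled_at, priority)
    scheduled_at: datetime
    priority: int
    event_type: str = field(compare=False)

def advance_time(sim_time, queue):
    """Pop the next event and jump the simulation clock to its timestamp."""
    if not queue:
        return sim_time, None
    event = heapq.heappop(queue)      # step 4: next unconsumed event
    # steps 1-3 and 5-6 (flush, decay, payroll, dispatch, ETA recalc) would run here
    return event.scheduled_at, event  # step 7: sim_time jumps to the event

# Usage: two events at the same timestamp resolve by priority
t = datetime(2026, 1, 5, 9, 0)
q = []
heapq.heappush(q, Event(t, 2, "bankruptcy"))
heapq.heappush(q, Event(t, 1, "task_completed"))
now, ev = advance_time(datetime(2026, 1, 1), q)
print(ev.event_type)  # task_completed
```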
### Why "Resume" Rather Than Auto-Advance?

The agent explicitly calls `yc-bench sim resume` to advance time. This design:

- Gives the agent control over pacing (plan before advancing)
- Creates a natural decision checkpoint between simulation steps
- Allows multiple CLI queries before committing to advancing
- Guards against stalls: if the agent goes N turns without resuming, the loop forces a resume automatically
## Event System (`events.py`)

### Event Types (Priority Order)

| Priority | Event Type | Trigger |
|----------|-----------|---------|
| 1 | `task_completed` | Task reaches 100% in all domain requirements |
| 2 | `bankruptcy` | Funds drop below zero after payroll |
| 3 | `task_half` | Task reaches 50% progress milestone |
| 4 | `horizon_end` | Simulation time limit reached |
### Design Choice: Fixed Priority Ordering

Events at the same timestamp are processed in strict priority order. This ensures:

- Task completions (and their rewards) are processed before bankruptcy checks
- A task finishing on the same day as payroll can save the company from bankruptcy
- Deterministic behavior regardless of insertion order
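The same-day rescue behavior can be demonstrated in a few lines. This is a hypothetical illustration (the priority map and `process` function are not the project's code): a task reward landing at the same timestamp as a bankruptcy check is applied first, so the solvency check sees the post-reward balance.

```python
# Priorities as documented: lower number = processed first at equal timestamps.
PRIORITY = {"task_completed": 1, "bankruptcy": 2, "task_half": 3, "horizon_end": 4}

def process(events, funds_cents):
    bankrupt = False
    for ts, etype, payload in sorted(events, key=lambda e: (e[0], PRIORITY[e[1]])):
        if etype == "task_completed":
            funds_cents += payload["reward_cents"]   # reward lands first
        elif etype == "bankruptcy":
            bankrupt = funds_cents < 0               # checked against post-reward balance
    return funds_cents, bankrupt

# Company is -$100 after payroll, but a $500 reward arrives the same instant
events = [(10, "bankruptcy", {}), (10, "task_completed", {"reward_cents": 50_000})]
funds, bankrupt = process(events, -10_000)
print(funds, bankrupt)  # 40000 False — the completion saved the company
```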
### Event Identity (Deterministic UUIDs)

Event IDs use `uuid5` based on payload + timestamp + dedupe_key. This means:

- Same world state produces identical event IDs
- Deduplication is automatic (re-inserting same event is a no-op)
- Full reproducibility across runs with same seed
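A minimal sketch of content-derived event IDs using the stdlib `uuid.uuid5`. The namespace string and key layout here are assumptions, not the project's actual scheme; the point is that a stable serialization of the same inputs always yields the same UUID.

```python
import json
import uuid

# Hypothetical namespace for sim events
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "yc-bench/sim-event")

def event_id(event_type, scheduled_at, payload, dedupe_key):
    key = json.dumps(
        {"type": event_type, "at": scheduled_at, "payload": payload, "dedupe": dedupe_key},
        sort_keys=True,  # stable serialization → stable ID
    )
    return uuid.uuid5(NAMESPACE, key)

a = event_id("task_completed", "2026-03-01T09:00", {"task": "t1"}, "t1-complete")
b = event_id("task_completed", "2026-03-01T09:00", {"task": "t1"}, "t1-complete")
print(a == b)  # True — identical inputs, identical UUID, so re-insertion dedupes
```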
## Event Handlers (`handlers/`)

### `task_complete.py`

- Finalizes all domain progress to 100%
- Success check: `sim_time <= deadline`
- On success: add reward funds, add prestige per domain, boost employee skill rates, apply 1% salary bump
- On failure (late): apply prestige penalty per domain (configurable multiplier)

### `task_half.py`

- Marks progress milestone reached
- Informational event for agent awareness (no state changes beyond flag)

### `bankruptcy.py`

- Triggered when `funds_cents < 0` after payroll
- Terminates the simulation with bankruptcy outcome

### `horizon_end.py`

- Triggered at configured simulation end date
- Terminates the simulation with final scoring
## Progress Tracking (`progress.py`)

### Effective Rate Calculation

```
effective_rate = base_rate_per_hour / num_active_tasks_for_this_employee
```

**Design choice**: Throughput splitting creates a resource allocation puzzle. An employee assigned to 3 tasks works at 1/3 speed on each. The agent must balance parallelism vs. focus.

### Progress Flush

When `advance_time()` runs, it calculates work done since the last flush:

```
work = effective_rate × business_hours_elapsed
completed_qty += work   (capped at required_qty)
```
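The flush rule above can be written as a small pure function. A sketch with hypothetical parameter names, not the project's `progress.py`:

```python
def flush_progress(base_rate_per_hour, num_active_tasks,
                   business_hours_elapsed, completed_qty, required_qty):
    """Split the employee's rate across active tasks, then apply elapsed hours."""
    effective_rate = base_rate_per_hour / num_active_tasks
    work = effective_rate * business_hours_elapsed
    return min(completed_qty + work, required_qty)  # capped at the requirement

# 2.0 units/hour split across 2 tasks, over 8 business hours → 8 units of progress
print(flush_progress(2.0, 2, 8.0, completed_qty=10.0, required_qty=100.0))  # 18.0
```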
## Business Time (`business_time.py`)

### Design Choice: Business Hours Only

Work only happens during business hours (weekdays, configurable hours per day). This adds:

- Realistic scheduling constraints
- Weekend gaps that affect deadline calculations
- A reason for the agent to think about calendar timing
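A minimal sketch of counting business hours between two dates, assuming weekday-only work and a configurable `hours_per_day`. The function name and signature are illustrative, not the project's `business_time.py` API.

```python
from datetime import date, timedelta

def business_hours(start, end, hours_per_day=8.0):
    """Count working hours in [start, end), crediting weekdays only."""
    total = 0.0
    d = start
    while d < end:
        if d.weekday() < 5:          # Mon-Fri; weekends contribute nothing
            total += hours_per_day
        d += timedelta(days=1)
    return total

# Friday 2026-01-02 through Monday 2026-01-05: only Friday counts
print(business_hours(date(2026, 1, 2), date(2026, 1, 5)))  # 8.0
```

This is why a deadline that spans a weekend effectively shortens the working time available, as noted above.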
## ETA Solver (`eta.py`)

### Completion Time

```
solve_task_completion_time():
    for each domain d:
        remaining[d] = required_qty[d] - completed_qty[d]
        rate[d]      = sum(effective_rate for assigned employees with skill in d)
        time[d]      = remaining[d] / rate[d]
    completion_time = max(time[d]) across all domains
```

### Design Choice: Multi-Domain Bottleneck

A task completes only when ALL domains finish, so the slowest domain determines completion time. This creates assignment puzzles where the agent must identify and address bottlenecks.

### Halfway Time

Used for progress milestone events. Calculated as the weighted midpoint across domains.
## Prestige Decay

```
apply_prestige_decay(session, company_id, days_elapsed, cfg):
    for each domain:
        prestige -= decay_per_day × days_elapsed
        prestige = max(prestige, prestige_min)   # floor at 1.0
```

**Design choice**: Decay prevents "set and forget" strategies. The agent must continuously work in domains to maintain access to high-tier tasks. Neglected domains revert to baseline.
system_design/02_data_models.md (new file, +190 lines)
# Data Models & Database Design

**Location**: `src/yc_bench/db/`

## Design Choice: SQLAlchemy ORM with SQLite

The benchmark uses SQLAlchemy's declarative ORM over SQLite for several reasons:

1. **Single-file persistence**: SQLite stores the entire game state in one file, making runs portable and inspectable
2. **Transactional safety**: ACID guarantees prevent partial state updates
3. **Query flexibility**: SQL allows complex queries for financial reports, task filtering, etc.
4. **Dual-backend support**: The same ORM works with PostgreSQL via the `DATABASE_URL` environment variable for production/scaling scenarios
## Schema Overview

```
┌──────────────┐      ┌───────────────────┐
│   Company    │────<│  CompanyPrestige   │  (1 per domain × company)
└──────┬───────┘      └───────────────────┘
       │
       ├────<┌──────────────┐      ┌───────────────────┐
       │     │   Employee   │────<│ EmployeeSkillRate  │  (1 per domain × employee)
       │     └──────┬───────┘      └───────────────────┘
       │            │
       │            │    ┌────────────────┐
       │            └───<│ TaskAssignment │  (employee ↔ task junction)
       │                 └────────┬───────┘
       │                          │
       ├────<┌──────────┐────────┘
       │     │   Task   │────<┌─────────────────┐
       │     └──────────┘     │ TaskRequirement │  (1 per domain × task)
       │                      └─────────────────┘
       │
       ├────<┌──────────────┐
       │     │   SimEvent   │  (discrete events queue)
       │     └──────────────┘
       │
       ├────<┌──────────────┐
       │     │ LedgerEntry  │  (financial transactions)
       │     └──────────────┘
       │
       ├────<┌──────────────┐
       │     │   SimState   │  (simulation clock & counters)
       │     └──────────────┘
       │
       └────<┌──────────────┐
             │  Scratchpad  │  (agent persistent memory)
             └──────────────┘
```
## Model Details

### Company (`models/company.py`)

| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID (PK) | Auto-generated |
| `name` | String | Company name |
| `funds_cents` | BigInteger | Financial balance in cents |

**Design choice**: Funds are stored in cents (integer) to avoid floating-point rounding errors in financial calculations. BigInteger supports very large or negative values.

### CompanyPrestige (`models/company.py`)

| Column | Type | Notes |
|--------|------|-------|
| `company_id` | UUID (FK) | References Company |
| `domain` | String | research / inference / data_environment / training |
| `prestige_level` | Float | Range [1.0, 10.0] |

**Design choice**: Prestige is tracked per-domain rather than as a single score. This forces specialization trade-offs and creates a 4-dimensional progression space.
### Employee (`models/employee.py`)

| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID (PK) | Auto-generated |
| `company_id` | UUID (FK) | References Company |
| `name` | String | Employee name |
| `tier` | String | junior / mid / senior |
| `work_hours_per_day` | Float | Hours available per business day |
| `salary_cents` | BigInteger | Monthly salary in cents |

### EmployeeSkillRate (`models/employee.py`)

| Column | Type | Notes |
|--------|------|-------|
| `employee_id` | UUID (FK) | References Employee |
| `domain` | String | One of 4 domains |
| `rate_domain_per_hour` | Float | Work units produced per hour |

**Design choice**: Skill rates are **hidden from the agent**. The agent sees tier and salary but not per-domain effectiveness. This creates an information-asymmetry puzzle: the agent must infer employee strengths from task outcomes.
### Task (`models/task.py`)

| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID (PK) | Auto-generated |
| `company_id` | UUID (FK, nullable) | NULL = market task, set on acceptance |
| `status` | Enum | market → planned → active → completed_success / completed_fail / cancelled |
| `title` | String | Task description |
| `required_prestige` | Float | Minimum prestige needed in ALL task domains |
| `reward_funds_cents` | BigInteger | Payment on successful completion |
| `reward_prestige_delta` | Float | Prestige gained per domain on success |
| `skill_boost_pct` | Float | Employee skill rate increase on success |
| `accepted_at` | DateTime (nullable) | When task was accepted from market |
| `deadline` | DateTime (nullable) | Calculated at acceptance |
| `completed_at` | DateTime (nullable) | When task finished |
| `success` | Boolean (nullable) | True = on-time, False = late |
| `progress_milestone_pct` | Float | Tracks progress milestones (e.g., 50%) |

**Design choice**: A nullable `company_id` elegantly distinguishes market tasks (available for browsing) from accepted tasks (owned by the company).
### TaskRequirement (`models/task.py`)

| Column | Type | Notes |
|--------|------|-------|
| `task_id` | UUID (FK) | References Task |
| `domain` | String | Which domain this requirement covers |
| `required_qty` | Float | Total work units needed |
| `completed_qty` | Float | Work units completed so far |

**Design choice**: Multi-domain requirements make tasks a multi-dimensional optimization problem. A task might need work in 2-4 domains simultaneously.

### TaskAssignment (`models/task.py`)

| Column | Type | Notes |
|--------|------|-------|
| `task_id` | UUID (FK) | References Task |
| `employee_id` | UUID (FK) | References Employee |
| `assigned_at` | DateTime | When assigned |

**Design choice**: Many-to-many junction table. An employee can work on multiple tasks (throughput splits), and a task can have multiple employees (parallel progress).
### SimEvent (`models/event.py`)

| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID (PK) | Deterministic (uuid5) |
| `company_id` | UUID (FK) | References Company |
| `event_type` | String | task_completed / bankruptcy / task_half / horizon_end |
| `scheduled_at` | DateTime | When event triggers |
| `payload` | JSON | Event-specific data |
| `dedupe_key` | String | Prevents duplicate events |
| `consumed` | Boolean | True after processing |

### LedgerEntry (`models/ledger.py`)

| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID (PK) | Auto-generated |
| `company_id` | UUID (FK) | References Company |
| `occurred_at` | DateTime | Transaction timestamp |
| `category` | Enum | MONTHLY_PAYROLL / TASK_REWARD / TASK_FAIL_PENALTY / TASK_CANCEL_PENALTY |
| `amount_cents` | BigInteger | Signed amount (negative = cost) |
| `ref_type` | String (nullable) | Reference entity type |
| `ref_id` | UUID (nullable) | Reference entity ID |

**Design choice**: An immutable, append-only ledger provides a complete financial audit trail. No entries are ever deleted or modified.
### SimState (`models/sim_state.py`)

| Column | Type | Notes |
|--------|------|-------|
| `company_id` | UUID (FK, PK) | References Company |
| `sim_time` | DateTime | Current simulation clock |
| `run_seed` | Integer | RNG seed for reproducibility |
| `horizon_end` | DateTime | When simulation ends |
| `replenish_counter` | Integer | Tracks market task replenishment |

### Scratchpad (`models/scratchpad.py`)

| Column | Type | Notes |
|--------|------|-------|
| `company_id` | UUID (FK) | References Company |
| `content` | Text | Free-form agent notes |

**Design choice**: The scratchpad survives LLM context truncation, giving the agent persistent memory across the full simulation.
## Session Management (`session.py`)

```python
session_scope(factory) → context manager
```

- Creates a scoped session with automatic commit/rollback
- Supports both SQLite (default) and PostgreSQL (via `DATABASE_URL`)
- `init_db()` creates all tables from ORM metadata

**Design choice**: The context manager pattern ensures every database operation is properly transacted, preventing partial state updates that would corrupt the simulation.
system_design/03_task_system.md (new file, +144 lines)
# Task System

**Location**: `src/yc_bench/cli/task_commands.py`, `src/yc_bench/core/eta.py`, `src/yc_bench/core/progress.py`

## Task Lifecycle

```
market ──accept──> planned ──dispatch──> active ──complete──> completed_success
                      │                     │                 completed_fail
                      │                     │
                      └──cancel──> cancelled <──cancel──┘
```
### States

| Status | Meaning |
|--------|---------|
| `market` | Available for browsing, not yet accepted |
| `planned` | Accepted by company, employees can be assigned |
| `active` | Dispatched, work is progressing |
| `completed_success` | Finished on time |
| `completed_fail` | Finished late (past deadline) |
| `cancelled` | Abandoned by agent |
## Design Choices

### Two-Phase Activation (Accept → Dispatch)

Tasks go through `planned` before `active`. This separation:

1. **Allows pre-assignment**: The agent can assign employees before starting the clock
2. **Starts the deadline at accept**: Planning time counts against the deadline, creating urgency
3. **Forces commitment**: Accepting a task reserves it, but the agent must still dispatch
### Deadline Calculation

```
deadline = accepted_at + max(required_qty[d] for all domains d) / deadline_qty_per_day
```

**Design choice**: The deadline is proportional to the largest single-domain requirement, not the sum. Multi-domain tasks therefore don't get proportionally more time; they require parallel work.
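A worked sketch of the formula above. The function name and `deadline_qty_per_day` value are illustrative assumptions:

```python
from datetime import datetime, timedelta

def compute_deadline(accepted_at, required_qty, deadline_qty_per_day):
    """Deadline scales with the largest single-domain requirement, not the sum."""
    days = max(required_qty.values()) / deadline_qty_per_day
    return accepted_at + timedelta(days=days)

accepted = datetime(2026, 3, 1)
# 100 research units + 50 training units at 10 qty/day:
# only the 100-unit bottleneck sets the deadline (10 days, not 15)
print(compute_deadline(accepted, {"research": 100.0, "training": 50.0}, 10.0))
```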
### Prestige Gating at Accept Time

```python
def task_accept(task_id):
    for domain in task.requirements:
        if company_prestige[domain] < task.required_prestige:
            reject(f"Insufficient prestige in {domain}")
```

**Design choice**: The prestige check is per-domain. A task requiring prestige 3.0 with requirements in `research` and `inference` needs prestige >= 3.0 in BOTH domains. This prevents gaming by maxing one domain.
### Cancel Penalties

Cancelling an active task incurs:

- Prestige penalty: `reward_prestige_delta × cancel_multiplier` (configurable per difficulty)
- No financial penalty (just lost opportunity)

**Design choice**: Cancel penalties prevent the strategy of accepting everything and dropping what's inconvenient. Higher difficulties increase the cancel multiplier.
## Employee Assignment

### Assignment Rules

- Employees can only be assigned to `planned` or `active` tasks
- An employee can work on multiple tasks simultaneously (throughput splits)
- Multiple employees can work on the same task (parallel progress)

### Throughput Splitting

```
effective_rate = base_rate_per_hour / num_active_tasks
```

**Design choice**: Linear throughput splitting creates a fundamental trade-off:

- **Focus**: 1 employee on 1 task = full speed
- **Parallel**: 1 employee on 3 tasks = 1/3 speed each
- The agent must decide between fast completion of few tasks vs. slow progress on many
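The focus-vs-parallel trade-off can be made concrete: with linear splitting, total throughput is conserved, so what changes is *when* individual tasks finish. A hypothetical sketch (one employee at 1 unit/hour, three 40-unit tasks):

```python
def focus_schedule(qty, n_tasks, rate):
    """Work tasks one at a time: each task's completion time in hours."""
    return [qty / rate * (i + 1) for i in range(n_tasks)]

def parallel_schedule(qty, n_tasks, rate):
    """Split the rate evenly: identical tasks all finish at the same (late) time."""
    return [qty / (rate / n_tasks)] * n_tasks

print(focus_schedule(40, 3, 1.0))     # [40.0, 80.0, 120.0] — first reward arrives early
print(parallel_schedule(40, 3, 1.0))  # [120.0, 120.0, 120.0] — all rewards arrive late
```

Focusing front-loads rewards (useful against deadlines and cash crunches), while parallelism delays every payout to the same final hour.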
## Progress Tracking (`progress.py`)

### How Work Gets Done

Progress is calculated lazily during `advance_time()`:

```python
for task in active_tasks:
    for employee in task.assignments:
        for req in task.requirements:
            work = employee.skill_rate[req.domain] / employee.num_active_tasks * business_hours
            req.completed_qty = min(req.completed_qty + work, req.required_qty)
```

### Multi-Domain Completion

A task is complete when ALL domain requirements reach `completed_qty >= required_qty`. The slowest domain is the bottleneck.

**Design choice**: This creates interesting optimization puzzles. If a task needs 100 units of research and 50 units of training, the agent should allocate more research-skilled employees to balance completion times.
## ETA Solver (`eta.py`)

### Completion Time Calculation

```python
def solve_task_completion_time(task, assignments, sim_time):
    hours_needed = {}
    for d in task.domains:
        remaining = required_qty[d] - completed_qty[d]
        rate = sum(effective_rate[emp][d] for emp in assignments)
        if rate == 0:
            return infinity  # no one can work on this domain
        hours_needed[d] = remaining / rate

    max_hours = max(hours_needed.values())
    return sim_time + max_hours  # advanced in business hours
```

### Halfway Time Calculation

Used for milestone events. Finds the time when the weighted average across domains reaches 50%.

### When ETAs Are Recalculated

- Task dispatched (new active task)
- Employee assigned/unassigned
- Task completed (frees employee throughput for other tasks)
- Task cancelled (likewise frees throughput)

**Design choice**: Dynamic ETA recalculation ensures events are always accurate. When an employee is reassigned, all affected tasks get new completion projections.
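The pseudocode above can be made runnable. This is a sketch with assumed dict-based inputs, not the project's actual `eta.py` signature; it returns hours rather than a timestamp to stay self-contained:

```python
import math

def solve_completion_hours(required, completed, rates):
    """Hours until every domain finishes; the slowest domain is the bottleneck."""
    hours_needed = 0.0
    for domain, req in required.items():
        remaining = req - completed.get(domain, 0.0)
        rate = rates.get(domain, 0.0)   # summed effective rates of assignees
        if remaining > 0 and rate == 0:
            return math.inf             # no one can work this domain
        if rate > 0:
            hours_needed = max(hours_needed, remaining / rate)
    return hours_needed

eta = solve_completion_hours(
    required={"research": 100.0, "training": 50.0},
    completed={"research": 20.0},
    rates={"research": 2.0, "training": 1.0},
)
print(eta)  # research: 80/2 = 40h; training: 50/1 = 50h → bottleneck is 50.0
```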
## Market Task Generation

See [09_configuration.md](09_configuration.md) for details on how market tasks are generated with stratified prestige distribution and randomized requirements.

### Browsing and Filtering

The `market browse` command supports:

- Domain filter
- Prestige range filter
- Reward range filter
- Pagination (offset/limit)

All output is JSON for agent consumption.
system_design/04_prestige_system.md (new file, +123 lines)
# Prestige System

**Location**: `src/yc_bench/db/models/company.py` (CompanyPrestige), `src/yc_bench/core/engine.py` (decay), `src/yc_bench/core/handlers/task_complete.py` (rewards/penalties)

## Overview

Prestige is YC-Bench's core progression mechanic. It controls access to higher-tier tasks (which offer better rewards) and decays over time, forcing continuous engagement.

## Design Choices
### Per-Domain Prestige (4 Independent Tracks)

```
research:          ████████░░  (8.0)
inference:         ██████░░░░  (6.0)
data_environment:  ███░░░░░░░  (3.0)
training:          █████░░░░░  (5.0)
```

**Why 4 domains?** This creates a 4-dimensional strategic space:

- The agent can't max all domains simultaneously (decay + limited employees)
- Specialization unlocks high-tier tasks in 1-2 domains
- Diversification provides resilience but slower progression
- Multi-domain tasks require balanced prestige across their domains
### Prestige Range: [1.0, 10.0]

| Level | Meaning |
|-------|---------|
| 1.0 | Minimum (starting/decayed) |
| 3.0-4.0 | Mid-tier tasks accessible |
| 7.0-8.0 | High-tier tasks accessible |
| 10.0 | Maximum (hard cap) |

**Design choice**: The 1-10 range is intuitive and provides enough granularity for meaningful gating tiers without over-complicating the system.
## Prestige Gain

On successful task completion (on-time):

```
for each domain in task.requirements:
    company_prestige[domain] += task.reward_prestige_delta
    company_prestige[domain] = min(company_prestige[domain], 10.0)   # cap
```

**Design choice**: Prestige gain is per-domain and tied to the task's requirements. Completing a research+inference task only boosts those two domains, not training or data_environment.
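The gain rule with the [1.0, 10.0] clamp can be sketched in a few lines (names hypothetical):

```python
PRESTIGE_MIN, PRESTIGE_MAX = 1.0, 10.0

def apply_gain(prestige, domains, delta):
    for d in domains:
        prestige[d] = min(prestige[d] + delta, PRESTIGE_MAX)  # gain is capped at 10.0

p = {"research": 9.8, "inference": 4.0, "training": 2.0, "data_environment": 1.0}
apply_gain(p, ["research", "inference"], 0.5)  # only the task's domains move
print(p["research"], p["inference"], p["training"])  # 10.0 4.5 2.0
```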
### Prestige Scaling of Rewards

```
actual_reward = base_reward × (1 + reward_prestige_scale × (prestige - 1))
```

Higher prestige in a domain means better financial returns from tasks in that domain. This creates a virtuous cycle: more prestige → more money → more capacity → more prestige.
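A worked example of the scaling formula. The `scale = 0.1` value is an assumption for illustration, not a documented default:

```python
def scaled_reward(base_reward_cents, prestige, scale=0.1):
    # actual_reward = base_reward × (1 + scale × (prestige − 1))
    return round(base_reward_cents * (1 + scale * (prestige - 1)))

# A $1,000 task at prestige 1.0 vs prestige 8.0 in the relevant domain:
print(scaled_reward(100_000, 1.0))  # 100000 — baseline, no bonus
print(scaled_reward(100_000, 8.0))  # 170000 — +70% at high prestige
```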
## Prestige Loss

### Decay (Daily)

```
prestige -= decay_per_day × days_elapsed
prestige = max(prestige, 1.0)   # floor
```

Default decay: 0.005 prestige per day. This is slow enough not to punish short gaps, but fast enough that inactive domains eventually return to baseline.

**Design choice**: Continuous decay prevents "build once, exploit forever" strategies. The agent must continuously complete tasks in a domain to maintain access.
### Failure Penalty

On late task completion:

```
for each domain in task.requirements:
    company_prestige[domain] -= task.reward_prestige_delta × fail_multiplier
    company_prestige[domain] = max(company_prestige[domain], 1.0)
```

Default `fail_multiplier`: 0.8. Late completion costs almost as much prestige as success would have gained.
### Cancel Penalty

On task cancellation:

```
for each domain in task.requirements:
    company_prestige[domain] -= task.reward_prestige_delta × cancel_multiplier
    company_prestige[domain] = max(company_prestige[domain], 1.0)
```

Cancel multipliers vary by difficulty (higher on hard/nightmare).
## Prestige Gating

Tasks have a `required_prestige` field. At task acceptance:

```python
for domain in task.requirements:
    if company_prestige[domain] < task.required_prestige:
        reject()  # must meet prestige in ALL task domains
```

**Design choice**: Per-domain gating means a task with `required_prestige=5.0` and requirements in research + training needs prestige >= 5.0 in BOTH research AND training. This prevents gaming.
### Stratified Market Tasks

The first 10 market tasks are always prestige-1 (accessible immediately). Higher-prestige tasks are introduced with a stratified distribution. This ensures:

- The agent always has something to work on initially
- Progression is visible (new tasks unlock as prestige grows)
- No dead-end states where the agent can't accept any task
## Strategic Implications

The prestige system creates several key strategic tensions:

1. **Specialize vs. Diversify**: Focus on 1-2 domains for deep access, or spread across all 4?
2. **Risk vs. Reward**: High-prestige tasks pay more, but failure costs more prestige
3. **Maintenance vs. Growth**: Should the agent keep working in mastered domains (maintenance) or push new ones (growth)?
4. **Accept vs. Defer**: Taking a task you might fail risks prestige loss; waiting risks decay

These tensions make the benchmark more than just "do tasks fast": it tests genuine strategic reasoning.
system_design/05_financial_model.md (new file, +162 lines)
# Financial Model

**Location**: `src/yc_bench/db/models/ledger.py`, `src/yc_bench/cli/finance_commands.py`, `src/yc_bench/cli/report_commands.py`, `src/yc_bench/core/handlers/`

## Overview

The financial model simulates a startup's cash flow: revenue from completed tasks, costs from employee payroll, and penalties for failures. Running out of money triggers bankruptcy and ends the simulation.
## Design Choices
|
||||
|
||||
### Cents-Based Integer Arithmetic
|
||||
|
||||
All financial values are stored as `BigInteger` in cents:
|
||||
|
||||
```
|
||||
$1,000.00 = 100_000 cents
|
||||
```
|
||||
|
||||
**Why cents?** Floating-point arithmetic introduces rounding errors that compound over hundreds of transactions. Integer cents guarantee exact financial accounting -- critical for a deterministic benchmark.
|
||||
|
||||
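A quick illustration of the drift this design avoids (a sketch; the amounts are made up):

```python
# Accumulate a $0.10 amount 1000 times, once as a float and once in cents.
float_balance = 0.0
cents_balance = 0
for _ in range(1000):
    float_balance += 0.10   # binary floats cannot represent 0.10 exactly
    cents_balance += 10     # exact integer arithmetic

# float_balance drifts away from 100.0; cents_balance / 100 is exactly 100.0
```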
### Immutable Append-Only Ledger

Every financial transaction creates a `LedgerEntry` that is never modified or deleted:

```python
class LedgerEntry:
    category: MONTHLY_PAYROLL | TASK_REWARD | TASK_FAIL_PENALTY | TASK_CANCEL_PENALTY
    amount_cents: int      # negative for costs, positive for revenue
    occurred_at: datetime
    ref_type: str          # optional reference to source entity
    ref_id: UUID           # optional reference ID
```

**Why immutable?** An append-only ledger provides:

- Complete audit trail for debugging
- Ability to reconstruct the balance at any point in time
- No risk of silent data corruption
- Natural fit for the `finance ledger` and `report monthly` CLI commands

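Point-in-time balance reconstruction falls directly out of the append-only design. A minimal sketch with a stand-in entry type (the real model is the `LedgerEntry` above):

```python
from collections import namedtuple
from datetime import datetime

Entry = namedtuple("Entry", ["amount_cents", "occurred_at"])  # stand-in for LedgerEntry

def balance_at(entries, as_of, opening_cents=0):
    """Reconstruct the balance by summing every entry up to a timestamp."""
    return opening_cents + sum(e.amount_cents for e in entries if e.occurred_at <= as_of)

ledger = [
    Entry(-3_000_000, datetime(2025, 1, 1)),   # January payroll
    Entry(+1_800_000, datetime(2025, 1, 20)),  # task reward
]
balance_at(ledger, datetime(2025, 1, 31), opening_cents=50_000_000)  # 48_800_000
```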
## Revenue Sources

### Task Rewards

On successful (on-time) completion:

```
reward = base_reward × (1 + prestige_scale × (avg_prestige - 1))
```

where `avg_prestige` is averaged across the task's required domains. Higher prestige = higher payouts.

**Design choice**: Prestige-scaled rewards create a positive feedback loop that mirrors real business dynamics -- reputation leads to better opportunities.

### Revenue Timing

Rewards are credited immediately upon task completion (when the `task_completed` event fires with `success=True`).

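A worked example of the reward formula, with a hypothetical `prestige_scale` of 0.2 (the preset value is not shown here):

```python
def scaled_reward(base_reward_cents: int, prestige_scale: float,
                  domain_prestige: list[float]) -> int:
    """reward = base_reward × (1 + prestige_scale × (avg_prestige - 1))"""
    avg_prestige = sum(domain_prestige) / len(domain_prestige)
    return round(base_reward_cents * (1 + prestige_scale * (avg_prestige - 1)))

# Task requiring research + training, prestige 4.0 and 6.0, $10,000 base reward:
scaled_reward(1_000_000, 0.2, [4.0, 6.0])  # 1_800_000 cents = $18,000
```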
## Cost Sources

### Monthly Payroll

Payroll is deducted on the **first business day** of each month:

```
total_payroll = sum(employee.salary_cents for all employees)
```

**Design choice**: Monthly payroll creates predictable but unavoidable costs. The agent must maintain positive cash flow to cover it.

### Salary Bumps

Each completed task increases salaries:

```
for each assigned employee:
    salary_cents *= 1.01   # 1% increase per completion
```

**Design choice**: Compounding salary increases mean success has a hidden cost. Long-running simulations see payroll grow substantially, creating late-game financial pressure even as task rewards scale with prestige.

### Failure Penalties

Late task completion incurs no direct financial penalty beyond the missed reward opportunity. However, the prestige loss from failure reduces future reward scaling.

### Cancel Penalties

Cancellation may incur a financial penalty depending on configuration (some presets charge a fraction of the reward).

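The 1% bump compounds geometrically. A sketch of the projection (the starting salary is hypothetical):

```python
salary_cents = 800_000  # hypothetical $8,000/mo starting salary
projections = {n: salary_cents * 1.01 ** n for n in (10, 50, 100)}
for n, cents in projections.items():
    print(f"after {n} completions: ${cents / 100:,.0f}/mo")
# 100 completions multiply the salary by 1.01**100 ≈ 2.7x
```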
## Payroll-Event Tie-Breaking

When payroll and events fall on the same timestamp:

```
Payroll is processed BEFORE events
```

**Design choice**: This ordering is critical. If a task completes on the same day as payroll:

1. Payroll deducts first (may push funds negative)
2. Task completion reward credits (may save from bankruptcy)
3. Bankruptcy check happens after both

This gives the agent the benefit of the doubt -- a task completing on payday can save the company.

## Bankruptcy

Bankruptcy triggers when `funds_cents < 0` after payroll processing:

```python
if company.funds_cents < 0:
    insert_bankruptcy_event(session, company_id, sim_time)
```

**Design choice**: Bankruptcy is checked only after payroll (not after penalties). This simplifies the model and makes payroll the primary survival constraint.

### Bankruptcy as Terminal State

Once bankruptcy fires, the simulation ends. There is no recovery mechanic.

**Why no bailout?** The benchmark tests whether the agent can sustainably manage a business. Allowing recovery would dilute this signal.

## Financial Reports

### Ledger Query (`finance ledger`)

The agent can query the full transaction history with filters:

- Category filter
- Date-range filter
- Pagination

### Monthly P&L (`report monthly`)

Aggregates transactions by month:

```
Month     Revenue    Payroll    Penalties    Net
2025-01   $50,000    $30,000    $0           $20,000
2025-02   $35,000    $30,300    $5,000       -$300
```

**Design choice**: Structured financial reporting gives the agent the data it needs to make informed decisions about task selection and resource allocation.

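The monthly aggregation can be sketched as a fold over ledger entries (stand-in entry type; the real report presumably aggregates in SQL):

```python
from collections import defaultdict, namedtuple
from datetime import datetime

Entry = namedtuple("Entry", ["amount_cents", "occurred_at"])  # stand-in for LedgerEntry

def monthly_net(entries):
    """Net cash flow per calendar month, summed from ledger-style entries."""
    totals = defaultdict(int)
    for e in entries:
        totals[e.occurred_at.strftime("%Y-%m")] += e.amount_cents
    return dict(totals)

net = monthly_net([
    Entry(5_000_000, datetime(2025, 1, 10)),   # task reward
    Entry(-3_000_000, datetime(2025, 1, 1)),   # January payroll
    Entry(-3_030_000, datetime(2025, 2, 1)),   # February payroll (after bumps)
])
# net == {"2025-01": 2_000_000, "2025-02": -3_030_000}
```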
## Runway Calculation

The `company status` command includes a runway estimate:

```
runway_months = funds_cents / monthly_payroll_cents
```

This helps the agent gauge urgency. Low runway signals that the agent needs profitable tasks quickly.

## Difficulty Scaling

Financial pressure scales with difficulty preset:

| Preset | Initial Funds | Payroll Pressure | Penalties |
|--------|---------------|------------------|-----------|
| tutorial | Very high | Low | Minimal |
| easy | High | Moderate | Low |
| medium | Moderate | Moderate | Standard |
| hard | Low | High | 1.5x |
| nightmare | Very low | Very high | 2x |

143
system_design/06_employee_model.md
Normal file

@@ -0,0 +1,143 @@

# Employee Model

**Location**: `src/yc_bench/db/models/employee.py`, `src/yc_bench/services/generate_employees.py`, `src/yc_bench/core/progress.py`

## Overview

Employees are the company's productive resources. Each has a tier, a salary, and hidden per-domain skill rates. The agent must figure out who is good at what through observation and assign them optimally.

## Design Choices

### Hidden Skill Rates (Information Asymmetry)

The agent sees:

- Employee name, tier (junior/mid/senior), salary
- Which tasks they're currently assigned to

The agent does NOT see:

- Per-domain skill rates (`rate_domain_per_hour`)
- Actual work output per hour

**Why hidden?** This is a core benchmark design decision:

1. **Tests inference ability**: The agent must infer strengths from task completion patterns
2. **Mirrors reality**: Real managers don't have exact productivity metrics for every skill dimension
3. **Creates learning opportunity**: Early task assignments serve as "probes" to discover team capabilities
4. **Rewards memory**: Agents that remember past performance can make better future assignments

### Tier System

| Tier | Typical Rate Range | Salary Range |
|------|--------------------|--------------|
| junior | Low | Low |
| mid | Medium | Medium |
| senior | High | High |

**Design choice**: Tiers provide a rough signal. Seniors are generally better, but not always in every domain. A junior might excel in one domain while a senior is mediocre there. The tier-salary correlation creates a cost-benefit trade-off.

### Per-Domain Skill Rates

Each employee has 4 skill rates (one per domain):

```python
class EmployeeSkillRate:
    domain: str                   # research, inference, data_environment, training
    rate_domain_per_hour: float   # work units produced per business hour
```

Rates are generated from configurable distributions (triangular, beta, etc.) during world seeding. Some employees are specialists (high in one domain, low in others); some are generalists.

**Design choice**: The 4-rate vector per employee creates a rich assignment optimization space. Optimal assignment requires matching employee strengths to task domain requirements.

## Throughput Splitting

When an employee works on multiple active tasks simultaneously:

```
effective_rate = base_rate / num_active_tasks
```

**Design choice**: Linear splitting (not diminishing returns or context-switching penalties) was chosen for simplicity and predictability. The agent can reason about it without hidden costs.

### Example

Employee Alice has `research_rate = 2.0/hr`:

- Assigned to 1 task: contributes 2.0 research units/hour
- Assigned to 3 tasks: contributes ~0.67 research units/hour to each

### Implication for Strategy

The agent faces a fundamental trade-off:

- **Focused assignment**: 1 employee → 1 task = fastest completion but no parallelism
- **Spread assignment**: 1 employee → N tasks = slower per task but progress on multiple fronts
- **Optimal**: Match the strategy to deadline pressure and task urgency

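Linear splitting makes per-task ETAs easy to project. A minimal sketch (function names are illustrative):

```python
def effective_rate(base_rate: float, num_active_tasks: int) -> float:
    """Linear throughput split across simultaneous active tasks."""
    return base_rate / max(num_active_tasks, 1)

def eta_hours(remaining_units: float, base_rate: float, num_active_tasks: int) -> float:
    """Business hours until the remaining work units are done at the split rate."""
    return remaining_units / effective_rate(base_rate, num_active_tasks)

# Alice (2.0 research units/hr) split across 3 tasks, 40 units remaining on one:
eta_hours(40, 2.0, 3)  # ~60 business hours
```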
## Skill Growth

On successful task completion, assigned employees get a skill boost:

```python
for each assigned employee:
    for each domain in task.requirements:
        skill_rate[domain] *= (1 + task.skill_boost_pct / 100)
```

**Design choice**: Skill growth compounds over time. Early investments in employee development pay off later through faster task completion. This creates a "training vs. exploiting" tension.

### Salary Bumps (Hidden Cost of Growth)

Each task completion also increases salaries:

```python
for each assigned employee:
    salary_cents *= 1.01   # 1% increase
```

**Design choice**: Salary bumps mean that experienced employees cost more. The agent can't infinitely scale employee productivity without also scaling costs. After many completions, payroll may become a significant burden.

## Employee Generation (`generate_employees.py`)

### Process

1. Generate 10 employees per company (configurable)
2. Assign tiers based on the configured distribution (e.g., 30% junior, 40% mid, 30% senior)
3. For each employee, generate 4 skill rates from per-tier distributions
4. Set salary based on the tier bracket

### Distribution Types

Skill rates are drawn from configurable distributions:

- **Triangular**: min/mode/max (default -- creates realistic bell-curve-like distributions)
- **Beta**: alpha/beta parameters (useful for skewed distributions)
- **Normal**: mean/std (truncated to positive values)
- **Uniform**: low/high
- **Constant**: fixed value

**Design choice**: Configurable distributions allow difficulty presets to create different workforce profiles. Tutorial mode might use tight distributions (predictable employees), while nightmare mode uses wide distributions (unpredictable).

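A sketch of per-tier rate sampling using Python's stdlib `random.triangular` (the tier parameter triples below are made-up illustrations, not the preset's values):

```python
import random

rng = random.Random(42)  # seeded for determinism

DOMAINS = ("research", "inference", "data_environment", "training")

def sample_rates(tier: str) -> dict[str, float]:
    """Sample a 4-rate skill vector from hypothetical per-tier triangular ranges."""
    low, mode, high = {
        "junior": (0.5, 1.0, 1.5),
        "mid": (1.0, 1.8, 2.5),
        "senior": (1.5, 2.5, 4.0),
    }[tier]
    # random.triangular takes (low, high, mode)
    return {d: rng.triangular(low, high, mode) for d in DOMAINS}

rates = sample_rates("senior")  # e.g. a specialist-or-generalist 4-rate vector
```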
## Employee Visibility to Agent

The `employee list` CLI command returns:

```json
{
  "employees": [
    {
      "id": "uuid",
      "name": "Alice Chen",
      "tier": "senior",
      "salary": "$8,000/mo",
      "active_tasks": 2
    }
  ]
}
```

Note: no skill rates, no per-domain breakdown, no historical performance. The agent must build this knowledge through experience.

## Strategic Considerations

1. **Discovery phase**: Early on, assign different employees to different domain tasks to learn strengths
2. **Specialization**: Once strengths are known, match employees to their best domains
3. **Load balancing**: Avoid overloading one employee (throughput-splitting penalty)
4. **Growth investment**: Assign employees to tasks in domains where they need improvement
5. **Cost awareness**: Track which employees have had many salary bumps

243
system_design/07_agent_layer.md
Normal file

@@ -0,0 +1,243 @@

# Agent Layer

**Location**: `src/yc_bench/agent/`

## Overview

The agent layer connects an LLM to the simulation via a tool-use interface. It manages the conversation loop, prompt construction, tool execution, and run-state tracking.

## Architecture

```
┌─────────────────────────┐
│       Agent Loop        │
│        (loop.py)        │
├─────────────────────────┤
│ ┌──────────┐ ┌────────┐ │
│ │  Prompt  │ │ Tools  │ │
│ │ Builder  │ │        │ │
│ └──────────┘ └────────┘ │
├─────────────────────────┤
│       LLM Runtime       │
│       (runtime/)        │
│  LiteLLM abstraction    │
├─────────────────────────┤
│ Run State / Transcript  │
│     (run_state.py)      │
└─────────────────────────┘
```

## Design Choices

### LiteLLM as LLM Abstraction (`runtime/`)

The agent uses [LiteLLM](https://github.com/BerriAI/litellm) to abstract away vendor differences:

```python
# Supports: Anthropic, OpenAI, OpenRouter, Google Gemini, etc.
response = litellm.completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=messages,
    tools=tools,
)
```

**Why LiteLLM?**

- Single interface for all major LLM providers
- Consistent tool-use format across providers
- Easy to benchmark different models on the same scenarios
- Handles auth, retries, and format conversion

### Tool-Use Interface (Not Text Parsing)

The agent interacts via structured tool calls, not text command parsing:

```json
{
  "name": "run_command",
  "arguments": {
    "command": "yc-bench task list --status active"
  }
}
```

**Why tool-use?**

- Eliminates parsing ambiguity
- Works with all modern LLMs' native tool-use
- Structured output from CLI commands (JSON) flows cleanly back
- Reduces the error rate vs. free-text command generation

### Available Tools

#### `run_command`

Executes CLI commands in a subprocess. The agent can run any `yc-bench` CLI command.

```python
def run_command(command: str) -> str:
    """Execute a yc-bench CLI command and return output."""
```

**Design choice**: Subprocess execution provides isolation. The agent can't accidentally modify simulation state outside of the defined CLI commands.

#### `python_repl` (Optional)

A persistent Python interpreter for calculations and data analysis.

```python
def python_repl(code: str) -> str:
    """Execute Python code and return output."""
```

**Design choice**: Some agents benefit from being able to compute (e.g., calculate optimal assignments, project cash flow). This tool is optional and configurable.

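A minimal sketch of how `run_command` might wrap subprocess execution (the allowlist check and timeout are assumptions, not confirmed implementation details):

```python
import shlex
import subprocess

def run_command(command: str, timeout_s: int = 60) -> str:
    """Execute a yc-bench CLI command in an isolated subprocess."""
    argv = shlex.split(command)
    # Hypothetical allowlist: only the yc-bench CLI entry point is permitted.
    if not argv or argv[0] != "yc-bench":
        return '{"error": "only yc-bench commands are allowed"}'
    result = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout_s
    )
    return result.stdout if result.returncode == 0 else result.stderr
```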
## Agent Loop (`loop.py`)

### Main Loop

```python
def run_agent_loop(runtime, session, company_id, cfg):
    while not terminal:
        # Build messages (system prompt + history)
        messages = build_messages(history, context)

        # Call LLM
        response = runtime.completion(messages, tools)

        # Process tool calls
        for tool_call in response.tool_calls:
            result = execute_tool(tool_call)
            history.append(tool_call, result)

            # Check for terminal conditions
            if is_terminal(result):
                break

        # Auto-resume if the agent hasn't advanced the simulation
        if turns_since_resume > max_turns_without_resume:
            force_resume()
```

### Design Choices in the Loop

#### History Truncation

```python
# Keep only the last N turns to fit the context window
messages = system_prompt + history[-max_history_turns:]
```

**Why truncate?** Long simulations generate hundreds of turns. Without truncation, the context would exceed any model's window. The scratchpad CLI command compensates for lost history.

#### Auto-Resume Forcing

If the agent doesn't call `yc-bench sim resume` for N turns, the loop forces one:

```python
if turns_since_resume > cfg.loop.max_turns_without_resume:
    result = execute("yc-bench sim resume")
```

**Why force?** Some models get stuck in analysis loops, repeatedly querying state without advancing. Auto-resume prevents infinite loops and ensures forward progress.

#### Turn Budget

The loop has a maximum turn count. This prevents runaway agents and bounds benchmark cost.

## Prompt Construction (`prompt.py`)

### System Prompt Structure

```
1. Role description ("You are the CEO of an AI startup...")
2. Available commands reference
3. Current company status summary
4. Strategic guidance (domain, prestige, deadlines)
5. Constraints and rules
```

**Design choice**: The system prompt provides enough context for the agent to understand its role without revealing internal mechanics (like hidden skill rates or exact formulas).

### Context Building

Each turn, the prompt may include:

- Wake events from the last `sim resume`
- Current funds and runway
- Active task count and approaching deadlines
- Prestige levels

This contextual information helps the agent make informed decisions without needing to query every turn.

## Run State (`run_state.py`)

### Transcript Recording

Every turn is recorded:

```python
{
    "turn": 42,
    "messages": [...],
    "tool_calls": [...],
    "tool_results": [...],
    "timestamp": "2025-03-15T10:30:00",
    "tokens_used": 1500
}
```

**Design choice**: Full transcripts enable:

- Post-hoc analysis of agent strategy
- Debugging agent failures
- Benchmark scoring based on decision quality
- Comparison across models

### Output Format

The final rollout is saved as JSON:

```json
{
  "model": "anthropic/claude-sonnet-4-20250514",
  "seed": 42,
  "config": "medium",
  "outcome": "horizon_end",
  "final_funds": 250000,
  "final_prestige": {"research": 7.2, ...},
  "turns": 187,
  "transcript": [...]
}
```

## Command Execution Policy (`commands/`)

### Command Allowlist

The agent can only execute `yc-bench` CLI commands. Arbitrary shell commands are blocked.

**Design choice**: Restricting to the CLI API ensures:

- No direct database manipulation
- No simulation-state bypass
- Fair comparison across models
- Deterministic state transitions

### Error Handling

Invalid commands return structured error messages:

```json
{"error": "Task not found", "task_id": "..."}
```

**Design choice**: Structured errors help the agent understand and recover from mistakes, rather than receiving opaque stack traces.

## Retry and Timeout Logic

```python
# Exponential backoff for LLM API calls
for attempt in range(max_retries):
    try:
        response = runtime.completion(messages, tools)
        break
    except RateLimitError:
        wait(2 ** attempt)
```

**Design choice**: LLM APIs are unreliable. Retry logic ensures transient failures don't corrupt benchmark runs.

173
system_design/08_cli_interface.md
Normal file

@@ -0,0 +1,173 @@

# CLI Interface

**Location**: `src/yc_bench/cli/`

## Overview

The CLI is the agent's sole interface to the simulation. Every command returns structured JSON, enabling reliable parsing by LLMs.

## Design Choices

### JSON-Only Output

All CLI commands return JSON, never free text:

```bash
$ yc-bench company status
{
  "company_name": "Nexus AI",
  "funds": "$150,000.00",
  "funds_cents": 15000000,
  "monthly_payroll": "$30,000.00",
  "runway_months": 5.0,
  "prestige": {
    "research": 3.5,
    "inference": 2.1,
    "data_environment": 1.0,
    "training": 4.2
  }
}
```

**Why JSON?**

- Unambiguous parsing by LLMs (vs. formatted tables)
- Consistent structure across all commands
- Easy to pipe into `python_repl` for analysis
- Machine-readable without regex or text parsing

### Command Group Organization

| Group | File | Purpose |
|-------|------|---------|
| `company` | `company_commands.py` | Company status, prestige overview |
| `employee` | `employee_commands.py` | Employee listing and details |
| `market` | `market_commands.py` | Browse available tasks |
| `task` | `task_commands.py` | Task lifecycle (accept/assign/dispatch/cancel/inspect/list) |
| `sim` | `sim_commands.py` | Simulation control (resume) |
| `finance` | `finance_commands.py` | Ledger queries |
| `report` | `report_commands.py` | Monthly P&L reports |
| `scratchpad` | `scratchpad_commands.py` | Persistent agent memory |

**Design choice**: Command groups mirror real business functions (operations, HR, finance, strategy). This makes the interface intuitive for LLM agents that have been trained on business concepts.

## Command Details

### Company Commands

#### `company status`

Returns current funds, payroll, runway, and prestige levels per domain.

**Design choice**: A single command gives the agent a complete financial and strategic snapshot, reducing the number of calls needed per decision cycle.

### Employee Commands

#### `employee list`

Returns all employees with tier, salary, and current active task count.

**Design choice**: Shows active task count but NOT skill rates. The agent must infer capabilities.

### Market Commands

#### `market browse [--domain X] [--min-prestige N] [--max-prestige N] [--offset O] [--limit L]`

Browse available market tasks with optional filters.

**Design choice**: Filtering and pagination prevent information overload. The agent can focus on tasks matching its current prestige level and strategic goals.

### Task Commands

#### `task accept <task_id>`

Accept a market task. Validates prestige requirements. Sets the deadline.

#### `task assign <task_id> <employee_id>`

Assign an employee to a planned/active task. Recalculates ETAs.

#### `task dispatch <task_id>`

Start work on a planned task. Changes its status to active.

#### `task cancel <task_id>`

Cancel a task. Applies the prestige penalty. Frees employees.

#### `task inspect <task_id>`

Detailed view of a single task: requirements, progress, assignments, deadline.

#### `task list [--status X]`

List company tasks with an optional status filter.

**Design choice**: The accept → assign → dispatch flow gives the agent explicit control over each phase. This mirrors real project management, where you scope, staff, and then kick off work.

### Simulation Commands

#### `sim resume`

Advance the simulation to the next event. Returns wake events.

```json
{
  "advanced_to": "2025-02-15T09:00:00",
  "wake_events": [
    {"type": "task_completed", "task_id": "...", "success": true},
    {"type": "payroll", "amount": -3000000}
  ]
}
```

**Design choice**: Resume is the only way to advance time. The agent explicitly chooses when to move forward, creating natural decision checkpoints.

### Finance Commands

#### `finance ledger [--category X] [--from DATE] [--to DATE] [--offset O] [--limit L]`

Query the immutable transaction history.

**Design choice**: Full ledger access lets sophisticated agents analyze spending patterns and project future cash flow.

### Report Commands

#### `report monthly`

Aggregated P&L by month.

**Design choice**: Monthly reports provide a higher-level financial view than raw ledger entries, useful for strategic planning.

### Scratchpad Commands

#### `scratchpad read`

Read persistent notes.

#### `scratchpad write <content>`

Overwrite the scratchpad contents.

#### `scratchpad append <content>`

Add to the existing scratchpad.

#### `scratchpad clear`

Clear the scratchpad.

**Design choice**: The scratchpad is critical for long simulations where LLM context gets truncated. The agent can store:

- Employee capability observations
- Strategic plans
- Financial projections
- Task priority lists

This compensates for context-window limitations and tests whether the agent proactively maintains external memory.

## Error Handling

All commands return structured errors:

```json
{
  "error": "Insufficient prestige in research (have 2.3, need 4.0)"
}
```

**Design choice**: Descriptive error messages help the agent understand what went wrong and adjust its strategy, rather than failing silently or with cryptic messages.

## CLI Entry Point (`__main__.py`)

The CLI uses a command-line parser (likely Click or argparse) to route commands to handler functions. Each handler:

1. Opens a database session
2. Validates inputs
3. Performs the operation
4. Returns JSON output
5. Commits or rolls back the transaction

**Design choice**: Each CLI call is a self-contained transaction. This prevents partial state updates and ensures the simulation remains consistent.

203
system_design/09_configuration.md
Normal file

@@ -0,0 +1,203 @@

# Configuration System

**Location**: `src/yc_bench/config/`

## Overview

The configuration system uses Pydantic models validated from TOML preset files. It controls every aspect of the simulation: world-generation parameters, difficulty tuning, agent behavior, and distribution specifications.

## Design Choices

### Pydantic Schema (`schema.py`)

The configuration hierarchy:

```
ExperimentConfig
├── AgentConfig      # LLM model, tools, retry settings
├── LoopConfig       # Turn budget, auto-resume threshold
├── SimConfig        # Simulation parameters
└── WorldConfig      # World generation parameters
    ├── CompanyConfig    # Initial funds, starting prestige
    ├── EmployeeConfig   # Team size, tier distribution, salary ranges
    ├── TaskConfig       # Task count, domain requirements, deadlines
    └── PrestigeConfig   # Decay rate, penalty multipliers, scaling
```

**Why Pydantic?**

- Type validation at load time (catches config errors early)
- Default values with optional overrides
- Discriminated unions for distribution specs
- Clear documentation through type annotations
- Serialization to/from TOML/JSON

### TOML Preset Files (`presets/`)

```toml
# medium.toml
[world]
initial_funds_cents = 500_000_00

[world.prestige]
decay_per_day = 0.005
penalty_fail_multiplier = 0.8
penalty_cancel_multiplier = 1.0

[world.tasks]
count = 200
deadline_qty_per_day = 11.0

[world.tasks.reward_funds]
type = "triangular"
min = 5000_00
mode = 15000_00
max = 50000_00
```

**Why TOML?** Human-readable, supports comments, expresses hierarchy naturally via sections, and is widely supported in Python. Better than JSON for config files (comments), simpler than YAML (fewer gotchas).

### Preset Hierarchy

| Preset | Focus | Key Characteristics |
|--------|-------|---------------------|
| `default.toml` | Base | All defaults; other presets override selectively |
| `tutorial.toml` | Learning | Relaxed deadlines, prestige-1 tasks only, high funds |
| `easy.toml` | Casual | Relaxed deadlines, flat prestige requirements |
| `medium.toml` | Standard | Prestige climbing, 2-domain tasks, 9-day deadlines |
| `hard.toml` | Challenge | Prestige gating active, 7-day deadlines, 1.5x cancel penalty |
| `nightmare.toml` | Extreme | Razor-thin margins, 6-day deadlines, 2x penalties |

**Design choice**: Preset-based difficulty rather than a single "difficulty slider" allows fine-grained control. Each preset can tune dozens of independent parameters.

### Config Loading (`loader.py`)

```python
def load_config(preset_name: str) -> ExperimentConfig:
    base = load_toml("default.toml")
    overlay = load_toml(f"{preset_name}.toml")
    merged = deep_merge(base, overlay)
    return ExperimentConfig(**merged)
```

**Design choice**: Config inheritance via deep merge. Presets only specify what differs from the default, keeping preset files concise and maintainable.

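A minimal sketch of a `deep_merge` helper like the one referenced above (the actual implementation is not shown here):

```python
def deep_merge(base: dict, overlay: dict) -> dict:
    """Recursively merge overlay into base; overlay values win on conflict."""
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# A preset only overrides the deadline; the task count is inherited from default.
cfg = deep_merge(
    {"world": {"tasks": {"count": 200, "deadline_qty_per_day": 9.0}}},
    {"world": {"tasks": {"deadline_qty_per_day": 7.0}}},
)
# cfg["world"]["tasks"] == {"count": 200, "deadline_qty_per_day": 7.0}
```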
## Distribution Specifications (`sampling.py`)
|
||||
|
||||
### The DistSpec System
|
||||
|
||||
Many world generation parameters use statistical distributions rather than fixed values:
|
||||
|
||||
```python
|
||||
class DistSpec(BaseModel):
|
||||
"""Discriminated union of distribution types."""
|
||||
type: Literal["triangular", "beta", "normal", "uniform", "constant"]
|
||||
# Parameters vary by type
|
||||
```
|
||||
|
||||
**Supported distributions:**
|
||||
|
||||
| Type | Parameters | Use Case |
|
||||
|------|-----------|----------|
|
||||
| `triangular` | min, mode, max | Task rewards, skill rates (natural asymmetric bell curve) |
|
||||
| `beta` | alpha, beta, scale | Prestige requirements (skewed toward low values) |
|
||||
| `normal` | mean, std | Symmetric variation around a target |
|
||||
| `uniform` | low, high | Equal probability across range |
|
||||
| `constant` | value | Fixed value (no randomness) |
|
||||
|
||||
**Why discriminated unions?** Pydantic validates the correct parameters for each distribution type at load time. Invalid combinations (e.g., triangular with alpha parameter) are caught before the simulation runs.

### Usage Example

```toml
[world.tasks.reward_funds]
type = "triangular"
min = 5000_00
mode = 15000_00
max = 50000_00

[world.employees.junior_rate]
type = "beta"
alpha = 2.0
beta = 5.0
scale = 3.0
```

## World Generation

### Seeding (`services/seed_world.py`)

```python
def seed_world_transactional(session, cfg, seed):
    rng = create_rng(seed)
    company = create_company(session, cfg.world.company)
    employees = generate_employees(session, company, cfg.world.employees, rng)
    tasks = generate_tasks(session, cfg.world.tasks, rng)
    sim_state = create_sim_state(session, company, cfg.sim, seed)
```

**Design choice**: Single-transaction world seeding ensures atomic creation. Either the entire world is created or nothing is -- no partial states.
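The atomicity comes from a commit-or-rollback session scope around the whole seeding call. The project wraps SQLAlchemy sessions; this sketch shows the same pattern with stdlib `sqlite3` (names and details are assumptions):

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def session_scope(db_path: str):
    """Commit on success, roll back on any error -- no partial worlds."""
    conn = sqlite3.connect(db_path)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()

# Usage: if any generator inside the block raises, no rows are persisted.
with session_scope(":memory:") as conn:
    conn.execute("CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO company (name) VALUES ('yc-bench')")
```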

### Employee Generation (`services/generate_employees.py`)

1. Generate N employees (default 10)
2. Assign tiers from configured distribution (e.g., 30/40/30 junior/mid/senior)
3. For each employee, sample 4 skill rates from per-tier distributions
4. Set salary based on tier range

### Task Generation (`services/generate_tasks.py`)

1. Generate M tasks (default 200+)
2. First 10 tasks are always prestige-1 (guaranteed accessible)
3. Remaining tasks have stratified prestige requirements
4. Each task gets 2-4 domain requirements sampled from distributions
5. Rewards scale with prestige and task size

**Design choice**: Stratified generation ensures:

- The agent always has starting tasks (prestige-1 guaranteed)
- Tasks span the full prestige range (progression is possible)
- No prestige "dead zones" where no tasks exist

### RNG Management (`services/rng.py`)

```python
import numpy

def create_rng(seed: int) -> numpy.random.Generator:
    return numpy.random.default_rng(seed)
```

**Design choice**: Centralized RNG with explicit seed ensures full reproducibility. Same seed → same world → same event sequence (given same agent actions).
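The reproducibility claim is directly checkable: two generators built from the same seed yield identical draws.

```python
import numpy

# Same seed, two independent generators: the streams match element-for-element.
a = numpy.random.default_rng(42).integers(0, 100, size=5)
b = numpy.random.default_rng(42).integers(0, 100, size=5)
assert (a == b).all()
```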

## Key Configuration Parameters

### Financial Tuning

| Parameter | Default | Effect |
|-----------|---------|--------|
| `initial_funds_cents` | 500,000 | Starting capital |
| `reward_prestige_scale` | 0.15 | How much prestige amplifies rewards |
| `salary_bump_pct` | 1.0 | Per-completion salary increase |

### Prestige Tuning

| Parameter | Default | Effect |
|-----------|---------|--------|
| `prestige_decay_per_day` | 0.005 | Daily prestige loss |
| `penalty_fail_multiplier` | 0.8 | Prestige cost of late completion |
| `penalty_cancel_multiplier` | 1.0 | Prestige cost of cancellation |
| `prestige_min` | 1.0 | Floor value |
| `prestige_max` | 10.0 | Ceiling value |

### Task Tuning

| Parameter | Default | Effect |
|-----------|---------|--------|
| `deadline_qty_per_day` | 11.0 | Deadline generosity |
| `num_domains_per_task` | 2-4 | Multi-domain complexity |
| `progress_milestone_pct` | 50 | When to fire halfway event |

### Agent Tuning

| Parameter | Default | Effect |
|-----------|---------|--------|
| `max_turns` | 500 | Hard turn limit |
| `max_turns_without_resume` | 5 | Auto-resume threshold |
| `history_truncation` | 50 | Turns kept in context |
232
system_design/10_runner_orchestration.md
Normal file
# Runner & Orchestration

**Location**: `src/yc_bench/runner/`

## Overview

The runner is the top-level orchestration layer that ties everything together: parsing arguments, loading configuration, initializing the database, seeding the world, starting the agent loop, and collecting results.

## Components

### Entry Point (`main.py`)

```python
def run_benchmark(args):
    # 1. Load configuration
    cfg = load_config(args.config)

    # 2. Initialize database
    engine, factory = init_db(db_path)

    # 3. Seed world
    with session_scope(factory) as session:
        seed_world_transactional(session, cfg, args.seed)

    # 4. Build agent runtime
    runtime = build_runtime(cfg.agent, args.model)

    # 5. Start dashboard (if TTY)
    dashboard = Dashboard(cfg) if is_tty() else None

    # 6. Run agent loop
    result = run_agent_loop(runtime, factory, cfg, dashboard)

    # 7. Save results
    save_rollout(result, args.output)
```

### Design Choices

#### Single-Command Invocation

```bash
uv run yc-bench run --model gemini/gemini-3-flash --seed 1 --config medium
```

**Why single command?** Benchmarks should be easy to reproduce. One command with explicit parameters (model, seed, config) fully specifies a run.

#### Database Per Run

Each run creates a fresh SQLite database:

```
db/run_seed1_medium_2025-03-15.sqlite
```

**Why per-run databases?**

- Isolation: runs can't interfere with each other
- Inspection: can analyze any run's final state after the fact
- Reproducibility: re-running with same seed produces identical database
- Parallelism: multiple runs can execute simultaneously

## Argument Parsing (`args.py`)

### Key Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| `--model` | Yes | LLM model identifier (LiteLLM format) |
| `--seed` | Yes | Random seed for world generation |
| `--config` | No | Difficulty preset (default: "medium") |
| `--output` | No | Output path for rollout JSON |
| `--no-dashboard` | No | Disable live terminal UI |
| `--max-turns` | No | Override turn limit |

**Design choice**: Required arguments are minimal (model + seed). Everything else has sensible defaults. This reduces barrier to running benchmarks while allowing full customization.
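A minimal `argparse` sketch consistent with the table above (flag names come from the table; defaults and help strings beyond it are assumptions):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="yc-bench run")
    p.add_argument("--model", required=True,
                   help="LLM model identifier (LiteLLM format)")
    p.add_argument("--seed", required=True, type=int,
                   help="Random seed for world generation")
    p.add_argument("--config", default="medium", help="Difficulty preset")
    p.add_argument("--output", default=None, help="Output path for rollout JSON")
    p.add_argument("--no-dashboard", action="store_true",
                   help="Disable live terminal UI")
    p.add_argument("--max-turns", type=int, default=None,
                   help="Override turn limit")
    return p

args = build_parser().parse_args(["--model", "openai/gpt-4o", "--seed", "42"])
assert args.config == "medium" and args.seed == 42
```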

## Dashboard (`dashboard.py`)

### Live Terminal UI

The dashboard uses [Rich](https://github.com/Textualize/rich) to display real-time simulation state:

```
┌─ YC-Bench Dashboard ──────────────────────────────┐
│ Model: claude-sonnet-4  Seed: 42  Config: medium  │
│ Turn: 87/500            Sim Time: 2025-06-15      │
├───────────────────────────────────────────────────┤
│ Funds: $125,340         Runway: 4.2 months        │
│ Prestige: R:5.2  I:3.8  D:2.1  T:6.4              │
│ Active Tasks: 3   Completed: 12   Failed: 1       │
├───────────────────────────────────────────────────┤
│ Last Action: task assign abc123 emp456            │
│ Last Event: task_completed (success)              │
└───────────────────────────────────────────────────┘
```

**Design choice**: The dashboard is for human observers, not the agent. It provides real-time visibility into benchmark runs without affecting agent behavior.

### Features

- Live fund tracking with trend indicators
- Prestige levels per domain
- Task status counters
- Recent agent actions
- Turn counter and simulation clock
- Auto-refreshes on each turn

### Conditional Activation

The dashboard only activates when running in a TTY (interactive terminal). Redirected output or CI environments get plain log output.

**Why conditional?** Batch runs (scripts/) shouldn't have terminal UI overhead. Detecting TTY ensures the right output mode automatically.
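The detection itself is one line of stdlib; a sketch of what `is_tty()` might look like (the `CI` environment check is an assumption, not confirmed from the source):

```python
import os
import sys

def is_tty() -> bool:
    """Use the live dashboard only in an interactive terminal."""
    return sys.stdout.isatty() and os.environ.get("CI") is None
```

Piping output (`yc-bench run ... > log.txt`) makes `isatty()` return False, so batch scripts fall back to plain logs with no extra flags.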

## Session Management (`session.py`)

### Run Session

Manages the lifecycle of a single benchmark run:

```python
class RunSession:
    db_path: str
    config: ExperimentConfig
    model: str
    seed: int
    start_time: datetime

    def save_rollout(self, result):
        """Save final rollout JSON to results/"""

    def cleanup(self):
        """Clean up temporary resources"""
```

**Design choice**: Session object encapsulates all run-specific state, making it easy to serialize and manage runs.

## Batch Running (`scripts/`)

### Multi-Seed Runs

Scripts for running the same model across multiple seeds:

```bash
# Run seeds 1-10 with claude-sonnet on medium difficulty
for seed in $(seq 1 10); do
  uv run yc-bench run --model anthropic/claude-sonnet-4-20250514 --seed "$seed" --config medium
done
```

### Multi-Model Comparison

Scripts for comparing models on the same seeds:

```bash
for model in "anthropic/claude-sonnet-4-20250514" "openai/gpt-4o" "google/gemini-pro"; do
  uv run yc-bench run --model "$model" --seed 42 --config medium
done
```

**Design choice**: Simple shell scripts rather than a complex orchestration framework. This keeps the benchmark tooling minimal and transparent.

## Results & Output

### Rollout JSON

Each run produces a rollout file:

```
results/
├── claude-sonnet_seed1_medium.json
├── claude-sonnet_seed2_medium.json
├── gpt-4o_seed1_medium.json
└── ...
```

### Rollout Contents

```json
{
  "metadata": {
    "model": "anthropic/claude-sonnet-4-20250514",
    "seed": 1,
    "config": "medium",
    "start_time": "2025-03-15T10:00:00",
    "end_time": "2025-03-15T10:45:00"
  },
  "outcome": "horizon_end",
  "final_state": {
    "funds_cents": 25000000,
    "prestige": {"research": 7.2, "inference": 5.1, ...},
    "tasks_completed": 24,
    "tasks_failed": 3,
    "tasks_cancelled": 1,
    "turns_used": 187
  },
  "transcript": [
    {"turn": 1, "action": "company status", "result": {...}},
    ...
  ]
}
```
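Writing such a rollout is a straightforward serialization step; a hypothetical sketch of what `save_rollout` might do (path handling and `indent` are assumptions):

```python
import json
from pathlib import Path

def save_rollout(result: dict, output_path: str) -> Path:
    """Serialize the rollout dict to pretty-printed JSON on disk."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # default=str handles datetimes and other non-JSON-native values
    path.write_text(json.dumps(result, indent=2, default=str))
    return path
```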

### Plots (`plots/`)

Visualization scripts for comparing model performance:

- Funds over time
- Prestige progression per domain
- Task completion rates
- Comparison charts across models/seeds

**Design choice**: Separate plotting from the benchmark runner. Results are stored as data (JSON); visualization is a post-processing step.

## Error Recovery

### Crash Recovery

If a run crashes (LLM timeout, OOM, etc.):

- The SQLite database persists with the last consistent state
- Rollout JSON may be partial but includes the transcript up to the crash
- Re-running with the same seed starts fresh (no resume from crash)

**Design choice**: No crash recovery by design. Benchmark runs should be atomic -- either complete or re-run. This prevents partial results from contaminating comparisons.

### Graceful Shutdown

On SIGINT (Ctrl+C):

- The current turn completes
- A partial rollout is saved
- The database is committed
- The dashboard is cleaned up

**Design choice**: Graceful shutdown preserves whatever data exists, useful for debugging long runs that need to be interrupted.
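A common way to get "current turn completes" semantics is to trap SIGINT into a flag the loop checks between turns, rather than letting the default handler raise mid-turn. This sketch is an assumption about the mechanism, not the project's actual code:

```python
import signal

class GracefulShutdown:
    """Set a flag on SIGINT so the current turn can finish before exit."""

    def __init__(self):
        self.requested = False
        signal.signal(signal.SIGINT, self._handle)

    def _handle(self, signum, frame):
        # Don't raise here; just record the request. The agent loop
        # checks the flag at turn boundaries, then saves and commits.
        self.requested = True

# Hypothetical use in the agent loop:
# shutdown = GracefulShutdown()
# while turn < max_turns and not shutdown.requested:
#     run_turn(...)
# save_rollout(partial_result, output)   # partial rollout preserved
```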