mirror of https://github.com/collinear-ai/yc-bench.git
synced 2026-04-19 12:58:03 +00:00

Add system design documentation for yc-bench

Comprehensive documentation covering all major subsystems: simulation engine, data models, task system, prestige, finances, employees, agent layer, CLI interface, configuration, and runner.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

This commit is contained in:
parent b1cd7ebfb2
commit ecd3d9e415
11 changed files with 1858 additions and 0 deletions

98  system_design/00_overview.md  Normal file
@@ -0,0 +1,98 @@
# YC-Bench: System Overview

## What is YC-Bench?

YC-Bench is a **long-horizon deterministic benchmark for LLM agents**. It simulates an AI startup CEO managing a company over 1-3 years through a CLI-based interface against a SQLite-backed discrete-event simulation engine. The benchmark tests sustained decision-making over hundreds of turns through compounding financial, prestige, and deadline pressures.

## Core Premise

An LLM agent is dropped into the role of CEO of a small AI startup. It must:

- Browse and accept tasks from a marketplace
- Assign employees to tasks across 4 technical domains
- Manage cash flow (payroll, rewards, penalties)
- Build prestige in each domain to unlock higher-tier tasks
- Survive until the simulation horizon ends without going bankrupt

## Key Metrics

| Dimension | Details |
|-----------|---------|
| Codebase | ~4,975 lines of Python |
| Employees | 10 (hidden per-domain skill rates) |
| Market Tasks | 200+ (configurable) |
| Domains | 4: research, inference, data_environment, training |
| Prestige Range | 1.0 - 10.0 per domain |
| Difficulty Presets | tutorial, easy, medium, hard, nightmare |

## High-Level Architecture

```
┌─────────────────────────────────────────────────────┐
│                    Runner / CLI                     │
│  (argument parsing, dashboard, session management)  │
├─────────────────────────────────────────────────────┤
│                     Agent Layer                     │
│  (LLM runtime, agent loop, tools, prompt building)  │
├─────────────────────────────────────────────────────┤
│                CLI Command Interface                │
│  (company, employee, market, task, sim, finance,    │
│   report, scratchpad)                               │
├─────────────────────────────────────────────────────┤
│              Simulation Engine (core/)              │
│  (event processing, ETA solving, progress tracking, │
│   business time, prestige decay)                    │
├─────────────────────────────────────────────────────┤
│                  Data Layer (db/)                   │
│     (SQLAlchemy ORM models, session management)     │
├─────────────────────────────────────────────────────┤
│          Configuration & World Generation           │
│   (Pydantic schemas, TOML presets, seeding, RNG)    │
└─────────────────────────────────────────────────────┘
```

## Directory Map

```
~/yc_bench_fixed/
├── src/yc_bench/
│   ├── __main__.py     # CLI entry point
│   ├── agent/          # Agent runtime and loop
│   ├── cli/            # Agent-facing CLI commands
│   ├── core/           # Simulation engine
│   ├── db/             # ORM models & session
│   ├── config/         # Pydantic schemas + TOML presets
│   ├── services/       # World generation & RNG
│   └── runner/         # Benchmark orchestration
├── scripts/            # Batch running scripts
├── db/                 # SQLite databases (runtime)
├── results/            # Output JSON rollouts
├── plots/              # Result visualizations
├── pyproject.toml      # Package definition (uv-based)
└── uv.lock             # Lock file
```

## Execution Flow

1. User runs: `uv run yc-bench run --model <model> --seed 1 --config medium`
2. Runner loads config, initializes DB, seeds world, starts agent loop
3. Agent receives system prompt with company context and available CLI tools
4. Each turn: agent calls CLI commands via `run_command` tool, optionally `python_repl`
5. Agent calls `yc-bench sim resume` to advance simulation time
6. Simulation processes events (completions, payroll, milestones) and returns wake events
7. Loop continues until bankruptcy or horizon end
8. Output: rollout JSON transcript + SQLite game state
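The turn loop in steps 4-7 can be sketched as a toy function. The stall guard (forcing `sim resume` after N turns without one) is described in the simulation-engine doc; all names and callables here are illustrative, not the runner's real API:

```python
def run_loop(next_action, resume, run_cli, stall_limit=5, max_turns=20):
    """Toy shape of the agent turn loop: each turn the agent either runs
    a CLI command or calls `sim resume`; after `stall_limit` consecutive
    non-resume turns the loop forces an advance. Returns the number of
    forced advances (hypothetical shape, not the real runner)."""
    forced = 0
    since_resume = 0
    for _ in range(max_turns):
        cmd = next_action()
        if cmd == "sim resume":
            resume()                  # advance simulation time
            since_resume = 0
        else:
            run_cli(cmd)              # query/mutate state without moving the clock
            since_resume += 1
            if since_resume >= stall_limit:
                resume()              # forced advance on stall
                forced += 1
                since_resume = 0
    return forced
```

An agent that only ever browses the market triggers a forced resume every `stall_limit` turns.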

## Design Documents

| File | Topic |
|------|-------|
| [01_simulation_engine.md](01_simulation_engine.md) | Core simulation engine and event processing |
| [02_data_models.md](02_data_models.md) | Database schema and ORM design |
| [03_task_system.md](03_task_system.md) | Task lifecycle, ETA, and progress |
| [04_prestige_system.md](04_prestige_system.md) | Prestige mechanics, decay, and gating |
| [05_financial_model.md](05_financial_model.md) | Funds, payroll, ledger, and bankruptcy |
| [06_employee_model.md](06_employee_model.md) | Employee skills, throughput, and growth |
| [07_agent_layer.md](07_agent_layer.md) | LLM runtime, agent loop, and tools |
| [08_cli_interface.md](08_cli_interface.md) | CLI command groups and JSON output |
| [09_configuration.md](09_configuration.md) | Config schema, presets, and world generation |
| [10_runner_orchestration.md](10_runner_orchestration.md) | Benchmark runner, dashboard, and session |

147  system_design/01_simulation_engine.md  Normal file
@@ -0,0 +1,147 @@
# Simulation Engine

**Location**: `src/yc_bench/core/`

## Design Choice: Discrete-Event Simulation

YC-Bench uses a **discrete-event simulation (DES)** model rather than a tick-based approach. This was chosen because:

1. **Determinism**: Events are processed in a fixed, reproducible order given the same seed
2. **Efficiency**: Time jumps between events rather than iterating every hour/day
3. **Clarity**: Each state change corresponds to a meaningful event, making the simulation auditable

## Core Loop (`engine.py`)

The `advance_time()` function is the heart of the simulation:

```
advance_time(session, company_id, cfg) → AdvanceResult
```

### Algorithm

1. **Flush progress** on all active tasks (convert elapsed business hours into completed work)
2. **Apply prestige decay** for elapsed days
3. **Process payroll** if crossing a month boundary (first business day)
4. **Fetch next unconsumed event** ordered by `(scheduled_at, priority)`
5. **Dispatch to handler** based on event type
6. **Recalculate ETAs** for affected tasks
7. **Update sim_time** to the event's timestamp
8. **Return wake events** to the agent
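The core of the algorithm (steps 1, 4-5, 7-8) can be sketched over plain dicts; the real engine does this against the SQLAlchemy session, and the field names here are illustrative:

```python
def advance_time(state, events):
    """Toy shape of the advance_time() algorithm: pick the next
    unconsumed event by (scheduled_at, priority), flush progress for the
    elapsed time, consume the event, and jump the clock to it."""
    pending = [e for e in events if not e["consumed"]]
    if not pending:
        return []
    # Step 4: next unconsumed event, ordered by (scheduled_at, priority).
    event = min(pending, key=lambda e: (e["scheduled_at"], e["priority"]))
    hours = event["scheduled_at"] - state["sim_time"]
    # Step 1: flush progress on active tasks for the elapsed hours.
    for task in state["tasks"]:
        task["done"] = min(task["done"] + task["rate"] * hours, task["required"])
    event["consumed"] = True                   # step 5: dispatch (handlers omitted)
    state["sim_time"] = event["scheduled_at"]  # step 7: clock jumps to the event
    return [event]                             # step 8: wake events for the agent
```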

### Why "Resume" Rather Than Auto-Advance?

The agent explicitly calls `yc-bench sim resume` to advance time. This design:

- Gives the agent control over pacing (plan before advancing)
- Creates a natural decision checkpoint between simulation steps
- Allows multiple CLI queries before committing to advancing
- Lets the loop force a resume automatically if the agent stalls (N turns without resuming)

## Event System (`events.py`)

### Event Types (Priority Order)

| Priority | Event Type | Trigger |
|----------|-----------|---------|
| 1 | `task_completed` | Task reaches 100% in all domain requirements |
| 2 | `bankruptcy` | Funds drop below zero after payroll |
| 3 | `task_half` | Task reaches 50% progress milestone |
| 4 | `horizon_end` | Simulation time limit reached |

### Design Choice: Fixed Priority Ordering

Events at the same timestamp are processed in strict priority order. This ensures:

- Task completions (and their rewards) are processed before bankruptcy checks
- A task finishing on the same day as payroll can save the company from bankruptcy
- Deterministic behavior regardless of insertion order
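The same-timestamp guarantee falls out of a plain sort over the `(scheduled_at, priority)` key. Toy data below; the priority mapping mirrors the table above, while the sort itself is illustrative of how the engine orders its event query:

```python
# Priorities from the event-type table above.
PRIORITY = {"task_completed": 1, "bankruptcy": 2, "task_half": 3, "horizon_end": 4}

# Two events land on the same day: sorting by (scheduled_at, priority)
# applies the completion (and its reward) before the bankruptcy check,
# so the reward can still save the company.
events = [
    {"type": "bankruptcy", "at": "2026-03-01"},
    {"type": "task_completed", "at": "2026-03-01"},
]
ordered = sorted(events, key=lambda e: (e["at"], PRIORITY[e["type"]]))
```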

### Event Identity (Deterministic UUIDs)

Event IDs use `uuid5` based on payload + timestamp + dedupe_key. This means:

- Same world state produces identical event IDs
- Deduplication is automatic (re-inserting same event is a no-op)
- Full reproducibility across runs with same seed
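A minimal sketch of deterministic event identity in this spirit. The namespace and canonical serialization below are assumptions, not the engine's exact scheme:

```python
import json
import uuid

def event_id(payload: dict, scheduled_at: str, dedupe_key: str) -> uuid.UUID:
    """uuid5 over a canonical string built from payload + timestamp +
    dedupe_key: identical inputs always yield the identical UUID, so
    re-inserting the same event can be detected and skipped."""
    canonical = "|".join([json.dumps(payload, sort_keys=True), scheduled_at, dedupe_key])
    return uuid.uuid5(uuid.NAMESPACE_OID, canonical)
```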

## Event Handlers (`handlers/`)

### `task_complete.py`

- Finalizes all domain progress to 100%
- Success check: `sim_time <= deadline`
- On success: add reward funds, add prestige per domain, boost employee skill rates, apply 1% salary bump
- On failure (late): apply prestige penalty per domain (configurable multiplier)

### `task_half.py`

- Marks progress milestone reached
- Informational event for agent awareness (no state changes beyond flag)

### `bankruptcy.py`

- Triggered when `funds_cents < 0` after payroll
- Terminates the simulation with bankruptcy outcome

### `horizon_end.py`

- Triggered at configured simulation end date
- Terminates the simulation with final scoring

## Progress Tracking (`progress.py`)

### Effective Rate Calculation

```
effective_rate = base_rate_per_hour / num_active_tasks_for_this_employee
```

**Design choice**: Throughput splitting creates a resource allocation puzzle. An employee assigned to 3 tasks works at 1/3 speed on each. The agent must balance parallelism vs. focus.

### Progress Flush

When `advance_time()` runs, it calculates work done since the last flush:

```
work = effective_rate × business_hours_elapsed
completed_qty += work   (capped at required_qty)
```
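The two rules above combine into a single flush step, sketched here as a pure function (names illustrative):

```python
def flush_progress(rate_per_hour, num_active_tasks, hours, done, required):
    """Flush rule above: the employee's base rate splits evenly across
    their active tasks, and completed quantity is capped at the
    requirement."""
    effective_rate = rate_per_hour / num_active_tasks
    return min(done + effective_rate * hours, required)
```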

## Business Time (`business_time.py`)

### Design Choice: Business Hours Only

Work only happens during business hours (weekdays, configurable hours per day). This adds:

- Realistic scheduling constraints
- Weekend gaps that affect deadline calculations
- A reason for the agent to think about calendar timing

## ETA Solver (`eta.py`)

### Completion Time

```
solve_task_completion_time():
    For each domain d:
        remaining[d] = required_qty[d] - completed_qty[d]
        rate[d] = sum(effective_rate for assigned employees with skill in d)
        time[d] = remaining[d] / rate[d]
    completion_time = max(time[d]) across all domains
```

### Design Choice: Multi-Domain Bottleneck

A task completes when ALL domains finish. The slowest domain determines completion time. This creates interesting assignment puzzles where the agent must identify and address bottlenecks.

### Halfway Time

Used for progress milestone events. Calculated as the weighted midpoint across domains.

## Prestige Decay

```
apply_prestige_decay(session, company_id, days_elapsed, cfg):
    for each domain:
        prestige -= decay_per_day × days_elapsed
        prestige = max(prestige, prestige_min)   # floor at 1.0
```

**Design choice**: Decay prevents "set and forget" strategies. The agent must continuously work in domains to maintain access to high-tier tasks. Neglected domains revert to baseline.
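The decay rule reduces to a one-liner; the 0.005/day default below is the value stated in the prestige design doc, and the function shape is a sketch rather than the engine's signature:

```python
def apply_prestige_decay(prestige, days_elapsed, decay_per_day=0.005, floor=1.0):
    """Subtract linear decay for the elapsed days, flooring at
    prestige_min (1.0 by default)."""
    return max(prestige - decay_per_day * days_elapsed, floor)
```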

190  system_design/02_data_models.md  Normal file
@@ -0,0 +1,190 @@
# Data Models & Database Design

**Location**: `src/yc_bench/db/`

## Design Choice: SQLAlchemy ORM with SQLite

The benchmark uses SQLAlchemy's declarative ORM over SQLite for several reasons:

1. **Single-file persistence**: SQLite stores the entire game state in one file, making runs portable and inspectable
2. **Transactional safety**: ACID guarantees prevent partial state updates
3. **Query flexibility**: SQL allows complex queries for financial reports, task filtering, etc.
4. **Dual-backend support**: The same ORM works with PostgreSQL via the `DATABASE_URL` environment variable for production/scaling scenarios

## Schema Overview

```
┌──────────────┐     ┌───────────────────┐
│   Company    │────<│  CompanyPrestige  │  (1 per domain × company)
└──────┬───────┘     └───────────────────┘
       │
       ├────<┌──────────────┐     ┌───────────────────┐
       │     │   Employee   │────<│ EmployeeSkillRate │  (1 per domain × employee)
       │     └──────┬───────┘     └───────────────────┘
       │            │
       │            │    ┌────────────────┐
       │            └───<│ TaskAssignment │  (employee ↔ task junction)
       │                 └────────┬───────┘
       │                          │
       ├────<┌──────────┐─────────┘
       │     │   Task   │────<┌─────────────────┐
       │     └──────────┘     │ TaskRequirement │  (1 per domain × task)
       │                      └─────────────────┘
       │
       ├────<┌──────────────┐
       │     │   SimEvent   │  (discrete events queue)
       │     └──────────────┘
       │
       ├────<┌──────────────┐
       │     │ LedgerEntry  │  (financial transactions)
       │     └──────────────┘
       │
       ├────<┌──────────────┐
       │     │   SimState   │  (simulation clock & counters)
       │     └──────────────┘
       │
       └────<┌──────────────┐
             │  Scratchpad  │  (agent persistent memory)
             └──────────────┘
```

## Model Details

### Company (`models/company.py`)

| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID (PK) | Auto-generated |
| `name` | String | Company name |
| `funds_cents` | BigInteger | Financial balance in cents |

**Design choice**: Funds stored in cents (integer) to avoid floating-point rounding errors in financial calculations. BigInteger supports very large/negative values.
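The integer-cents convention keeps all balance arithmetic exact, converting to dollars only at display time. A small illustrative helper (the function name is hypothetical, not part of the codebase):

```python
def format_funds(funds_cents: int) -> str:
    """Render an integer-cents balance as a dollar string; arithmetic on
    the balance itself never touches floats."""
    sign = "-" if funds_cents < 0 else ""
    dollars, cents = divmod(abs(funds_cents), 100)
    return f"{sign}${dollars:,}.{cents:02d}"
```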

### CompanyPrestige (`models/company.py`)

| Column | Type | Notes |
|--------|------|-------|
| `company_id` | UUID (FK) | References Company |
| `domain` | String | research / inference / data_environment / training |
| `prestige_level` | Float | Range [1.0, 10.0] |

**Design choice**: Prestige is tracked per-domain rather than as a single score. This forces specialization trade-offs and creates a 4-dimensional progression space.

### Employee (`models/employee.py`)

| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID (PK) | Auto-generated |
| `company_id` | UUID (FK) | References Company |
| `name` | String | Employee name |
| `tier` | String | junior / mid / senior |
| `work_hours_per_day` | Float | Hours available per business day |
| `salary_cents` | BigInteger | Monthly salary in cents |

### EmployeeSkillRate (`models/employee.py`)

| Column | Type | Notes |
|--------|------|-------|
| `employee_id` | UUID (FK) | References Employee |
| `domain` | String | One of 4 domains |
| `rate_domain_per_hour` | Float | Work units produced per hour |

**Design choice**: Skill rates are **hidden from the agent**. The agent sees tier and salary but not per-domain effectiveness. This creates an information asymmetry puzzle -- the agent must infer employee strengths from task outcomes.

### Task (`models/task.py`)

| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID (PK) | Auto-generated |
| `company_id` | UUID (FK, nullable) | NULL = market task, set on acceptance |
| `status` | Enum | market → planned → active → completed_success / completed_fail / cancelled |
| `title` | String | Task description |
| `required_prestige` | Float | Minimum prestige needed in ALL task domains |
| `reward_funds_cents` | BigInteger | Payment on successful completion |
| `reward_prestige_delta` | Float | Prestige gained per domain on success |
| `skill_boost_pct` | Float | Employee skill rate increase on success |
| `accepted_at` | DateTime (nullable) | When task was accepted from market |
| `deadline` | DateTime (nullable) | Calculated at acceptance |
| `completed_at` | DateTime (nullable) | When task finished |
| `success` | Boolean (nullable) | True = on-time, False = late |
| `progress_milestone_pct` | Float | Tracks progress milestones (e.g., 50%) |

**Design choice**: `company_id` being nullable elegantly distinguishes market tasks (available for browsing) from accepted tasks (owned by the company).

### TaskRequirement (`models/task.py`)

| Column | Type | Notes |
|--------|------|-------|
| `task_id` | UUID (FK) | References Task |
| `domain` | String | Which domain this requirement covers |
| `required_qty` | Float | Total work units needed |
| `completed_qty` | Float | Work units completed so far |

**Design choice**: Multi-domain requirements make tasks a multi-dimensional optimization problem. A task might need work in 2-4 domains simultaneously.

### TaskAssignment (`models/task.py`)

| Column | Type | Notes |
|--------|------|-------|
| `task_id` | UUID (FK) | References Task |
| `employee_id` | UUID (FK) | References Employee |
| `assigned_at` | DateTime | When assigned |

**Design choice**: Many-to-many junction table. An employee can work on multiple tasks (throughput splits), and a task can have multiple employees (parallel progress).

### SimEvent (`models/event.py`)

| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID (PK) | Deterministic (uuid5) |
| `company_id` | UUID (FK) | References Company |
| `event_type` | String | task_completed / bankruptcy / task_half / horizon_end |
| `scheduled_at` | DateTime | When event triggers |
| `payload` | JSON | Event-specific data |
| `dedupe_key` | String | Prevents duplicate events |
| `consumed` | Boolean | True after processing |

### LedgerEntry (`models/ledger.py`)

| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID (PK) | Auto-generated |
| `company_id` | UUID (FK) | References Company |
| `occurred_at` | DateTime | Transaction timestamp |
| `category` | Enum | MONTHLY_PAYROLL / TASK_REWARD / TASK_FAIL_PENALTY / TASK_CANCEL_PENALTY |
| `amount_cents` | BigInteger | Signed amount (negative = cost) |
| `ref_type` | String (nullable) | Reference entity type |
| `ref_id` | UUID (nullable) | Reference entity ID |

**Design choice**: Immutable append-only ledger provides a complete financial audit trail. No entries are ever deleted or modified.

### SimState (`models/sim_state.py`)

| Column | Type | Notes |
|--------|------|-------|
| `company_id` | UUID (FK, PK) | References Company |
| `sim_time` | DateTime | Current simulation clock |
| `run_seed` | Integer | RNG seed for reproducibility |
| `horizon_end` | DateTime | When simulation ends |
| `replenish_counter` | Integer | Tracks market task replenishment |

### Scratchpad (`models/scratchpad.py`)

| Column | Type | Notes |
|--------|------|-------|
| `company_id` | UUID (FK) | References Company |
| `content` | Text | Free-form agent notes |

**Design choice**: Scratchpad survives LLM context truncation, giving the agent persistent memory across the full simulation.

## Session Management (`session.py`)

```python
session_scope(factory) → context manager
```

- Creates a scoped session with automatic commit/rollback
- Supports both SQLite (default) and PostgreSQL (via `DATABASE_URL`)
- `init_db()` creates all tables from ORM metadata

**Design choice**: Context manager pattern ensures every database operation is properly transacted, preventing partial state updates that would corrupt the simulation.
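A minimal sketch of the commit-or-rollback pattern described above (the real helper also wires up engine/backend selection; this is just the transactional shape):

```python
from contextlib import contextmanager

@contextmanager
def session_scope(factory):
    """Yield a session from `factory`; commit on clean exit, roll back
    and re-raise on error, always close."""
    session = factory()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()
```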

144  system_design/03_task_system.md  Normal file
@@ -0,0 +1,144 @@
# Task System

**Location**: `src/yc_bench/cli/task_commands.py`, `src/yc_bench/core/eta.py`, `src/yc_bench/core/progress.py`

## Task Lifecycle

```
market ──accept──> planned ──dispatch──> active ──complete──> completed_success
            │                              │                  completed_fail
            │                              │
            └──cancel──> cancelled <──cancel┘
```

### States

| Status | Meaning |
|--------|---------|
| `market` | Available for browsing, not yet accepted |
| `planned` | Accepted by company, employees can be assigned |
| `active` | Dispatched, work is progressing |
| `completed_success` | Finished on time |
| `completed_fail` | Finished late (past deadline) |
| `cancelled` | Abandoned by agent |

## Design Choices

### Two-Phase Activation (Accept → Dispatch)

Tasks go through `planned` before `active`. This separation:

1. **Allows pre-assignment**: Agent can assign employees before starting the clock
2. **Deadline starts at accept**: Creates urgency -- planning time counts against the deadline
3. **Forces commitment**: Accepting a task reserves it, but the agent must still dispatch

### Deadline Calculation

```
deadline = accepted_at + max(required_qty[d] for all domains d) / deadline_qty_per_day
```

**Design choice**: Deadline is proportional to the largest single-domain requirement, not the sum. This means multi-domain tasks don't get proportionally more time -- they require parallel work.
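The deadline formula is easy to check numerically. Calendar days are used below for simplicity, while the engine reasons in business time; the function shape is a sketch:

```python
from datetime import datetime, timedelta

def compute_deadline(accepted_at, required_qty_by_domain, deadline_qty_per_day):
    """Deadline rule above: the time budget scales with the LARGEST
    single-domain requirement, not the sum."""
    days = max(required_qty_by_domain.values()) / deadline_qty_per_day
    return accepted_at + timedelta(days=days)
```

For a task needing 100 research units and 50 training units at 10 units/day, the budget is 10 days (not 15): the training work must happen in parallel.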

### Prestige Gating at Accept Time

```python
def task_accept(task_id):
    for domain in task.requirements:
        if company_prestige[domain] < task.required_prestige:
            reject(f"Insufficient prestige in {domain}")
```

**Design choice**: Prestige check is per-domain. A task requiring prestige 3.0 with requirements in `research` and `inference` needs prestige >= 3.0 in BOTH domains. This prevents gaming by maxing one domain.

### Cancel Penalties

Cancelling an active task incurs:

- Prestige penalty: `reward_prestige_delta × cancel_multiplier` (configurable per difficulty)
- No financial penalty (just lost opportunity)

**Design choice**: Cancel penalties prevent the strategy of accepting everything and dropping what's inconvenient. Higher difficulties increase the cancel multiplier.

## Employee Assignment

### Assignment Rules

- Employees can only be assigned to `planned` or `active` tasks
- An employee can work on multiple tasks simultaneously (throughput splits)
- Multiple employees can work on the same task (parallel progress)

### Throughput Splitting

```
effective_rate = base_rate_per_hour / num_active_tasks
```

**Design choice**: Linear throughput splitting creates a fundamental trade-off:

- **Focus**: 1 employee on 1 task = full speed
- **Parallel**: 1 employee on 3 tasks = 1/3 speed each
- The agent must decide between fast completion of few tasks vs. slow progress on many
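The trade-off above can be made concrete with a tiny helper (illustrative name):

```python
def completion_hours(required_qty, rate_per_hour, concurrent_tasks):
    """Focus-vs-parallel trade-off: an employee on N tasks works at 1/N
    rate on each, so each individual task takes N times as long."""
    return required_qty / (rate_per_hour / concurrent_tasks)
```

A 40-unit task at 2 units/hour takes 20 hours when the employee focuses on it alone, and 60 hours when they juggle three tasks.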

## Progress Tracking (`progress.py`)

### How Work Gets Done

Progress is calculated lazily during `advance_time()`:

```
for each active task:
    for each assigned employee:
        for each domain in task requirements:
            work = employee.skill_rate[domain] / num_active_tasks × business_hours
            requirement.completed_qty += work
            requirement.completed_qty = min(completed_qty, required_qty)
```

### Multi-Domain Completion

A task is complete when ALL domain requirements reach `completed_qty >= required_qty`. The slowest domain is the bottleneck.

**Design choice**: This creates interesting optimization puzzles. If a task needs 100 units of research and 50 units of training, the agent should allocate more research-skilled employees to balance completion times.

## ETA Solver (`eta.py`)

### Completion Time Calculation

```
solve_task_completion_time(task, assignments, sim_time):
    for each domain d:
        remaining = required_qty[d] - completed_qty[d]
        rate = sum(effective_rate[emp][d] for emp in assignments)
        if rate == 0:
            return infinity   # no one can work on this domain
        hours_needed[d] = remaining / rate

    max_hours = max(hours_needed.values())
    return sim_time + max_hours   (in business hours)
```
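The pseudocode above translates directly into a runnable sketch (data shapes are illustrative; the real solver works over ORM objects and business time):

```python
import math

def solve_completion_hours(requirements, rates):
    """Bottleneck rule: per-domain remaining work divided by the summed
    assigned rate; the slowest domain sets the ETA, and a domain with no
    coverage yields math.inf."""
    hours = []
    for domain, (required, completed) in requirements.items():
        rate = rates.get(domain, 0.0)
        if rate == 0.0:
            return math.inf  # no assigned employee can work on this domain
        hours.append((required - completed) / rate)
    return max(hours)
```

With 60 research units left at 3 units/hour (20 h) and 50 training units at 1 unit/hour (50 h), training is the bottleneck and the ETA is 50 hours.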

### Halfway Time Calculation

Used for milestone events. Finds the time when the weighted average across domains reaches 50%.

### When ETAs Are Recalculated

- Task dispatched (new active task)
- Employee assigned/unassigned
- Task completed (frees employee throughput for other tasks)
- Task cancelled (same)

**Design choice**: Dynamic ETA recalculation ensures events are always accurate. When an employee is reassigned, all affected tasks get new completion projections.

## Market Task Generation

See [09_configuration.md](09_configuration.md) for details on how market tasks are generated with stratified prestige distribution and randomized requirements.

### Browsing and Filtering

The `market browse` command supports:

- Domain filter
- Prestige range filter
- Reward range filter
- Pagination (offset/limit)

All output is JSON for agent consumption.

123  system_design/04_prestige_system.md  Normal file
@@ -0,0 +1,123 @@
# Prestige System

**Location**: `src/yc_bench/db/models/company.py` (CompanyPrestige), `src/yc_bench/core/engine.py` (decay), `src/yc_bench/core/handlers/task_complete.py` (rewards/penalties)

## Overview

Prestige is YC-Bench's core progression mechanic. It controls access to higher-tier tasks (which offer better rewards) and decays over time, forcing continuous engagement.

## Design Choices

### Per-Domain Prestige (4 Independent Tracks)

```
research:         ████████░░ (8.0)
inference:        ██████░░░░ (6.0)
data_environment: ███░░░░░░░ (3.0)
training:         █████░░░░░ (5.0)
```

**Why 4 domains?** This creates a 4-dimensional strategic space:

- The agent can't max all domains simultaneously (decay + limited employees)
- Specialization unlocks high-tier tasks in 1-2 domains
- Diversification provides resilience but slower progression
- Multi-domain tasks require balanced prestige across their domains

### Prestige Range: [1.0, 10.0]

| Level | Meaning |
|-------|---------|
| 1.0 | Minimum (starting/decayed) |
| 3.0-4.0 | Mid-tier tasks accessible |
| 7.0-8.0 | High-tier tasks accessible |
| 10.0 | Maximum (hard cap) |

**Design choice**: The 1-10 range is intuitive and provides enough granularity for meaningful gating tiers without over-complicating the system.

## Prestige Gain

On successful task completion (on-time):

```
for each domain in task.requirements:
    company_prestige[domain] += task.reward_prestige_delta
    company_prestige[domain] = min(company_prestige[domain], 10.0)   # cap
```

**Design choice**: Prestige gain is per-domain and tied to the task's requirements. Completing a research+inference task only boosts those two domains, not training or data_environment.

### Prestige Scaling of Rewards

```
actual_reward = base_reward × (1 + reward_prestige_scale × (prestige - 1))
```

Higher prestige in a domain means better financial returns from tasks in that domain. This creates a virtuous cycle: more prestige → more money → more capacity → more prestige.
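The scaling formula as a function. The 0.1 scale factor below is an assumed illustration, not a documented default:

```python
def scaled_reward(base_reward_cents, prestige, reward_prestige_scale=0.1):
    """Reward-scaling formula above: at prestige 1.0 the reward is the
    base; each point above 1.0 adds reward_prestige_scale to the
    multiplier. Rounded back to integer cents."""
    return round(base_reward_cents * (1 + reward_prestige_scale * (prestige - 1)))
```

At a 0.1 scale, prestige 6.0 turns a $1,000.00 base reward into $1,500.00.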
|
||||||
|
|
||||||
|
## Prestige Loss

### Decay (Daily)

```
prestige -= decay_per_day × days_elapsed
prestige = max(prestige, 1.0)  # floor
```

Default decay rate: -0.005/day. This is slow enough to not punish short gaps but fast enough that inactive domains eventually return to baseline.

**Design choice**: Continuous decay prevents "build once, exploit forever" strategies. The agent must continuously complete tasks in a domain to maintain access.

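Decay with the floor can be sketched as (using the default rate quoted above):

```python
def decay_prestige(prestige: float, days_elapsed: float,
                   decay_per_day: float = 0.005) -> float:
    """Apply daily prestige decay, clamped at the 1.0 floor."""
    return max(1.0, prestige - decay_per_day * days_elapsed)
```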
### Failure Penalty

On late task completion:

```
for each domain in task.requirements:
    company_prestige[domain] -= task.reward_prestige_delta × fail_multiplier
    company_prestige[domain] = max(company_prestige[domain], 1.0)
```

Default `fail_multiplier`: 0.8. Late completion costs almost as much prestige as success would have gained.

### Cancel Penalty

On task cancellation:

```
for each domain in task.requirements:
    company_prestige[domain] -= task.reward_prestige_delta × cancel_multiplier
    company_prestige[domain] = max(company_prestige[domain], 1.0)
```

Cancel multipliers vary by difficulty (higher on hard/nightmare).

## Prestige Gating

Tasks have a `required_prestige` field. At task acceptance:

```python
for domain in task.requirements:
    if company_prestige[domain] < task.required_prestige:
        reject()  # must meet prestige in ALL task domains
```

**Design choice**: Per-domain gating means a task with `required_prestige=5.0` and requirements in research + training needs prestige >= 5.0 in BOTH research AND training. This prevents gaming.

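A minimal sketch of that acceptance check (the function name is hypothetical, not from the repo):

```python
def can_accept(task_requirements: list[str], required_prestige: float,
               company_prestige: dict[str, float]) -> bool:
    """Per-domain gating: every domain the task touches must meet the bar."""
    return all(company_prestige.get(domain, 1.0) >= required_prestige
               for domain in task_requirements)
```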
### Stratified Market Tasks

The first 10 market tasks are always prestige-1 (accessible immediately). Higher-prestige tasks are introduced with a stratified distribution. This ensures:

- The agent always has something to work on initially
- Progression is visible (new tasks unlock as prestige grows)
- No dead-end states where the agent can't accept any task

## Strategic Implications

The prestige system creates several key strategic tensions:

1. **Specialize vs. Diversify**: Focus on 1-2 domains for deep access, or spread across all 4?
2. **Risk vs. Reward**: High-prestige tasks pay more but failure costs more prestige
3. **Maintenance vs. Growth**: Should the agent keep working in mastered domains (maintenance) or push new ones (growth)?
4. **Accept vs. Defer**: Taking a task you might fail risks prestige loss; waiting risks decay

These tensions make the benchmark more than just "do tasks fast" -- it tests genuine strategic reasoning.

162
system_design/05_financial_model.md
Normal file

# Financial Model

**Location**: `src/yc_bench/db/models/ledger.py`, `src/yc_bench/cli/finance_commands.py`, `src/yc_bench/cli/report_commands.py`, `src/yc_bench/core/handlers/`

## Overview

The financial model simulates a startup's cash flow: revenue from completed tasks, costs from employee payroll, and penalties for failures. Running out of money triggers bankruptcy and ends the simulation.

## Design Choices

### Cents-Based Integer Arithmetic

All financial values are stored as `BigInteger` in cents:

```
$1,000.00 = 100_000 cents
```

**Why cents?** Floating-point arithmetic introduces rounding errors that compound over hundreds of transactions. Integer cents guarantee exact financial accounting -- critical for a deterministic benchmark.

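Display formatting then becomes a pure presentation step over exact integers, along these lines (helper name hypothetical):

```python
def format_cents(amount_cents: int) -> str:
    """Render integer cents as a dollar string, e.g. 100_000 -> '$1,000.00'."""
    sign = "-" if amount_cents < 0 else ""
    dollars, cents = divmod(abs(amount_cents), 100)
    return f"{sign}${dollars:,}.{cents:02d}"
```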
### Immutable Append-Only Ledger

Every financial transaction creates a `LedgerEntry` that is never modified or deleted:

```python
class LedgerEntry:
    category: MONTHLY_PAYROLL | TASK_REWARD | TASK_FAIL_PENALTY | TASK_CANCEL_PENALTY
    amount_cents: int      # negative for costs, positive for revenue
    occurred_at: datetime
    ref_type: str          # optional reference to source entity
    ref_id: UUID           # optional reference ID
```

**Why immutable?** An append-only ledger provides:

- Complete audit trail for debugging
- Ability to reconstruct balance at any point in time
- No risk of silent data corruption
- Natural fit for the `finance ledger` and `report monthly` CLI commands

## Revenue Sources

### Task Rewards

On successful (on-time) completion:

```
reward = base_reward × (1 + prestige_scale × (avg_prestige - 1))
```

Where `avg_prestige` is averaged across the task's required domains. Higher prestige = higher payouts.

**Design choice**: Prestige-scaled rewards create a positive feedback loop that mirrors real business dynamics -- reputation leads to better opportunities.

### Revenue Timing

Rewards are credited immediately upon task completion (when the `task_completed` event fires with `success=True`).

## Cost Sources

### Monthly Payroll

Payroll is deducted on the **first business day** of each month:

```
total_payroll = sum(employee.salary_cents for all employees)
```

**Design choice**: Monthly payroll creates predictable but unavoidable costs. The agent must maintain positive cash flow to cover it.

### Salary Bumps

Each completed task increases the salaries of its assigned employees:

```
for each assigned employee:
    salary_cents *= 1.01  # 1% increase per completion
```

**Design choice**: Compounding salary increases mean success has a hidden cost. Long-running simulations see payroll grow substantially, creating late-game financial pressure even as task rewards scale with prestige.

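The compounding effect is easy to project; a sketch (function name hypothetical, 1% bump from the rule above):

```python
def salary_after_completions(salary_cents: int, completions: int,
                             bump: float = 1.01) -> int:
    """Project an employee's salary after N task completions (1% bump each)."""
    return round(salary_cents * bump ** completions)
```

An $8,000/mo employee roughly doubles in cost after about 70 completions.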
### Failure Penalties

Late task completion incurs no direct financial penalty beyond the missed reward opportunity. However, the prestige loss from failure reduces future reward scaling.

### Cancel Penalties

Cancellation may incur a financial penalty depending on configuration (some presets charge a fraction of the reward).

## Payroll-Event Tie-Breaking

When payroll and events fall on the same timestamp:

```
Payroll is processed BEFORE events
```

**Design choice**: This ordering is critical. If a task completes on the same day as payroll:

1. Payroll deducts first (may push funds negative)
2. Task completion reward credits (may save from bankruptcy)
3. Bankruptcy check happens after both

This gives the agent the benefit of the doubt -- a task completing on payday can save the company.

## Bankruptcy

Bankruptcy triggers when `funds_cents < 0` after payroll processing:

```python
if company.funds_cents < 0:
    insert_bankruptcy_event(session, company_id, sim_time)
```

**Design choice**: Bankruptcy is checked only after payroll (not after penalties). This simplifies the model and makes payroll the primary survival constraint.

### Bankruptcy as Terminal State

Once bankruptcy fires, the simulation ends. There is no recovery mechanic.

**Why no bailout?** The benchmark tests whether the agent can sustainably manage a business. Allowing recovery would dilute this signal.

## Financial Reports

### Ledger Query (`finance ledger`)

The agent can query the full transaction history with filters:

- Category filter
- Date range filter
- Pagination

### Monthly P&L (`report monthly`)

Aggregates transactions by month:

```
Month     Revenue   Payroll   Penalties   Net
2025-01   $50,000   $30,000   $0          $20,000
2025-02   $35,000   $30,300   $5,000      -$300
```

**Design choice**: Structured financial reporting gives the agent the data it needs to make informed decisions about task selection and resource allocation.

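The monthly aggregation can be sketched over simplified ledger rows (the tuple shape here is a stand-in for actual `LedgerEntry` objects):

```python
from collections import defaultdict

def monthly_net(entries: list[tuple[str, int]]) -> dict[str, int]:
    """Aggregate (iso_date, amount_cents) ledger rows into per-month net cents.

    Negative amounts are costs, positive amounts are revenue.
    """
    totals: dict[str, int] = defaultdict(int)
    for occurred_at, amount_cents in entries:
        totals[occurred_at[:7]] += amount_cents  # bucket by "YYYY-MM"
    return dict(totals)
```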
## Runway Calculation

The `company status` command includes a runway estimate:

```
runway_months = funds_cents / monthly_payroll_cents
```

This helps the agent gauge urgency. Low runway signals that the agent needs profitable tasks quickly.

## Difficulty Scaling

Financial pressure scales with difficulty preset:

| Preset | Initial Funds | Payroll Pressure | Penalties |
|--------|---------------|------------------|-----------|
| tutorial | Very high | Low | Minimal |
| easy | High | Moderate | Low |
| medium | Moderate | Moderate | Standard |
| hard | Low | High | 1.5x |
| nightmare | Very low | Very high | 2x |

143
system_design/06_employee_model.md
Normal file

# Employee Model

**Location**: `src/yc_bench/db/models/employee.py`, `src/yc_bench/services/generate_employees.py`, `src/yc_bench/core/progress.py`

## Overview

Employees are the company's productive resources. Each has a tier, salary, and hidden per-domain skill rates. The agent must figure out who is good at what through observation and assign them optimally.

## Design Choices

### Hidden Skill Rates (Information Asymmetry)

The agent sees:

- Employee name, tier (junior/mid/senior), salary
- Which tasks they're currently assigned to

The agent does NOT see:

- Per-domain skill rates (`rate_domain_per_hour`)
- Actual work output per hour

**Why hidden?** This is a core benchmark design decision:

1. **Tests inference ability**: The agent must infer strengths from task completion patterns
2. **Mirrors reality**: Real managers don't have exact productivity metrics for every skill dimension
3. **Creates learning opportunity**: Early task assignments serve as "probes" to discover team capabilities
4. **Rewards memory**: Agents that remember past performance can make better future assignments

### Tier System

| Tier | Typical Rate Range | Salary Range |
|------|--------------------|--------------|
| junior | Low | Low |
| mid | Medium | Medium |
| senior | High | High |

**Design choice**: Tiers provide a rough signal. Seniors are generally better but not always in every domain. A junior might excel in one domain while a senior is mediocre there. The tier-salary correlation creates a cost-benefit trade-off.

### Per-Domain Skill Rates

Each employee has 4 skill rates (one per domain):

```python
class EmployeeSkillRate:
    domain: str                  # research, inference, data_environment, training
    rate_domain_per_hour: float  # work units produced per business hour
```

Rates are generated from configurable distributions (triangular, beta, etc.) during world seeding. Some employees are specialists (high in one domain, low in others); some are generalists.

**Design choice**: The 4-rate vector per employee creates a rich assignment optimization space. Optimal assignment requires matching employee strengths to task domain requirements.

## Throughput Splitting

When an employee works on multiple active tasks simultaneously:

```
effective_rate = base_rate / num_active_tasks
```

**Design choice**: Linear splitting (not diminishing returns or context-switching penalties) was chosen for simplicity and predictability. The agent can reason about it without hidden costs.

### Example

Employee Alice has `research_rate = 2.0/hr`:

- Assigned to 1 task: contributes 2.0 research units/hour
- Assigned to 3 tasks: contributes 0.67 research units/hour to each

### Implication for Strategy

The agent faces a fundamental trade-off:

- **Focused assignment**: 1 employee → 1 task = fastest completion but no parallelism
- **Spread assignment**: 1 employee → N tasks = slower per task but progress on multiple fronts
- **Optimal**: Match the strategy to deadline pressure and task urgency

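The linear split above can be sketched as a small helper (name hypothetical):

```python
def effective_rates(base_rates: dict[str, float],
                    num_active_tasks: int) -> dict[str, float]:
    """Split an employee's per-domain rates evenly across active tasks."""
    if num_active_tasks == 0:
        return {domain: 0.0 for domain in base_rates}  # idle: no contribution
    return {domain: rate / num_active_tasks
            for domain, rate in base_rates.items()}
```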
## Skill Growth

On successful task completion, assigned employees get a skill boost:

```
for each assigned employee:
    for each domain in task.requirements:
        skill_rate[domain] *= (1 + task.skill_boost_pct / 100)
```

**Design choice**: Skill growth compounds over time. Early investments in employee development pay off later through faster task completion. This creates a "training vs. exploiting" tension.

### Salary Bumps (Hidden Cost of Growth)

Each task completion also increases salaries:

```
for each assigned employee:
    salary_cents *= 1.01  # 1% increase
```

**Design choice**: Salary bumps mean that experienced employees cost more. The agent can't infinitely scale employee productivity without also scaling costs. After many completions, payroll may become a significant burden.

## Employee Generation (`generate_employees.py`)

### Process

1. Generate 10 employees per company (configurable)
2. Assign tiers based on configured distribution (e.g., 30% junior, 40% mid, 30% senior)
3. For each employee, generate 4 skill rates from per-tier distributions
4. Set salary based on tier bracket

### Distribution Types

Skill rates are drawn from configurable distributions:

- **Triangular**: min/mode/max (default -- creates realistic bell-curve-like distributions)
- **Beta**: alpha/beta parameters (useful for skewed distributions)
- **Normal**: mean/std (truncated to positive values)
- **Uniform**: low/high
- **Constant**: fixed value

**Design choice**: Configurable distributions allow difficulty presets to create different workforce profiles. Tutorial mode might use tight distributions (predictable employees), while nightmare mode uses wide distributions (unpredictable).

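Seeded sampling from the default triangular distribution might look like this; the low/mode/high values are illustrative, not the repo's actual per-tier parameters:

```python
import random

DOMAINS = ("research", "inference", "data_environment", "training")

def sample_skill_rates(seed: int, low: float = 0.5, mode: float = 1.5,
                       high: float = 3.0) -> dict[str, float]:
    """Draw a reproducible 4-rate skill vector from a triangular distribution."""
    rng = random.Random(seed)  # seeded so world generation is deterministic
    return {d: round(rng.triangular(low, high, mode), 2) for d in DOMAINS}
```

Because the RNG is seeded, the same seed always produces the same workforce, which is what makes the benchmark deterministic end to end.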
## Employee Visibility to Agent

The `employee list` CLI command returns:

```json
{
  "employees": [
    {
      "id": "uuid",
      "name": "Alice Chen",
      "tier": "senior",
      "salary": "$8,000/mo",
      "active_tasks": 2
    }
  ]
}
```

Note: no skill rates, no per-domain breakdown, no historical performance. The agent must build this knowledge through experience.

## Strategic Considerations

1. **Discovery phase**: Early on, assign different employees to different domain tasks to learn strengths
2. **Specialization**: Once strengths are known, match employees to their best domains
3. **Load balancing**: Avoid overloading one employee (throughput splitting penalty)
4. **Growth investment**: Assign employees to tasks in domains where they need improvement
5. **Cost awareness**: Track which employees have had many salary bumps

243
system_design/07_agent_layer.md
Normal file

# Agent Layer

**Location**: `src/yc_bench/agent/`

## Overview

The agent layer connects an LLM to the simulation via a tool-use interface. It manages the conversation loop, prompt construction, tool execution, and run state tracking.

## Architecture

```
┌─────────────────────────┐
│       Agent Loop        │
│        (loop.py)        │
├─────────────────────────┤
│ ┌──────────┐ ┌───────┐  │
│ │  Prompt  │ │ Tools │  │
│ │ Builder  │ │       │  │
│ └──────────┘ └───────┘  │
├─────────────────────────┤
│       LLM Runtime       │
│       (runtime/)        │
│   LiteLLM abstraction   │
├─────────────────────────┤
│ Run State / Transcript  │
│     (run_state.py)      │
└─────────────────────────┘
```

## Design Choices

### LiteLLM as LLM Abstraction (`runtime/`)

The agent uses [LiteLLM](https://github.com/BerriAI/litellm) to abstract away vendor differences:

```python
# Supports: Anthropic, OpenAI, OpenRouter, Google Gemini, etc.
response = litellm.completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=messages,
    tools=tools,
)
```

**Why LiteLLM?**

- Single interface for all major LLM providers
- Consistent tool-use format across providers
- Easy to benchmark different models on the same scenarios
- Handles auth, retries, and format conversion

### Tool-Use Interface (Not Text Parsing)

The agent interacts via structured tool calls, not text command parsing:

```json
{
  "name": "run_command",
  "arguments": {
    "command": "yc-bench task list --status active"
  }
}
```

**Why tool-use?**

- Eliminates parsing ambiguity
- Works with all modern LLMs' native tool-use
- Structured output from CLI commands (JSON) flows cleanly back
- Reduces error rate vs. free-text command generation

### Available Tools

#### `run_command`

Executes CLI commands in a subprocess. The agent can run any `yc-bench` CLI command.

```python
def run_command(command: str) -> str:
    """Execute a yc-bench CLI command and return output."""
```

**Design choice**: Subprocess execution provides isolation. The agent can't accidentally modify simulation state outside of defined CLI commands.

#### `python_repl` (Optional)

A persistent Python interpreter for calculations and data analysis.

```python
def python_repl(code: str) -> str:
    """Execute Python code and return output."""
```

**Design choice**: Some agents benefit from being able to compute (e.g., calculate optimal assignments, project cash flow). This tool is optional and configurable.

## Agent Loop (`loop.py`)

### Main Loop

```python
def run_agent_loop(runtime, session, company_id, cfg):
    while not terminal:
        # Build messages (system prompt + history)
        messages = build_messages(history, context)

        # Call LLM
        response = runtime.completion(messages, tools)

        # Process tool calls
        for tool_call in response.tool_calls:
            result = execute_tool(tool_call)
            history.append(tool_call, result)

        # Check for terminal conditions
        if is_terminal(result):
            break

        # Auto-resume if agent hasn't advanced simulation
        if turns_since_resume > max_turns_without_resume:
            force_resume()
```

### Design Choices in the Loop

#### History Truncation

```python
# Keep only last N turns to fit context window
messages = system_prompt + history[-max_history_turns:]
```

**Why truncate?** Long simulations generate hundreds of turns. Without truncation, the context would exceed any model's window. The scratchpad CLI command compensates for lost history.

#### Auto-Resume Forcing

If the agent doesn't call `yc-bench sim resume` for N turns, the loop forces one:

```python
if turns_since_resume > cfg.loop.max_turns_without_resume:
    result = execute("yc-bench sim resume")
```

**Why force?** Some models get stuck in analysis loops, repeatedly querying state without advancing. Auto-resume prevents infinite loops and ensures forward progress.

#### Turn Budget

The loop has a maximum turn count. This prevents runaway agents and bounds benchmark cost.

## Prompt Construction (`prompt.py`)

### System Prompt Structure

```
1. Role description ("You are the CEO of an AI startup...")
2. Available commands reference
3. Current company status summary
4. Strategic guidance (domain, prestige, deadlines)
5. Constraints and rules
```

**Design choice**: The system prompt provides enough context for the agent to understand its role without revealing internal mechanics (like hidden skill rates or exact formulas).

### Context Building

Each turn, the prompt may include:

- Wake events from the last `sim resume`
- Current funds and runway
- Active task count and approaching deadlines
- Prestige levels

This contextual information helps the agent make informed decisions without needing to query every turn.

## Run State (`run_state.py`)

### Transcript Recording

Every turn is recorded:

```python
{
    "turn": 42,
    "messages": [...],
    "tool_calls": [...],
    "tool_results": [...],
    "timestamp": "2025-03-15T10:30:00",
    "tokens_used": 1500
}
```

**Design choice**: Full transcripts enable:

- Post-hoc analysis of agent strategy
- Debugging agent failures
- Benchmark scoring based on decision quality
- Comparison across models

### Output Format

The final rollout is saved as JSON:

```json
{
  "model": "anthropic/claude-sonnet-4-20250514",
  "seed": 42,
  "config": "medium",
  "outcome": "horizon_end",
  "final_funds": 250000,
  "final_prestige": {"research": 7.2, ...},
  "turns": 187,
  "transcript": [...]
}
```

## Command Execution Policy (`commands/`)

### Command Allowlist

The agent can only execute `yc-bench` CLI commands. Arbitrary shell commands are blocked.

**Design choice**: Restricting to the CLI API ensures:

- No direct database manipulation
- No simulation state bypass
- Fair comparison across models
- Deterministic state transitions

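The allowlist idea can be sketched as a predicate; the real policy in `commands/` may be stricter, and the function name is hypothetical:

```python
import shlex

def is_allowed(command: str) -> bool:
    """Allowlist sketch: only `yc-bench ...` invocations may run."""
    parts = shlex.split(command)
    return bool(parts) and parts[0] == "yc-bench"
```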
### Error Handling

Invalid commands return structured error messages:

```json
{"error": "Task not found", "task_id": "..."}
```

**Design choice**: Structured errors help the agent understand and recover from mistakes, rather than receiving opaque stack traces.

## Retry and Timeout Logic

```python
# Exponential backoff for LLM API calls
for attempt in range(max_retries):
    try:
        response = runtime.completion(messages, tools)
        break
    except RateLimitError:
        wait(2 ** attempt)
```

**Design choice**: LLM APIs are unreliable. Retry logic ensures transient failures don't corrupt benchmark runs.

173
system_design/08_cli_interface.md
Normal file

# CLI Interface

**Location**: `src/yc_bench/cli/`

## Overview

The CLI is the agent's sole interface to the simulation. Every command returns structured JSON, enabling reliable parsing by LLMs.

## Design Choices

### JSON-Only Output

All CLI commands return JSON, never free-text:

```bash
$ yc-bench company status
{
  "company_name": "Nexus AI",
  "funds": "$150,000.00",
  "funds_cents": 15000000,
  "monthly_payroll": "$30,000.00",
  "runway_months": 5.0,
  "prestige": {
    "research": 3.5,
    "inference": 2.1,
    "data_environment": 1.0,
    "training": 4.2
  }
}
```

**Why JSON?**

- Unambiguous parsing by LLMs (vs. formatted tables)
- Consistent structure across all commands
- Easy to pipe into `python_repl` for analysis
- Machine-readable without regex or text parsing

### Command Group Organization

| Group | File | Purpose |
|-------|------|---------|
| `company` | `company_commands.py` | Company status, prestige overview |
| `employee` | `employee_commands.py` | Employee listing and details |
| `market` | `market_commands.py` | Browse available tasks |
| `task` | `task_commands.py` | Task lifecycle (accept/assign/dispatch/cancel/inspect/list) |
| `sim` | `sim_commands.py` | Simulation control (resume) |
| `finance` | `finance_commands.py` | Ledger queries |
| `report` | `report_commands.py` | Monthly P&L reports |
| `scratchpad` | `scratchpad_commands.py` | Persistent agent memory |

**Design choice**: Command groups mirror real business functions (operations, HR, finance, strategy). This makes the interface intuitive for LLM agents that have been trained on business concepts.

## Command Details

### Company Commands

#### `company status`

Returns current funds, payroll, runway, and prestige levels per domain.

**Design choice**: A single command gives the agent a complete financial and strategic snapshot. This reduces the number of API calls needed per decision cycle.

### Employee Commands

#### `employee list`

Returns all employees with tier, salary, and current active task count.

**Design choice**: Shows active task count but NOT skill rates. The agent must infer capabilities.

### Market Commands

#### `market browse [--domain X] [--min-prestige N] [--max-prestige N] [--offset O] [--limit L]`

Browse available market tasks with optional filters.

**Design choice**: Filtering and pagination prevent information overload. The agent can focus on tasks matching its current prestige level and strategic goals.

### Task Commands

#### `task accept <task_id>`

Accept a market task. Validates prestige requirements. Sets deadline.

#### `task assign <task_id> <employee_id>`

Assign an employee to a planned/active task. Recalculates ETAs.

#### `task dispatch <task_id>`

Start work on a planned task. Changes status to active.

#### `task cancel <task_id>`

Cancel a task. Applies prestige penalty. Frees employees.

#### `task inspect <task_id>`

Detailed view of a single task: requirements, progress, assignments, deadline.

#### `task list [--status X]`

List company tasks with optional status filter.

**Design choice**: The accept → assign → dispatch flow gives the agent explicit control over each phase. This mirrors real project management, where you scope, staff, and then kick off work.

### Simulation Commands

#### `sim resume`

Advance the simulation to the next event. Returns wake events.

```json
{
  "advanced_to": "2025-02-15T09:00:00",
  "wake_events": [
    {"type": "task_completed", "task_id": "...", "success": true},
    {"type": "payroll", "amount": -3000000}
  ]
}
```

**Design choice**: Resume is the only way to advance time. The agent explicitly chooses when to move forward, creating natural decision checkpoints.

### Finance Commands

#### `finance ledger [--category X] [--from DATE] [--to DATE] [--offset O] [--limit L]`

Query the immutable transaction history.

**Design choice**: Full ledger access lets sophisticated agents analyze spending patterns and project future cash flow.

### Report Commands

#### `report monthly`

Aggregated P&L by month.

**Design choice**: Monthly reports provide a higher-level financial view than raw ledger entries, useful for strategic planning.

### Scratchpad Commands
|
||||||
|
|
||||||
|
#### `scratchpad read`
|
||||||
|
Read persistent notes.
|
||||||
|
|
||||||
|
#### `scratchpad write <content>`
|
||||||
|
Overwrite scratchpad contents.
|
||||||
|
|
||||||
|
#### `scratchpad append <content>`
|
||||||
|
Add to existing scratchpad.
|
||||||
|
|
||||||
|
#### `scratchpad clear`
|
||||||
|
Clear scratchpad.
|
||||||
|
|
||||||
|
**Design choice**: The scratchpad is critical for long simulations where LLM context gets truncated. The agent can store:
|
||||||
|
- Employee capability observations
|
||||||
|
- Strategic plans
|
||||||
|
- Financial projections
|
||||||
|
- Task priority lists
|
||||||
|
|
||||||
|
This compensates for context window limitations and tests whether the agent proactively maintains external memory.

## Error Handling

All commands return structured errors:

```json
{
  "error": "Insufficient prestige in research (have 2.3, need 4.0)"
}
```

**Design choice**: Descriptive error messages help the agent understand what went wrong and adjust its strategy, rather than failing silently or with cryptic messages.

## CLI Entry Point (`__main__.py`)

The CLI uses a command-line parser (likely Click or argparse) to route commands to handler functions. Each handler:

1. Opens a database session
2. Validates inputs
3. Performs the operation
4. Returns JSON output
5. Commits or rolls back the transaction

**Design choice**: Each CLI call is a self-contained transaction. This prevents partial state updates and ensures the simulation remains consistent.
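
The five steps above can be sketched as a single handler shape. The session object here is a stand-in, not the project's actual SQLAlchemy session, but any commit/rollback API fits the same pattern:

```python
import json

class FakeSession:
    """Stand-in for a DB session with commit/rollback semantics."""
    def __init__(self):
        self.committed = False
        self.rolled_back = False
    def commit(self):
        self.committed = True
    def rollback(self):
        self.rolled_back = True

def handle_command(session, amount_cents: int) -> str:
    """Validate, operate, return JSON; roll back on any error."""
    try:
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        # ... perform the operation against the session here ...
        session.commit()
        return json.dumps({"ok": True, "amount_cents": amount_cents})
    except ValueError as exc:
        session.rollback()
        return json.dumps({"error": str(exc)})

ok = json.loads(handle_command(FakeSession(), 500))
bad = json.loads(handle_command(FakeSession(), -1))
```

Success and failure both come back as JSON, so the agent never has to parse tracebacks; a failed command leaves the database untouched.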

203
system_design/09_configuration.md
Normal file

@ -0,0 +1,203 @@

# Configuration System

**Location**: `src/yc_bench/config/`

## Overview

The configuration system uses Pydantic models validated from TOML preset files. It controls every aspect of the simulation: world generation parameters, difficulty tuning, agent behavior, and distribution specifications.

## Design Choices

### Pydantic Schema (`schema.py`)

The configuration hierarchy:

```
ExperimentConfig
├── AgentConfig          # LLM model, tools, retry settings
├── LoopConfig           # Turn budget, auto-resume threshold
├── SimConfig            # Simulation parameters
└── WorldConfig          # World generation parameters
    ├── CompanyConfig    # Initial funds, starting prestige
    ├── EmployeeConfig   # Team size, tier distribution, salary ranges
    ├── TaskConfig       # Task count, domain requirements, deadlines
    └── PrestigeConfig   # Decay rate, penalty multipliers, scaling
```

**Why Pydantic?**

- Type validation at load time (catch config errors early)
- Default values with optional overrides
- Discriminated unions for distribution specs
- Clear documentation through type annotations
- Serialization to/from TOML/JSON

### TOML Preset Files (`presets/`)

```toml
# medium.toml
[world]
initial_funds_cents = 500_000_00

[world.prestige]
decay_per_day = 0.005
penalty_fail_multiplier = 0.8
penalty_cancel_multiplier = 1.0

[world.tasks]
count = 200
deadline_qty_per_day = 11.0

[world.tasks.reward_funds]
type = "triangular"
min = 5000_00
mode = 15000_00
max = 50000_00
```

**Why TOML?** Human-readable, supports comments, natural hierarchy via sections, widely supported in Python. Better than JSON for config files (comments), simpler than YAML (fewer gotchas).

### Preset Hierarchy

| Preset | Focus | Key Characteristics |
|--------|-------|---------------------|
| `default.toml` | Base | All defaults; other presets override selectively |
| `tutorial.toml` | Learning | Relaxed deadlines, prestige-1 tasks only, high funds |
| `easy.toml` | Casual | Relaxed deadlines, flat prestige requirements |
| `medium.toml` | Standard | Prestige climbing, 2-domain tasks, 9-day deadlines |
| `hard.toml` | Challenge | Prestige gating active, 7-day deadlines, 1.5x cancel penalty |
| `nightmare.toml` | Extreme | Razor-thin margins, 6-day deadlines, 2x penalties |

**Design choice**: Preset-based difficulty rather than a single "difficulty slider" allows fine-grained control. Each preset can tune dozens of independent parameters.

### Config Loading (`loader.py`)

```python
def load_config(preset_name: str) -> ExperimentConfig:
    base = load_toml("default.toml")
    overlay = load_toml(f"{preset_name}.toml")
    merged = deep_merge(base, overlay)
    return ExperimentConfig(**merged)
```

**Design choice**: Config inheritance via deep merge. Presets only specify what differs from default, keeping preset files concise and maintainable.
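
The merge step can be sketched over the nested dicts that TOML parses into. This `deep_merge` is an illustration, not the project's actual implementation:

```python
def deep_merge(base: dict, overlay: dict) -> dict:
    """Recursively overlay `overlay` onto `base`, preferring overlay values."""
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# A preset only states what it changes; sibling keys survive the merge.
default = {"world": {"tasks": {"count": 200, "deadline_qty_per_day": 13.0}}}
hard = {"world": {"tasks": {"deadline_qty_per_day": 7.0}}}
merged = deep_merge(default, hard)
# merged keeps count=200 from default but takes the harder deadline
```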

## Distribution Specifications (`sampling.py`)

### The DistSpec System

Many world generation parameters use statistical distributions rather than fixed values:

```python
class DistSpec(BaseModel):
    """Discriminated union of distribution types."""
    type: Literal["triangular", "beta", "normal", "uniform", "constant"]
    # Parameters vary by type
```

**Supported distributions:**

| Type | Parameters | Use Case |
|------|------------|----------|
| `triangular` | min, mode, max | Task rewards, skill rates (natural asymmetric bell curve) |
| `beta` | alpha, beta, scale | Prestige requirements (skewed toward low values) |
| `normal` | mean, std | Symmetric variation around a target |
| `uniform` | low, high | Equal probability across range |
| `constant` | value | Fixed value (no randomness) |

**Why discriminated unions?** Pydantic validates the correct parameters for each distribution type at load time. Invalid combinations (e.g., triangular with alpha parameter) are caught before the simulation runs.
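
A sketch of how a validated spec might be turned into a draw, using the standard-library `random` module for illustration (the real engine samples through a seeded NumPy generator):

```python
import random

def sample(spec: dict, rng: random.Random) -> float:
    """Draw one value according to a distribution spec dict."""
    kind = spec["type"]
    if kind == "triangular":
        # random.triangular takes (low, high, mode)
        return rng.triangular(spec["min"], spec["max"], spec["mode"])
    if kind == "beta":
        return rng.betavariate(spec["alpha"], spec["beta"]) * spec["scale"]
    if kind == "normal":
        return rng.gauss(spec["mean"], spec["std"])
    if kind == "uniform":
        return rng.uniform(spec["low"], spec["high"])
    if kind == "constant":
        return spec["value"]
    raise ValueError(f"unknown distribution type: {kind}")

rng = random.Random(42)
reward = sample({"type": "triangular", "min": 5000_00, "mode": 15000_00, "max": 50000_00}, rng)
fixed = sample({"type": "constant", "value": 3.0}, rng)
```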

### Usage Example

```toml
[world.tasks.reward_funds]
type = "triangular"
min = 5000_00
mode = 15000_00
max = 50000_00

[world.employees.junior_rate]
type = "beta"
alpha = 2.0
beta = 5.0
scale = 3.0
```

## World Generation

### Seeding (`services/seed_world.py`)

```python
def seed_world_transactional(session, cfg, seed):
    rng = create_rng(seed)
    company = create_company(session, cfg.world.company)
    employees = generate_employees(session, company, cfg.world.employees, rng)
    tasks = generate_tasks(session, cfg.world.tasks, rng)
    sim_state = create_sim_state(session, company, cfg.sim, seed)
```

**Design choice**: Single-transaction world seeding ensures atomic creation. Either the entire world is created or nothing is -- no partial states.

### Employee Generation (`services/generate_employees.py`)

1. Generate N employees (default 10)
2. Assign tiers from configured distribution (e.g., 30/40/30 junior/mid/senior)
3. For each employee, sample 4 skill rates from per-tier distributions
4. Set salary based on tier range
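
The four steps above can be sketched as follows; the tier weights, rate ranges, and salary bands here are illustrative stand-ins for the preset's actual distributions:

```python
import random

DOMAINS = ["research", "inference", "data_environment", "training"]
TIERS = ["junior", "mid", "senior"]
TIER_WEIGHTS = [0.3, 0.4, 0.3]  # hypothetical 30/40/30 split
RATE_RANGE = {"junior": (0.5, 1.5), "mid": (1.0, 2.5), "senior": (2.0, 4.0)}
SALARY_CENTS = {"junior": (8000_00, 12000_00), "mid": (12000_00, 18000_00), "senior": (18000_00, 28000_00)}

def generate_employees(n: int, rng: random.Random) -> list[dict]:
    employees = []
    for i in range(n):
        tier = rng.choices(TIERS, weights=TIER_WEIGHTS)[0]
        employees.append({
            "id": f"emp{i}",
            "tier": tier,
            # Hidden per-domain skill rates the agent must infer from outcomes
            "rates": {d: rng.uniform(*RATE_RANGE[tier]) for d in DOMAINS},
            "salary_cents": rng.randint(*SALARY_CENTS[tier]),
        })
    return employees

team = generate_employees(10, random.Random(1))
```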

### Task Generation (`services/generate_tasks.py`)

1. Generate M tasks (default 200+)
2. First 10 tasks are always prestige-1 (guaranteed accessible)
3. Remaining tasks have stratified prestige requirements
4. Each task gets 2-4 domain requirements sampled from distributions
5. Rewards scale with prestige and task size

**Design choice**: Stratified generation ensures:

- The agent always has starting tasks (prestige-1 guaranteed)
- Tasks span the full prestige range (progression is possible)
- No prestige "dead zones" where no tasks exist
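
A sketch of the stratification idea: pin the first 10 tasks to prestige 1, then assign the rest round-robin across bands so every prestige range stays populated. The band edges are illustrative, not the benchmark's real values:

```python
import random

def assign_prestige_requirements(m: int, rng: random.Random) -> list[float]:
    bands = [(1.0, 2.5), (2.5, 4.5), (4.5, 7.0), (7.0, 10.0)]
    reqs = [1.0] * min(m, 10)  # guaranteed-accessible starter tasks
    for i in range(len(reqs), m):
        low, high = bands[i % len(bands)]  # round-robin keeps every band populated
        reqs.append(round(rng.uniform(low, high), 1))
    return reqs

reqs = assign_prestige_requirements(200, random.Random(7))
```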

### RNG Management (`services/rng.py`)

```python
def create_rng(seed: int) -> numpy.random.Generator:
    return numpy.random.default_rng(seed)
```

**Design choice**: Centralized RNG with explicit seed ensures full reproducibility. Same seed → same world → same event sequence (given same agent actions).
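
The reproducibility property can be demonstrated directly: two generators built from the same seed produce identical draws.

```python
import numpy as np

a = np.random.default_rng(42)
b = np.random.default_rng(42)

draws_a = a.uniform(0, 1, size=5)
draws_b = b.uniform(0, 1, size=5)
same = np.array_equal(draws_a, draws_b)  # same seed, same stream
```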

## Key Configuration Parameters

### Financial Tuning

| Parameter | Default | Effect |
|-----------|---------|--------|
| `initial_funds_cents` | 500,000 | Starting capital |
| `reward_prestige_scale` | 0.15 | How much prestige amplifies rewards |
| `salary_bump_pct` | 1.0 | Per-completion salary increase |

### Prestige Tuning

| Parameter | Default | Effect |
|-----------|---------|--------|
| `prestige_decay_per_day` | 0.005 | Daily prestige loss |
| `penalty_fail_multiplier` | 0.8 | Prestige cost of late completion |
| `penalty_cancel_multiplier` | 1.0 | Prestige cost of cancellation |
| `prestige_min` | 1.0 | Floor value |
| `prestige_max` | 10.0 | Ceiling value |

### Task Tuning

| Parameter | Default | Effect |
|-----------|---------|--------|
| `deadline_qty_per_day` | 11.0 | Deadline generosity |
| `num_domains_per_task` | 2-4 | Multi-domain complexity |
| `progress_milestone_pct` | 50 | When to fire halfway event |

### Agent Tuning

| Parameter | Default | Effect |
|-----------|---------|--------|
| `max_turns` | 500 | Hard turn limit |
| `max_turns_without_resume` | 5 | Auto-resume threshold |
| `history_truncation` | 50 | Turns kept in context |

232
system_design/10_runner_orchestration.md
Normal file

@ -0,0 +1,232 @@

# Runner & Orchestration

**Location**: `src/yc_bench/runner/`

## Overview

The runner is the top-level orchestration layer that ties everything together: parsing arguments, loading configuration, initializing the database, seeding the world, starting the agent loop, and collecting results.

## Components

### Entry Point (`main.py`)

```python
def run_benchmark(args):
    # 1. Load configuration
    cfg = load_config(args.config)

    # 2. Initialize database
    engine, factory = init_db(db_path)

    # 3. Seed world
    with session_scope(factory) as session:
        seed_world_transactional(session, cfg, args.seed)

    # 4. Build agent runtime
    runtime = build_runtime(cfg.agent, args.model)

    # 5. Start dashboard (if TTY)
    dashboard = Dashboard(cfg) if is_tty() else None

    # 6. Run agent loop
    result = run_agent_loop(runtime, factory, cfg, dashboard)

    # 7. Save results
    save_rollout(result, args.output)
```

### Design Choices

#### Single-Command Invocation

```bash
uv run yc-bench run --model gemini/gemini-3-flash --seed 1 --config medium
```

**Why single command?** Benchmarks should be easy to reproduce. One command with explicit parameters (model, seed, config) fully specifies a run.

#### Database Per Run

Each run creates a fresh SQLite database:

```
db/run_seed1_medium_2025-03-15.sqlite
```

**Why per-run databases?**

- Isolation: runs can't interfere with each other
- Inspection: can analyze any run's final state after the fact
- Reproducibility: re-running with same seed produces identical database
- Parallelism: multiple runs can execute simultaneously
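
A path like the one above can be derived from the run parameters; the exact naming scheme here is an assumption:

```python
from datetime import date
from pathlib import Path

def run_db_path(seed: int, config: str, run_date: date) -> Path:
    """Build a per-run SQLite path from the parameters that identify a run."""
    return Path("db") / f"run_seed{seed}_{config}_{run_date.isoformat()}.sqlite"

path = run_db_path(1, "medium", date(2025, 3, 15))
# → db/run_seed1_medium_2025-03-15.sqlite
```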

## Argument Parsing (`args.py`)

### Key Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| `--model` | Yes | LLM model identifier (LiteLLM format) |
| `--seed` | Yes | Random seed for world generation |
| `--config` | No | Difficulty preset (default: "medium") |
| `--output` | No | Output path for rollout JSON |
| `--no-dashboard` | No | Disable live terminal UI |
| `--max-turns` | No | Override turn limit |

**Design choice**: Required arguments are minimal (model + seed). Everything else has sensible defaults. This reduces the barrier to running benchmarks while allowing full customization.
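
An argparse sketch matching the table above (the project may use Click instead; this is one plausible shape):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="yc-bench")
    parser.add_argument("--model", required=True, help="LLM model identifier (LiteLLM format)")
    parser.add_argument("--seed", required=True, type=int, help="Random seed for world generation")
    parser.add_argument("--config", default="medium", help="Difficulty preset")
    parser.add_argument("--output", default=None, help="Output path for rollout JSON")
    parser.add_argument("--no-dashboard", action="store_true", help="Disable live terminal UI")
    parser.add_argument("--max-turns", type=int, default=None, help="Override turn limit")
    return parser

# Only model and seed are required; everything else falls back to defaults.
args = build_parser().parse_args(["--model", "openai/gpt-4o", "--seed", "42"])
```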

## Dashboard (`dashboard.py`)

### Live Terminal UI

The dashboard uses [Rich](https://github.com/Textualize/rich) to display real-time simulation state:

```
┌─ YC-Bench Dashboard ─────────────────────────────┐
│ Model: claude-sonnet-4  Seed: 42  Config: medium │
│ Turn: 87/500             Sim Time: 2025-06-15    │
├──────────────────────────────────────────────────┤
│ Funds: $125,340          Runway: 4.2 months      │
│ Prestige: R:5.2  I:3.8  D:2.1  T:6.4             │
│ Active Tasks: 3   Completed: 12   Failed: 1      │
├──────────────────────────────────────────────────┤
│ Last Action: task assign abc123 emp456           │
│ Last Event: task_completed (success)             │
└──────────────────────────────────────────────────┘
```

**Design choice**: The dashboard is for human observers, not the agent. It provides real-time visibility into benchmark runs without affecting agent behavior.

### Features

- Live fund tracking with trend indicators
- Prestige levels per domain
- Task status counters
- Recent agent actions
- Turn counter and simulation clock
- Auto-refreshes on each turn

### Conditional Activation

The dashboard only activates when running in a TTY (interactive terminal). Redirected output or CI environments get plain log output.

**Why conditional?** Batch runs (`scripts/`) shouldn't have terminal UI overhead. Detecting TTY ensures the right output mode automatically.
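
The TTY gate itself is a one-liner; a sketch of the check described above:

```python
import sys

def should_show_dashboard(no_dashboard_flag: bool = False) -> bool:
    """Dashboard only when attached to an interactive terminal."""
    return sys.stdout.isatty() and not no_dashboard_flag

mode = "dashboard" if should_show_dashboard() else "plain logs"
```

Under a pipe or in CI, `sys.stdout.isatty()` returns `False` and the run falls back to plain logging with no code changes.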

## Session Management (`session.py`)

### Run Session

Manages the lifecycle of a single benchmark run:

```python
class RunSession:
    db_path: str
    config: ExperimentConfig
    model: str
    seed: int
    start_time: datetime

    def save_rollout(self, result):
        """Save final rollout JSON to results/"""

    def cleanup(self):
        """Clean up temporary resources"""
```

**Design choice**: Session object encapsulates all run-specific state, making it easy to serialize and manage runs.

## Batch Running (`scripts/`)

### Multi-Seed Runs

Scripts for running the same model across multiple seeds:

```bash
# Run seeds 1-10 with claude-sonnet on medium difficulty
for seed in $(seq 1 10); do
  uv run yc-bench run --model anthropic/claude-sonnet-4-20250514 --seed $seed --config medium
done
```

### Multi-Model Comparison

Scripts for comparing models on the same seeds:

```bash
for model in "anthropic/claude-sonnet-4-20250514" "openai/gpt-4o" "google/gemini-pro"; do
  uv run yc-bench run --model $model --seed 42 --config medium
done
```

**Design choice**: Simple shell scripts rather than a complex orchestration framework. This keeps the benchmark tooling minimal and transparent.

## Results & Output

### Rollout JSON

Each run produces a rollout file:

```
results/
├── claude-sonnet_seed1_medium.json
├── claude-sonnet_seed2_medium.json
├── gpt-4o_seed1_medium.json
└── ...
```

### Rollout Contents

```json
{
  "metadata": {
    "model": "anthropic/claude-sonnet-4-20250514",
    "seed": 1,
    "config": "medium",
    "start_time": "2025-03-15T10:00:00",
    "end_time": "2025-03-15T10:45:00"
  },
  "outcome": "horizon_end",
  "final_state": {
    "funds_cents": 25000000,
    "prestige": {"research": 7.2, "inference": 5.1, ...},
    "tasks_completed": 24,
    "tasks_failed": 3,
    "tasks_cancelled": 1,
    "turns_used": 187
  },
  "transcript": [
    {"turn": 1, "action": "company status", "result": {...}},
    ...
  ]
}
```

### Plots (`plots/`)

Visualization scripts for comparing model performance:

- Funds over time
- Prestige progression per domain
- Task completion rates
- Comparison charts across models/seeds

**Design choice**: Separate plotting from the benchmark runner. Results are stored as data (JSON); visualization is a post-processing step.

## Error Recovery

### Crash Recovery

If a run crashes (LLM timeout, OOM, etc.):

- The SQLite database persists with the last consistent state
- Rollout JSON may be partial but includes the transcript up to the crash
- Re-running with the same seed starts fresh (no resume from crash)

**Design choice**: No crash recovery by design. Benchmark runs should be atomic -- either complete or re-run. This prevents partial results from contaminating comparisons.

### Graceful Shutdown

On SIGINT (Ctrl+C):

- Current turn completes
- Partial rollout is saved
- Database is committed
- Dashboard is cleaned up

**Design choice**: Graceful shutdown preserves whatever data exists, useful for debugging long runs that need to be interrupted.
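
The shutdown sequence above follows a common pattern: a SIGINT handler flips a flag, and the run loop finishes the current turn before saving and exiting. The handler and loop names here are illustrative:

```python
import signal

shutdown_requested = False

def request_shutdown(signum, frame):
    """SIGINT handler: don't interrupt mid-turn, just ask the loop to stop."""
    global shutdown_requested
    shutdown_requested = True

signal.signal(signal.SIGINT, request_shutdown)

def run_loop(max_turns: int) -> int:
    turns = 0
    for _ in range(max_turns):
        if shutdown_requested:
            break  # save partial rollout, commit DB, tear down dashboard
        turns += 1  # ... execute one agent turn here ...
    return turns
```

Because the flag is only checked at turn boundaries, a Ctrl+C never leaves a half-applied turn in the database.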