collinear-ai/yc-bench

Fork 0

mirror of https://github.com/collinear-ai/yc-bench.git synced 2026-04-19 12:58:03 +00:00

Muyu He ef7c64b5cb Updated design mds

2026-03-19 18:39:57 -07:00

5.7 KiB

Raw Blame History

CLI Interface

Location: src/yc_bench/cli/

Overview

The CLI is the agent's sole interface to the simulation. Every command returns structured JSON, enabling reliable parsing by LLMs.

Design Choices

JSON-Only Output

All CLI commands return JSON, never free-text:

$ yc-bench company status
{
  "company_name": "Nexus AI",
  "funds": "$150,000.00",
  "funds_cents": 15000000,
  "monthly_payroll": "$30,000.00",
  "runway_months": 5.0,
  "prestige": {
    "research": 3.5,
    "inference": 2.1,
    "data_environment": 1.0,
    "training": 4.2
  }
}

Why JSON?

Unambiguous parsing by LLMs (vs. formatted tables)
Consistent structure across all commands
Easy to pipe into python_repl for analysis
Machine-readable without regex or text parsing

Command Group Organization

Group	File	Purpose
`company`	`company_commands.py`	Company status, prestige overview
`employee`	`employee_commands.py`	Employee listing and details
`market`	`market_commands.py`	Browse available tasks
`task`	`task_commands.py`	Task lifecycle (accept/assign/dispatch/cancel/inspect/list)
`sim`	`sim_commands.py`	Simulation control (resume)
`finance`	`finance_commands.py`	Ledger queries
`report`	`report_commands.py`	Monthly P&L reports
`scratchpad`	`scratchpad_commands.py`	Persistent agent memory

Design choice: Command groups mirror real business functions (operations, HR, finance, strategy). This makes the interface intuitive for LLM agents that have been trained on business concepts.

Command Details

Company Commands

`company status`

Returns current funds, payroll, runway, and prestige levels per domain.

Design choice: Single command gives the agent a complete financial and strategic snapshot. Reduces the number of API calls needed per decision cycle.

Employee Commands

`employee list`

Returns all employees with tier, salary, and current active task count.

Design choice: Shows active task count but NOT skill rates. The agent must infer capabilities.

Market Commands

`market browse [--domain X] [--reward-min-cents N] [--offset O] [--limit L]`

Browse available market tasks with optional filters. Results are capped at market_browse_default_limit (default 50) per page.

The browse auto-filters by prestige and trust: only tasks the company can actually accept are shown. This means:

Per-domain prestige check: all required domains must meet the task's required_prestige
Trust check: company must have sufficient trust with the task's client

Design choice: Auto-filtering prevents the agent from wasting turns trying to accept inaccessible tasks. Pagination (--offset) allows browsing beyond the first page.

Task Commands

`task accept <task_id>`

Accept a market task. Validates prestige requirements. Sets deadline.

`task assign <task_id> <employee_id>`

Assign an employee to a planned/active task. Recalculates ETAs.

`task dispatch <task_id>`

Start work on a planned task. Changes status to active.

`task cancel <task_id>`

Cancel a task. Applies prestige penalty. Frees employees.

`task inspect <task_id>`

Detailed view of a single task: requirements, progress, assignments, deadline.

`task list [--status X]`

List company tasks with optional status filter.

Design choice: The accept → assign → dispatch flow gives the agent explicit control over each phase. This mirrors real project management where you scope, staff, and then kick off work.

Simulation Commands

`sim resume`

Advance simulation to the next event. Returns wake events.

{
  "advanced_to": "2025-02-15T09:00:00",
  "wake_events": [
    {"type": "task_completed", "task_id": "...", "success": true},
    {"type": "payroll", "amount": -3000000}
  ]
}

Design choice: Resume is the only way to advance time. The agent explicitly chooses when to move forward, creating natural decision checkpoints.

Finance Commands

`finance ledger [--category X] [--from DATE] [--to DATE] [--offset O] [--limit L]`

Query the immutable transaction history.

Design choice: Full ledger access lets sophisticated agents analyze spending patterns and project future cash flow.

Report Commands

`report monthly`

Aggregated P&L by month.

Design choice: Monthly reports provide a higher-level financial view than raw ledger entries, useful for strategic planning.

Scratchpad Commands

`scratchpad read`

Read persistent notes.

`scratchpad write <content>`

Overwrite scratchpad contents.

`scratchpad append <content>`

Add to existing scratchpad.

`scratchpad clear`

Clear scratchpad.

Design choice: The scratchpad is critical for long simulations where LLM context gets truncated. The agent can store:

Employee capability observations
Strategic plans
Financial projections
Task priority lists

This compensates for context window limitations and tests whether the agent proactively maintains external memory.

Error Handling

All commands return structured errors:

{
  "error": "Insufficient prestige in research (have 2.3, need 4.0)"
}

Design choice: Descriptive error messages help the agent understand what went wrong and adjust its strategy, rather than failing silently or with cryptic messages.

CLI Entry Point (`main.py`)

The CLI uses a command-line parser (likely Click or argparse) to route commands to handler functions. Each handler:

Opens a database session
Validates inputs
Performs the operation
Returns JSON output
Commits or rolls back the transaction

Design choice: Each CLI call is a self-contained transaction. This prevents partial state updates and ensures the simulation remains consistent.

5.7 KiB Raw Blame History

CLI Interface

Overview

Design Choices

JSON-Only Output

Command Group Organization

Command Details

Company Commands

company status

Employee Commands

employee list

Market Commands

market browse [--domain X] [--reward-min-cents N] [--offset O] [--limit L]

Task Commands

task accept <task_id>

task assign <task_id> <employee_id>

task dispatch <task_id>

task cancel <task_id>

task inspect <task_id>

task list [--status X]