mirror of https://github.com/collinear-ai/yc-bench.git synced 2026-04-19 12:58:03 +00:00

Muyu He eb18c5a90c Updated backend to calculate employee tier with spiky skill distribution; simplified domain count to 4

2026-03-05 18:12:48 -08:00

YC-Bench

A long-horizon deterministic benchmark for LLM agents. The agent plays CEO of an AI startup over a simulated 1–3 year run, operating exclusively through a CLI tool against a SQLite-backed discrete-event simulation.

The benchmark tests whether agents can manage compounding decisions: prestige specialisation, employee allocation, cash flow, and deadline risk — sustained over hundreds of turns.

Setup

Prerequisites

Python 3.12+
uv

Install

git clone <repo-url>
cd YC_Bench
uv sync

API key

# .env  (any LiteLLM-compatible provider)
ANTHROPIC_API_KEY="sk-ant-..."     # for anthropic/claude-*
GEMINI_API_KEY="AIza..."           # for gemini/gemini-*
OPENROUTER_API_KEY="sk-or-v1-..."  # for openrouter/*
OPENAI_API_KEY="sk-..."            # for openai/*

Run

uv run yc-bench run \
  --model gemini/gemini-3-flash-preview \
  --seed 1 \
  --config medium

Outputs a SQLite DB in db/ and a JSON rollout in results/.

Run multiple models in parallel

bash scripts/run_benchmark.sh --seed 1 --config challenge

How it works

Core loop

1. Agent calls yc-bench sim resume to advance time to the next event.
2. The engine flushes task progress, fires due events, applies payroll.
3. Agent reads wake events and decides: accept tasks, assign employees, dispatch, cancel.
4. Repeat until bankruptcy or horizon end.

The simulation ends on bankruptcy (funds < 0 after payroll), horizon end (1–3 years), or max turns (if configured). If the agent doesn't call sim resume for 10 consecutive turns, the loop forces one automatically.
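The stop conditions and the forced resume described above can be sketched roughly as follows. This is an illustration only, not the benchmark's actual code: `LoopState`, its field names, and the day-based clock are assumptions made for the example. Only the three stop conditions and the 10-turn threshold come from the text.

```python
from dataclasses import dataclass
from typing import Optional

# From the text: after 10 consecutive turns without `sim resume`, one is forced.
FORCE_RESUME_AFTER = 10


@dataclass
class LoopState:
    """Hypothetical snapshot of a run; field names are illustrative."""
    funds: float                  # funds after the latest payroll
    sim_day: int                  # assumed clock: simulated days elapsed
    horizon_days: int             # assumed run length in days
    turn: int                     # agent turns taken so far
    turns_since_resume: int       # consecutive turns without `sim resume`
    max_turns: Optional[int] = None  # only checked when configured


def stop_reason(state: LoopState) -> Optional[str]:
    """Return why the run should end, or None if it should continue."""
    if state.funds < 0:
        return "bankruptcy"
    if state.sim_day >= state.horizon_days:
        return "horizon_end"
    if state.max_turns is not None and state.turn >= state.max_turns:
        return "max_turns"
    return None


def must_force_resume(state: LoopState) -> bool:
    """True once the agent has gone 10 consecutive turns without resuming."""
    return state.turns_since_resume >= FORCE_RESUME_AFTER
```

Checking bankruptcy first is a choice made for this sketch; the text does not say which condition takes priority when several hold at once.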

Key mechanics

Funds: start at $250K. Monthly payroll is deducted automatically. Task rewards scale with prestige (base × (1 + 0.55 × (prestige − 1))).
4 domains: research · inference · data/environment · training. Each domain tracks prestige independently in [1.0, 10.0].
Prestige gating: tasks require a minimum prestige level. Most tasks need prestige 3–5, so the agent must climb from 1.0 by completing easier tasks first. The first 10 market tasks have stratified prestige requirements [1,1,1,1,2,2,2,3,3,4] to bootstrap progression.
Employees: 10 employees across 3 tiers (junior/mid/senior). The agent sees only each employee's tier and salary — not their per-domain skill rates. A junior can secretly be a superstar in one domain, so the agent must infer productivity from task progress observations.
Throughput splitting: an employee assigned to N active tasks has effective_rate = base_rate / N. Focus beats breadth.
Task success: on-time completion awards funds + prestige + skill boosts + 1% salary bump (compounding payroll pressure). Late completion penalises prestige (1.4×). Cancellation penalises harder (2.0×).
Progress checkpoints: the agent is woken at 25%, 50%, 75%, and 100% completion — providing data points to estimate employee productivity.
Scratchpad: persistent notes in the DB that survive context truncation (only the last 20 conversation rounds are kept).
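To make two of the formulas above concrete, here is a small sketch of reward scaling and throughput splitting. The function names, the work units, and the equal-sized-task comparison are assumptions for illustration. Only the expressions `base × (1 + 0.55 × (prestige − 1))` and `base_rate / N` come from the list above.

```python
def task_reward(base_reward: float, prestige: float) -> float:
    """Reward scaling from the text: base * (1 + 0.55 * (prestige - 1))."""
    if not 1.0 <= prestige <= 10.0:
        raise ValueError("prestige must lie in [1.0, 10.0]")
    return base_reward * (1 + 0.55 * (prestige - 1))


def effective_rate(base_rate: float, active_tasks: int) -> float:
    """Throughput splitting from the text: base_rate / N for N active tasks."""
    if active_tasks < 1:
        raise ValueError("employee must have at least one active task")
    return base_rate / active_tasks


def finish_times_split(work: float, base_rate: float, n_tasks: int) -> list:
    """One employee works on n equal tasks at once (illustrative scenario)."""
    rate = effective_rate(base_rate, n_tasks)
    return [work / rate] * n_tasks


def finish_times_focused(work: float, base_rate: float, n_tasks: int) -> list:
    """Same employee finishes the n equal tasks one after another."""
    return [work * (i + 1) / base_rate for i in range(n_tasks)]


# A prestige-5 task pays 3.2x its base reward under the stated formula.
print(task_reward(10_000, 5.0))
# Splitting: both tasks finish at t=20. Focusing: one at t=10, one at t=20.
print(finish_times_split(100, 10, 2), finish_times_focused(100, 10, 2))
```

In this simplified scenario the last task finishes at the same time either way, but focusing delivers the first task in half the time, which matters when each task has its own deadline.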

Agent CLI

All commands return JSON. The agent interacts via run_command("yc-bench <cmd>").

# Observe
yc-bench company status                          # funds, prestige, runway
yc-bench employee list                           # tier, salary, active tasks
yc-bench market browse [--domain X] [--limit N]  # available tasks
yc-bench task list [--status X]                  # your tasks
yc-bench task inspect --task-id UUID             # progress, deadline, assignments
yc-bench finance ledger                          # transaction history
yc-bench report monthly                          # P&L per month

# Act
yc-bench task accept --task-id UUID              # pull from market
yc-bench task assign --task-id UUID --employee-id UUID
yc-bench task dispatch --task-id UUID            # start work
yc-bench task cancel --task-id UUID --reason ""  # cancel (2× prestige penalty)
yc-bench sim resume                              # advance time
yc-bench scratchpad write/append/clear           # persistent memory
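For driving these commands from Python, a thin wrapper along the following lines may help. It is a sketch under assumptions: `shell_runner`, `yc_bench`, and `start_task` are hypothetical helper names, and nothing is assumed about the shape of the returned JSON beyond it being parseable. The subcommands and flags are the ones listed above, and the accept, assign, dispatch order follows the loop named in the tutorial preset.

```python
import json
import shlex
import subprocess
from typing import Any, Callable, Iterable


def shell_runner(command: str) -> str:
    """Run a command line and return its stdout (raises on non-zero exit)."""
    completed = subprocess.run(
        shlex.split(command), capture_output=True, text=True, check=True
    )
    return completed.stdout


def yc_bench(cmd: str, run: Callable[[str], str] = shell_runner) -> Any:
    """Invoke `yc-bench <cmd>` and parse the JSON it prints."""
    return json.loads(run(f"yc-bench {cmd}"))


def start_task(
    task_id: str,
    employee_ids: Iterable[str],
    run: Callable[[str], str] = shell_runner,
) -> Any:
    """Accept a market task, assign employees, then dispatch it."""
    yc_bench(f"task accept --task-id {task_id}", run)
    for employee_id in employee_ids:
        yc_bench(
            f"task assign --task-id {task_id} --employee-id {employee_id}", run
        )
    return yc_bench(f"task dispatch --task-id {task_id}", run)
```

Passing the runner as a parameter keeps the wrapper testable without a live simulation: a stub that records commands and returns `"{}"` is enough to check the call sequence.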

Configuration

Experiment presets live in src/yc_bench/config/presets/ as TOML files. Pass the preset name via --config.

Config	Employees	Tasks	What it tests
tutorial	3	50	Basic accept→assign→dispatch loop
easy	5	100	Throughput awareness
medium	5	150	Prestige climbing + domain specialization
hard	7	200	Precise ETA reasoning
nightmare	8	300	Sustained perfection under compounding payroll

See default.toml for the full list of tunable parameters.

Benchmark results

Sonnet 4.6 vs Gemini 3 Flash vs GPT-5.2 — 1-year horizon, 3 seeds per config

Survival rates

Config	Sonnet 4.6	Gemini 3 Flash	GPT-5.2
medium	3/3	3/3	3/3
hard	1/3	2/3	2/3
nightmare	1/3	3/3	2/3

Final funds (bankrupt = funds < 0)

Config	Seed	Sonnet 4.6	Gemini 3 Flash	GPT-5.2
medium	1	$9.1M	$9.5M	$1.8M
medium	2	$6.1M	$11.0M	$321K
medium	3	$107K	$15.8M	$28K
hard	1	bankrupt	bankrupt	bankrupt
hard	2	$63K	$412K	$15.7M
hard	3	bankrupt	$21.9M	$43.5M
nightmare	1	bankrupt	$2.1M	bankrupt
nightmare	2	$10.1M	$214K	$2.2M
nightmare	3	bankrupt	$805K	$23.6M

Overall: Gemini 8/9 · GPT-5.2 7/9 · Sonnet 5/9
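The overall tallies can be re-derived from the final-funds table. The snippet below transcribes that table (with `None` standing in for a bankrupt run) and counts survivors per model. The data structure and function name are illustrative and not part of the benchmark tooling.

```python
# Final funds in dollars per model and config, seeds 1-3 in order.
# None marks a bankrupt run. Values are transcribed from the table above.
FINAL_FUNDS = {
    "Sonnet 4.6": {
        "medium": [9_100_000, 6_100_000, 107_000],
        "hard": [None, 63_000, None],
        "nightmare": [None, 10_100_000, None],
    },
    "Gemini 3 Flash": {
        "medium": [9_500_000, 11_000_000, 15_800_000],
        "hard": [None, 412_000, 21_900_000],
        "nightmare": [2_100_000, 214_000, 805_000],
    },
    "GPT-5.2": {
        "medium": [1_800_000, 321_000, 28_000],
        "hard": [None, 15_700_000, 43_500_000],
        "nightmare": [None, 2_200_000, 23_600_000],
    },
}


def survival_counts(results: dict) -> dict:
    """Count non-bankrupt runs per model across all configs and seeds."""
    return {
        model: sum(
            funds is not None
            for seeds in by_config.values()
            for funds in seeds
        )
        for model, by_config in results.items()
    }


print(survival_counts(FINAL_FUNDS))
# {'Sonnet 4.6': 5, 'Gemini 3 Flash': 8, 'GPT-5.2': 7}
```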

Key findings

Gemini leads on consistency (8/9 survival). The only model to sweep all 3 nightmare seeds.
GPT-5.2 has the highest ceiling. Hard seed 3: $43.5M vs Gemini's $21.9M. When it survives, it tends to outperform by a wide margin.
Sonnet is high-variance. Nightmare seed 2: $10.1M (the best result of any model on that seed), but 4/9 bankruptcies overall.
Win rate predicts survival. Every run with >58% task win rate survived. Every run below 40% went bankrupt.

Prestige specialization

Please cite our work if you find it useful!

@misc{collinear-ai2025ycbench,
  author       = {{Collinear AI}},
  title        = {{YC-Bench}: Your Company Bench — A Long-Horizon Coherence Benchmark for {LLM} Agents},
  year         = {2025},
  howpublished = {\url{https://github.com/collinear-ai/yc-bench}},
  note         = {Accessed: 2026-02-25}
}
