YC-Bench [![Website](https://img.shields.io/badge/Website-YC--Bench-E8864A)](https://collinear-ai.github.io/yc-bench/) [![Python 3.12+](https://img.shields.io/badge/Python-3.12%2B-3776AB?logo=python&logoColor=white)](https://www.python.org/) [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT) A long-horizon deterministic benchmark for LLM agents. The agent operates a simulated AI startup over a one-year horizon, starting with $200,000 in funds, interacting exclusively through a CLI against a SQLite-backed discrete-event simulation. The benchmark tests whether agents can manage compounding decisions: task selection, employee allocation, client trust, cash flow, and adversarial client detection — sustained over hundreds of turns.

YC-Bench System Architecture

## How it works ### Core loop 1. Agent calls `sim resume` to advance the clock to the next event (task checkpoint, payroll, or horizon end). 2. The engine processes task progress, fires due events, and deducts monthly payroll. 3. Agent receives a status summary with events since the last turn, then issues observe and act commands. 4. Repeat until bankruptcy (funds < 0) or the one-year horizon ends. Between time advances, the agent may issue arbitrarily many actions within a single turn. Work progresses only during business hours (weekdays), and payroll is deducted on the first business day of each month. ### Key mechanics - **Tasks and domains**: The agent earns revenue by completing tasks from a marketplace. Each task belongs to one of four domains — `training · inference · research · data engineering` — and is issued by a client. Tasks have a reward, a deadline (activated on acceptance), and a work quantity employees must complete. Higher prestige unlocks higher-reward tasks and scales their payout. Failing a deadline incurs a 35% penalty of the reward and a prestige reduction. - **Employees**: A fixed roster across 3 tiers (junior/mid/senior) with per-domain productivity levels queryable via `employee list`. Productivity distributions are spiky — a senior may have high throughput in training but low in research. Successful completions grant a productivity boost in that domain but also a salary bump, so payroll grows monotonically. - **Clients and trust**: Completing tasks for a client builds trust, which reduces future work requirements and unlocks higher-tier tasks. However, completing for one client slightly decays trust with all others. - **Adversarial clients**: A subset of clients are adversarial — after acceptance, they inflate work quantities, making deadlines nearly impossible. Adversarial status is hidden. These clients offer competitively high rewards, so the agent must infer adversarial behavior from repeated failures. - **Memory**: Conversation history is truncated to the most recent 20 turns. The agent can write to a persistent scratchpad injected into the system prompt every turn — its sole mechanism for retaining information across context truncation. ### Agent CLI All commands return JSON. The agent interacts via `run_command("yc-bench ")`. | Category | Command | Effect | |----------|---------|--------| | Observe | `company status` | Funds, prestige, payroll | | Observe | `employee list` | Names, tiers, salaries, productivity | | Observe | `market browse` | Available tasks with client, reward, domains | | Observe | `task list` | Accepted tasks with status and progress | | Observe | `task inspect --task-id T` | Per-domain progress, deadline, assignments | | Observe | `client list` | Client trust levels and tiers | | Observe | `client history` | Per-client success/failure counts | | Observe | `finance ledger` | Full transaction history | | Task | `task accept --task-id T` | Accept from market; starts deadline | | Task | `task assign --task-id T --employees E` | Assign employees to task | | Task | `task dispatch --task-id T` | Begin work on assigned task | | Task | `task cancel --task-id T --reason R` | Abandon task; prestige penalty | | Sim | `sim resume` | Advance clock to next event | | Memory | `scratchpad write --content C` | Overwrite persistent notes | | Memory | `scratchpad append --content C` | Append to persistent notes | --- ## Setup ### Prerequisites - Python 3.12+ - [`uv`](https://github.com/astral-sh/uv) ### Install ```bash git clone https://github.com/collinear-ai/yc-bench.git cd yc-bench uv sync ``` ### API key ```bash # .env (any LiteLLM-compatible provider) ANTHROPIC_API_KEY="sk-ant-..." # for anthropic/claude-* GEMINI_API_KEY="AIza..." # for gemini/gemini-* OPENROUTER_API_KEY="sk-or-v1-..." # for openrouter/* OPENAI_API_KEY="sk-..." # for openai/* ``` ### Run ```bash uv run yc-bench run \ --model gemini/gemini-3-flash-preview \ --seed 1 \ --config default ``` Outputs a SQLite DB in `db/` and a JSON rollout in `results/`. ### Run multiple models in parallel ```bash bash scripts/run_benchmark.sh --seeds "1 2 3" --config default ``` --- ## Configuration Experiment presets live in `src/yc_bench/config/presets/` as TOML files. Pass the preset name via `--config`. See `default.toml` for the full list of tunable parameters. --- ## Benchmark results

Average funds over time

--- Please cite our work if you find it useful! ```bibtex @misc{collinear-ai2025ycbench, author = {He, Muyu and Jain, Adit and Kumar, Anand and Tu, Vincent and Bakshi, Soumyadeep and Patro, Sachin and Rajani, Nazneen}, title = {{YC-Bench}: Benchmarking {AI} Agents for Long-Term Planning and Consistent Execution}, year = {2025}, howpublished = {\url{https://github.com/collinear-ai/yc-bench}}, } ```