yc-bench/README.md
Nazneen Rajani a9e3df8827
Revise citation for YC-Bench in README
Updated citation details in the README file.
2026-04-04 16:55:13 -07:00

127 lines
5.8 KiB
Markdown

# <img src="imgs/yc_bench.png" alt="YC-Bench logo" width="40" /> YC-Bench
[![Website](https://img.shields.io/badge/Website-YC--Bench-E8864A)](https://collinear-ai.github.io/yc-bench/)
[![Python 3.12+](https://img.shields.io/badge/Python-3.12%2B-3776AB?logo=python&logoColor=white)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
A long-horizon deterministic benchmark for LLM agents. The agent operates a simulated AI startup over a one-year horizon, starting with $200,000 in funds, interacting exclusively through a CLI against a SQLite-backed discrete-event simulation.
The benchmark tests whether agents can manage compounding decisions: task selection, employee allocation, client trust, cash flow, and adversarial client detection — sustained over hundreds of turns.
<p align="center">
<img src="docs/static/images/system_architecture.png" alt="YC-Bench System Architecture" width="800" />
</p>
## How it works
### Core loop
1. Agent calls `sim resume` to advance the clock to the next event (task checkpoint, payroll, or horizon end).
2. The engine processes task progress, fires due events, and deducts monthly payroll.
3. Agent receives a status summary with events since the last turn, then issues observe and act commands.
4. Repeat until bankruptcy (funds < 0) or the one-year horizon ends.
Between time advances, the agent may issue arbitrarily many actions within a single turn. Work progresses only during business hours (weekdays), and payroll is deducted on the first business day of each month.
### Key mechanics
- **Tasks and domains**: The agent earns revenue by completing tasks from a marketplace. Each task belongs to one of four domains `training · inference · research · data engineering` and is issued by a client. Tasks have a reward, a deadline (activated on acceptance), and a work quantity employees must complete. Higher prestige unlocks higher-reward tasks and scales their payout. Failing a deadline incurs a 35% penalty of the reward and a prestige reduction.
- **Employees**: A fixed roster across 3 tiers (junior/mid/senior) with per-domain productivity levels queryable via `employee list`. Productivity distributions are spiky a senior may have high throughput in training but low in research. Successful completions grant a productivity boost in that domain but also a salary bump, so payroll grows monotonically.
- **Clients and trust**: Completing tasks for a client builds trust, which reduces future work requirements and unlocks higher-tier tasks. However, completing for one client slightly decays trust with all others.
- **Adversarial clients**: A subset of clients are adversarial after acceptance, they inflate work quantities, making deadlines nearly impossible. Adversarial status is hidden. These clients offer competitively high rewards, so the agent must infer adversarial behavior from repeated failures.
- **Memory**: Conversation history is truncated to the most recent 20 turns. The agent can write to a persistent scratchpad injected into the system prompt every turn its sole mechanism for retaining information across context truncation.
### Agent CLI
All commands return JSON. The agent interacts via `run_command("yc-bench <cmd>")`.
| Category | Command | Effect |
|----------|---------|--------|
| Observe | `company status` | Funds, prestige, payroll |
| Observe | `employee list` | Names, tiers, salaries, productivity |
| Observe | `market browse` | Available tasks with client, reward, domains |
| Observe | `task list` | Accepted tasks with status and progress |
| Observe | `task inspect --task-id T` | Per-domain progress, deadline, assignments |
| Observe | `client list` | Client trust levels and tiers |
| Observe | `client history` | Per-client success/failure counts |
| Observe | `finance ledger` | Full transaction history |
| Task | `task accept --task-id T` | Accept from market; starts deadline |
| Task | `task assign --task-id T --employees E` | Assign employees to task |
| Task | `task dispatch --task-id T` | Begin work on assigned task |
| Task | `task cancel --task-id T --reason R` | Abandon task; prestige penalty |
| Sim | `sim resume` | Advance clock to next event |
| Memory | `scratchpad write --content C` | Overwrite persistent notes |
| Memory | `scratchpad append --content C` | Append to persistent notes |
---
## Setup
### Prerequisites
- Python 3.12+
- [`uv`](https://github.com/astral-sh/uv)
### Install
```bash
git clone https://github.com/collinear-ai/yc-bench.git
cd yc-bench
uv sync
```
### API key
```bash
# .env (any LiteLLM-compatible provider)
ANTHROPIC_API_KEY="sk-ant-..." # for anthropic/claude-*
GEMINI_API_KEY="AIza..." # for gemini/gemini-*
OPENROUTER_API_KEY="sk-or-v1-..." # for openrouter/*
OPENAI_API_KEY="sk-..." # for openai/*
```
### Run
```bash
uv run yc-bench run \
--model gemini/gemini-3-flash-preview \
--seed 1 \
--config default
```
Outputs a SQLite DB in `db/` and a JSON rollout in `results/`.
### Run multiple models in parallel
```bash
bash scripts/run_benchmark.sh --seeds "1 2 3" --config default
```
---
## Configuration
Experiment presets live in `src/yc_bench/config/presets/` as TOML files. Pass the preset name via `--config`.
See `default.toml` for the full list of tunable parameters.
---
## Benchmark results
<p align="center">
<img src="docs/static/images/funds_averaged_main.png" alt="Average funds over time" width="700" />
</p>
---
Please cite our work if you find it useful!
```bibtex
@misc{collinear-ai2025ycbench,
author = {He, Muyu and Jain, Adit and Kumar, Anand and Tu, Vincent and Bakshi, Soumyadeep and Patro, Sachin and Rajani, Nazneen},
title = {{YC-Bench}: Benchmarking {AI} Agents for Long-Term Planning and Consistent Execution},
year = {2025},
howpublished = {\url{https://github.com/collinear-ai/yc-bench}},
}
```