diff --git a/docs/index.html b/docs/index.html new file mode 100644 index 0000000..d02e07c --- /dev/null +++ b/docs/index.html @@ -0,0 +1,682 @@ + + + + + + YC-Bench: A Long-Horizon Agent Benchmark + + + + + + + + + + +
+
+

YC-Bench logoYC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

+

Muyu He, Adit Jain, Anand Kumar, Vincent Tu, Soumyadeep Bakshi, Sachin Patro, Nazneen Rajani

+

Collinear AI

+ +
+ YC-Bench System Architecture +
+
+
+ +
+
+

Abstract

+

+ As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound. We introduce YC-Bench, a benchmark that evaluates these capabilities by tasking an agent with running a simulated startup over a one-year horizon spanning hundreds of turns. The agent must manage employees, select task contracts, and maintain profitability in a partially observable environment where adversarial clients and a growing payroll create compounding consequences for poor decisions. We evaluate 12 models, both proprietary and open-source, across 3 seeds each. Only three models consistently surpass the starting capital of $200K, with Claude Opus 4.6 achieving the highest average final funds at $1.27M, followed by GLM-5 at $1.21M at 11× lower inference cost. Scratchpad usage, the sole mechanism for persisting information across context truncation, is the strongest predictor of success, and failure to detect adversarial clients is the primary failure mode, accounting for 47% of bankruptcies. Our analysis reveals that frontier models still fail through distinct failure modes, such as over-parallelization, demonstrating persistent capability gaps in long-horizon performance. YC-Bench is open-source, reproducible, and configurable.

+
+
+ +
+
+

Leaderboard

+

Average net worth across 3 seeds. All models start with $200K.

+ +
+ + +
+ + +
+
+ + + + + + + + + + + + + + + + + + + + + + + + +
Rank Model Net Worth Bankrupt
1
Claude Opus 4.6Anthropic
$1.27M0/3
2
GLM-5Zhipu AI
$1.21M0/3
3
GPT-5.4OpenAI
$1.00M0/3
4
Kimi-K2.5Moonshot AI
$409K1/3
5
Gemini 3 FlashGoogle
$394K0/3
6
Gemini 3.1 Flash LiteGoogle
$203K1/3
7
GPT-5.4 MiniOpenAI
$138K1/3
8
Claude Sonnet 4.6Anthropic
$104K2/3
9
Qwen 3.5-397BAlibaba
$91K1/3
10
Gemini 3.1 ProGoogle
$66K1/3
11
GPT-5.4 NanoOpenAI
$39K1/3
12
Grok 4.20 BetaxAI
$25K2/3
-
Greedy BotBaseline
$03/3
+
+
+ + +
+
+ +
+
+
+
+ + + +
+
+

Key Findings

+ + +
+

Only a few models build client trust; most choose clients indiscriminately

+

+ Tasks that require trust come with higher rewards and smaller workloads, yet most models maintain minimal trust (level 1–2) with all clients instead of specializing. Only 4 out of 10 models across 6 out of 30 runs explicitly maintain a whitelist of preferred clients in their scratchpad. The rest distribute tasks indiscriminately, barring themselves from the highest-return tasks. +

+
+
+ Trust task ratio +

Proportion of completed tasks requiring client trust.

+
+
+ Trust levels per client +

Final trust level per client averaged across seeds (ADV = adversarial).

+
+
+
+ + +
+

Identifying adversarial clients remains a challenge for all but a few models

+

+ Half of all models accept adversarial tasks at a rate higher than their natural market share (~32%), showing indifference or misjudgment. Two-thirds of all runs make no mention of blacklisting any adversarial client. However, the top three models accept adversarial tasks at 1/4 the rate of the next-best model: they correctly spot the work-quantity inflation and write explicit avoidance guidelines to their scratchpads.

+
+
+ Adversarial task ratio +

Ratio of adversarial tasks among all accepted tasks. Dashed line = natural market share (~32%).

+
+
+ Client selection policy +

Client selection policy observed in agent scratchpads per seed.

+
+
+
+ + +
+

Suboptimal employee assignment is the second-largest failure mode; cost efficiency varies dramatically

+

+ Beyond adversarial clients, 7 out of 11 models lose substantial funds by assigning employees whose productivity cannot meet deadlines, or by spreading employees across too many concurrent tasks. Models have perfect information about employee skills and task requirements, so these failures stem from poor estimation, not missing data. On cost efficiency, Kimi-K2.5 earns 2.5× more in-game revenue per API dollar than the next-best model, while GLM-5 is 11× more cost-efficient than top-ranked Opus despite near-identical performance.

+
+
+ Failure modes +

Failure mode breakdown: adversarial, wrong staffing, and over-split.

+
+
+ Cost efficiency +

Cost efficiency: in-game revenue per dollar of API cost.

+
+
+
+ + +
+

Four failure profiles reveal a spectrum of long-horizon incoherence

+

+ Opus rewrites its scratchpad ~34 times per run but occasionally violates its own blacklist. Flash executes a rigid 4-command loop every turn with zero adaptation, surviving through sheer throughput. Sonnet exhibits a reasoning–execution gap: it derives correct rules then immediately ignores them, averaging 7.2 concurrent tasks while its scratchpad says "one task at a time." Grok shows aware inaction: its scratchpad accurately diagnoses critical issues but it takes no corrective action, going bankrupt with just 6 days of runway after accepting a 0%-success-rate client. +

+
+ Error analysis grid +

Representative failure moments for four models: scratchpad state, agent action, and outcome.

+
+
+ + +
+

Long-horizon coherence is a pipeline, and models fail at different stages

+

+ Flash fails from the absence of reflection. Grok fails despite accurate reflection, unable to close the loop between diagnosis and action. Sonnet fails from temporally inconsistent reflection: rules written and immediately abandoned. Only Opus achieves sustained, self-correcting reflection. This suggests long-horizon coherence is not a single capability but a pipeline: perceive → record → retrieve → act consistently, and current models fail at different stages.

+
+ +
+
+ + +
+
+

Evaluate Your Model

+
+

YC-Bench is open-source and works with any LiteLLM-compatible model. To run an evaluation:

+
git clone https://github.com/collinear-ai/yc-bench
+cd yc-bench && uv sync
+
+# Set your API key
+export OPENAI_API_KEY="sk-..."  # or ANTHROPIC_API_KEY, GEMINI_API_KEY, etc.
+
+# Run a single evaluation
+uv run yc-bench run --model openai/gpt-5.4 --seed 1 --config medium
+
+# Run all 3 seeds
+for seed in 1 2 3; do
+  uv run yc-bench run --model openai/gpt-5.4 --seed $seed --config medium
+done
+

Each run produces a JSON result file in results/ and a SQLite database in db/. The benchmark uses the medium preset by default (moderate deadline pressure, 200 market tasks, 8 employees). See the README for full configuration options and preset descriptions.
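The monthly fund trajectories behind the leaderboard plot are also published as a flat JSON map in docs/static/data.json (model name → "YYYY-MM" month → funds). As a minimal sketch of offline analysis against that file, the helper below (illustrative, not part of the yc-bench CLI) loads the map and ranks models by their final-month funds:

```python
import json


def final_funds_ranking(path: str) -> list[tuple[str, float]]:
    """Rank models by their funds in the last recorded month.

    Expects the flat structure used by docs/static/data.json:
    {"model-name": {"2025-01": 200000, ..., "2025-12": ...}, ...}
    """
    with open(path) as f:
        data = json.load(f)
    ranking = []
    for model, monthly in data.items():
        # Zero-padded "YYYY-MM" keys sort lexicographically, so max()
        # picks the latest month without any date parsing.
        last_month = max(monthly)
        ranking.append((model, monthly[last_month]))
    # Highest final funds first.
    return sorted(ranking, key=lambda kv: kv[1], reverse=True)
```

Run against this repo's data file, e.g. `final_funds_ranking("docs/static/data.json")`, the top entries should match the leaderboard, with claude-opus-4-6 first at roughly $1.27M.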

+
+
+
+ +
+
+
+

BibTeX

+
@misc{collinear-ai2025ycbench,
+  author = {He, Muyu and Jain, Adit and Kumar, Anand and Tu, Vincent and Bakshi, Soumyadeep and Patro, Sachin and Rajani, Nazneen},
+  title = {{YC-Bench}: Benchmarking {AI} Agents for Long-Term Planning and Consistent Execution},
+  year = {2025},
+  howpublished = {\url{https://github.com/collinear-ai/yc-bench}},
+}
+
+
+
+ + + + + + + diff --git a/docs/static/data.json b/docs/static/data.json new file mode 100644 index 0000000..f9af87e --- /dev/null +++ b/docs/static/data.json @@ -0,0 +1,184 @@ +{ + "glm-5": { + "2025-01": 200000, + "2025-02": 214138.36000000002, + "2025-03": 269224.0, + "2025-04": 341712.67333333334, + "2025-05": 487326.1166666667, + "2025-06": 620377.38, + "2025-07": 741208.8766666666, + "2025-08": 849482.7666666666, + "2025-09": 994341.61, + "2025-10": 1073679.43, + "2025-11": 1170285.2, + "2025-12": 1208190.0766666667 + }, + "kimi-k2.5": { + "2025-01": 200000, + "2025-02": 224139.04, + "2025-03": 209918.86666666667, + "2025-04": 208700.43333333335, + "2025-05": 240942.53000000003, + "2025-06": 215659.42333333334, + "2025-07": 201195.54666666663, + "2025-08": 216220.04666666666, + "2025-09": 219736.64666666664, + "2025-10": 296899.57, + "2025-11": 374644.36000000004, + "2025-12": 408821.86000000004 + }, + "qwen3.5-397b-a17b": { + "2025-01": 200000, + "2025-02": 177147.81666666665, + "2025-03": 121508.67666666668, + "2025-04": 80177.87666666666, + "2025-05": 29854.703333333335, + "2025-06": 18466.49, + "2025-07": 45042.15, + "2025-08": 45603.44, + "2025-09": 47883.573333333334, + "2025-10": 46848.473333333335, + "2025-11": 39942.89, + "2025-12": 90787.36333333334 + }, + "claude-opus-4-6": { + "2025-01": 200000, + "2025-02": 200564.68666666668, + "2025-03": 302163.43333333335, + "2025-04": 478133.33666666667, + "2025-05": 644443.2933333333, + "2025-06": 716843.6733333333, + "2025-07": 808076.8933333334, + "2025-08": 876622.0033333333, + "2025-09": 934738.1333333333, + "2025-10": 1079756.2666666666, + "2025-11": 1150021.2333333334, + "2025-12": 1269734.3466666667 + }, + "claude-sonnet-4-6": { + "2025-01": 200000, + "2025-02": 138038.02666666667, + "2025-03": 97319.72333333333, + "2025-04": 54087.41333333333, + "2025-05": 31477.03, + "2025-06": 32432.24666666667, + "2025-07": 39156.85, + "2025-08": 26806.873333333333, + "2025-09": 59911.12333333333, + "2025-10": 
57028.753333333334, + "2025-11": 71174.62, + "2025-12": 104431.98666666668 + }, + "gemini-3-flash-preview": { + "2025-01": 200000, + "2025-02": 225444.52333333335, + "2025-03": 193552.44000000003, + "2025-04": 209580.71, + "2025-05": 177854.27666666664, + "2025-06": 163043.1, + "2025-07": 152712.84, + "2025-08": 208321.31333333332, + "2025-09": 267753.8333333333, + "2025-10": 296063.37666666665, + "2025-11": 290458.57666666666, + "2025-12": 393735.5466666667 + }, + "gemini-3.1-flash-lite-preview": { + "2025-01": 200000, + "2025-02": 201501.4, + "2025-03": 182152.89333333334, + "2025-04": 157992.29, + "2025-05": 168416.97333333333, + "2025-06": 218097.67, + "2025-07": 214683.26, + "2025-08": 214116.43000000002, + "2025-09": 257238.34333333335, + "2025-10": 246642.62, + "2025-11": 227082.25333333333, + "2025-12": 202899.33333333334 + }, + "gemini-3.1-pro-preview": { + "2025-01": 200000, + "2025-02": 216265.25, + "2025-03": 201200.18333333332, + "2025-04": 185643.62666666668, + "2025-05": 145523.10333333333, + "2025-06": 133887.75333333333, + "2025-07": 93736.46666666667, + "2025-08": 87567.81666666667, + "2025-09": 80444.06333333334, + "2025-10": 75586.65000000001, + "2025-11": 72824.79666666668, + "2025-12": 66104.01666666666 + }, + "greedy_bot": { + "2025-01": 200000, + "2025-02": 195083.7233333333, + "2025-03": 160214.76666666666, + "2025-04": 129408.26000000001, + "2025-05": 96258.86, + "2025-06": 55133.556666666664, + "2025-07": 19994.2, + "2025-08": 3755.5866666666666, + "2025-09": 0.0, + "2025-10": 0.0, + "2025-11": 0.0, + "2025-12": 0.0 + }, + "gpt-5.4-mini": { + "2025-01": 200000, + "2025-02": 197458.24, + "2025-03": 152852.71333333335, + "2025-04": 141545.38333333333, + "2025-05": 98890.34000000001, + "2025-06": 78264.45666666667, + "2025-07": 72373.06666666667, + "2025-08": 70060.39333333333, + "2025-09": 96610.55666666666, + "2025-10": 130051.49666666666, + "2025-11": 155125.25333333333, + "2025-12": 137648.50666666668 + }, + "gpt-5.4-nano": { + 
"2025-01": 200000, + "2025-02": 200902.8633333333, + "2025-03": 162820.65666666665, + "2025-04": 142168.1366666667, + "2025-05": 122985.65666666666, + "2025-06": 101876.77333333333, + "2025-07": 74880.49666666666, + "2025-08": 65333.486666666664, + "2025-09": 77696.05333333333, + "2025-10": 47863.73, + "2025-11": 25231.403333333335, + "2025-12": 39388.29666666667 + }, + "gpt-5.4": { + "2025-01": 200000, + "2025-02": 204629.84, + "2025-03": 194126.80000000002, + "2025-04": 209434.79333333333, + "2025-05": 257438.62000000002, + "2025-06": 339387.7, + "2025-07": 463976.23, + "2025-08": 600813.4833333334, + "2025-09": 626884.91, + "2025-10": 804869.02, + "2025-11": 939753.0866666666, + "2025-12": 1000803.4833333334 + }, + "grok-4.20-beta": { + "2025-01": 200000, + "2025-02": 206494.06333333332, + "2025-03": 165557.50333333333, + "2025-04": 158310.79666666666, + "2025-05": 133330.17, + "2025-06": 94587.42333333332, + "2025-07": 84629.66666666667, + "2025-08": 72742.33333333333, + "2025-09": 44473.276666666665, + "2025-10": 47249.363333333335, + "2025-11": 36218.026666666665, + "2025-12": 24874.28 + } +} \ No newline at end of file diff --git a/docs/static/images/adversarial_combined-a.png b/docs/static/images/adversarial_combined-a.png new file mode 100644 index 0000000..e19e191 Binary files /dev/null and b/docs/static/images/adversarial_combined-a.png differ diff --git a/docs/static/images/adversarial_combined-b.png b/docs/static/images/adversarial_combined-b.png new file mode 100644 index 0000000..8249ba0 Binary files /dev/null and b/docs/static/images/adversarial_combined-b.png differ diff --git a/docs/static/images/cat-point.png b/docs/static/images/cat-point.png new file mode 100644 index 0000000..2fabdf4 Binary files /dev/null and b/docs/static/images/cat-point.png differ diff --git a/docs/static/images/error_analysis_grid.png b/docs/static/images/error_analysis_grid.png new file mode 100644 index 0000000..7a9e45d Binary files /dev/null and 
b/docs/static/images/error_analysis_grid.png differ diff --git a/docs/static/images/failure_and_cost-a.png b/docs/static/images/failure_and_cost-a.png new file mode 100644 index 0000000..e4f3476 Binary files /dev/null and b/docs/static/images/failure_and_cost-a.png differ diff --git a/docs/static/images/failure_and_cost-b.png b/docs/static/images/failure_and_cost-b.png new file mode 100644 index 0000000..c67a4eb Binary files /dev/null and b/docs/static/images/failure_and_cost-b.png differ diff --git a/docs/static/images/funds_averaged_main.png b/docs/static/images/funds_averaged_main.png new file mode 100644 index 0000000..bba45e4 Binary files /dev/null and b/docs/static/images/funds_averaged_main.png differ diff --git a/docs/static/images/logos/Qwen_logo.svg.png b/docs/static/images/logos/Qwen_logo.svg.png new file mode 100644 index 0000000..c3ef158 Binary files /dev/null and b/docs/static/images/logos/Qwen_logo.svg.png differ diff --git a/docs/static/images/logos/Z.ai_(company_logo).svg.png b/docs/static/images/logos/Z.ai_(company_logo).svg.png new file mode 100644 index 0000000..4c53282 Binary files /dev/null and b/docs/static/images/logos/Z.ai_(company_logo).svg.png differ diff --git a/docs/static/images/logos/claude-color.png b/docs/static/images/logos/claude-color.png new file mode 100644 index 0000000..e29be47 Binary files /dev/null and b/docs/static/images/logos/claude-color.png differ diff --git a/docs/static/images/logos/gemini-color.png b/docs/static/images/logos/gemini-color.png new file mode 100644 index 0000000..e539633 Binary files /dev/null and b/docs/static/images/logos/gemini-color.png differ diff --git a/docs/static/images/logos/grok.png b/docs/static/images/logos/grok.png new file mode 100644 index 0000000..e11409e Binary files /dev/null and b/docs/static/images/logos/grok.png differ diff --git a/docs/static/images/logos/moonshotlogo.jpeg b/docs/static/images/logos/moonshotlogo.jpeg new file mode 100644 index 0000000..c0d8127 Binary files 
/dev/null and b/docs/static/images/logos/moonshotlogo.jpeg differ diff --git a/docs/static/images/logos/openai_logo_icon_248315.png b/docs/static/images/logos/openai_logo_icon_248315.png new file mode 100644 index 0000000..ac3da71 Binary files /dev/null and b/docs/static/images/logos/openai_logo_icon_248315.png differ diff --git a/docs/static/images/system_architecture.png b/docs/static/images/system_architecture.png new file mode 100644 index 0000000..913772a Binary files /dev/null and b/docs/static/images/system_architecture.png differ diff --git a/docs/static/images/trust_combined-a.png b/docs/static/images/trust_combined-a.png new file mode 100644 index 0000000..4548c61 Binary files /dev/null and b/docs/static/images/trust_combined-a.png differ diff --git a/docs/static/images/trust_combined-b.png b/docs/static/images/trust_combined-b.png new file mode 100644 index 0000000..39867b6 Binary files /dev/null and b/docs/static/images/trust_combined-b.png differ diff --git a/docs/static/images/yc_bench.png b/docs/static/images/yc_bench.png new file mode 100644 index 0000000..cc026e2 Binary files /dev/null and b/docs/static/images/yc_bench.png differ