yc-bench

mirror of https://github.com/collinear-ai/yc-bench.git synced 2026-04-29 17:35:12 +00:00

Author	SHA1	Message	Date
adit jain	db7d9f218a	Add db/ source files that were blocked by overly broad gitignore. The old `db/` pattern in .gitignore matched src/yc_bench/db/ too, preventing all ORM models and session.py from being committed. Previous commit fixed .gitignore to `/db/`; this adds the 10 missing files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 21:19:45 -08:00
adit jain	a11b2828a9	Fix fresh install: add missing __init__.py and fix .gitignore. Fresh clones failed with ModuleNotFoundError because agent/, db/, runner/, and services/ subpackages had no __init__.py. Also anchor /db/ and /logs/ in .gitignore so they don't match src/yc_bench/db/. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 21:15:17 -08:00
adit jain	75a53de25c	Add interactive quickstart: `yc-bench start` and one-line `start.sh`. 3-step interactive flow: pick difficulty (with custom preset builder), choose model from curated list (Claude, GPT, Gemini, DeepSeek, etc.), enter API key (auto-detected by prefix). Single curl command to get started: `curl -sSL https://raw.githubusercontent.com/.../start.sh | bash` Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 21:10:56 -08:00
adit jain	5d2962073d	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results. Bug fixes: - CLI --horizon-years defaulted to 3, silently overriding config presets. Now defaults to None so config value (1yr for medium/hard/nightmare) is used. - Runtime passed a single api_key kwarg regardless of provider, breaking Gemini. Now lets LiteLLM resolve keys from provider-specific env vars. - Removed temperature+top_p from LLM calls (Anthropic rejects both together). - DB and result filenames now include config name to prevent cross-config collisions. Benchmark results (1yr horizon, 3 seeds each): Sonnet 4.6: medium 2/3, hard 0/3, nightmare 1/3; Gemini Flash: medium 3/3, hard 1/3, nightmare 1/3. Gemini has higher win rates (93-98% vs 40-83% on medium). Sonnet's ceiling is higher when it survives (nightmare $10.1M vs $478K). New scripts: plot_comparison.py, plot_sonnet_results.py, notepad_gif.py. Updated README with detailed comparison tables and failure analysis. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 00:31:00 -08:00
adit jain	d1d7bc97b5	Add 5-level difficulty gradient: tutorial → easy → medium → hard → nightmare. Each config is 1-year, no turn limit, testing progressively deeper understanding of the simulation dynamics: - tutorial: basic loop (accept→assign→dispatch→resume) - easy: throughput awareness (rate/N dilution kills parallelism) - medium: prestige strategy (must specialise 2-3 domains to unlock market) - hard: ETA computation (one bad accept degrades in-flight tasks) - nightmare: sustained perfection (5.4mo runway, must reach prestige 5 or die). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-25 19:33:55 -08:00
adit jain	3a1c562827	Initial commit	2026-02-25 02:16:35 -08:00

6 commits
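
The horizon fix described in commit 5d2962073d (a CLI flag defaulting to None so that the config preset is used unless the flag is given) can be sketched roughly as below. This is a minimal illustration using argparse, not the repository's code: the parser, the `PRESET_HORIZON_YEARS` table, the `--config` flag, and `resolve_horizon` are hypothetical names; only the `--horizon-years` flag and the 1-year value for medium/hard/nightmare come from the commit message.

```python
import argparse

# Hypothetical preset table. The 1-year horizon for these three presets
# is taken from the commit message; the structure itself is an assumption.
PRESET_HORIZON_YEARS = {"medium": 1, "hard": 1, "nightmare": 1}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="yc-bench-sketch")
    # Hypothetical flag for picking a preset by name.
    parser.add_argument(
        "--config", default="medium", choices=sorted(PRESET_HORIZON_YEARS)
    )
    # default=None means "flag not given". A concrete default such as 3
    # would be indistinguishable from an explicit value and would
    # silently override the preset.
    parser.add_argument("--horizon-years", type=int, default=None)
    return parser


def resolve_horizon(argv: list[str]) -> int:
    """Return the explicit CLI horizon if given, else the preset's value."""
    args = build_parser().parse_args(argv)
    if args.horizon_years is not None:
        return args.horizon_years
    return PRESET_HORIZON_YEARS[args.config]
```

With this shape, `resolve_horizon(["--config", "hard"])` falls back to the preset, while `resolve_horizon(["--config", "hard", "--horizon-years", "3"])` honours the explicit override.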