yc-bench

mirror of https://github.com/collinear-ai/yc-bench.git synced 2026-04-19 12:58:03 +00:00

Author	SHA1	Message	Date
alckasoc	b043b690c3	fix seeding	2026-03-20 18:43:19 -07:00
alckasoc	3827464380	logging and plotting	2026-03-20 05:19:56 -07:00
Muyu He	e049140beb	Updated client loyalty feature	2026-03-19 17:52:49 -07:00
Muyu He	b6f664557c	Removed browse limit from bot runner	2026-03-19 13:44:31 -07:00
Muyu He	140bb58653	Capped skill rate at 10 + removed reward mult from clients	2026-03-16 16:09:17 -07:00
Adit Jain	d976b9cbb4	Merge pull request #11 from collinear-ai/feat/multi-episode Add multi-episode setting with scratchpad carryover	2026-03-13 18:21:37 -07:00
alckasoc	70ae316f27	improved system design, more intuitive hparams, updated configs, greedy bot updates	2026-03-12 12:12:47 -07:00
adit jain	01535c2042	Add multi-episode setting with scratchpad carryover between bankruptcies When an agent goes bankrupt, the simulation can now restart for another episode while preserving the scratchpad from the previous attempt. This lets us measure whether LLMs can learn from failure via persistent notes. Each episode gets its own SQLite DB (.ep1.db, .ep2.db, ...) so plotting scripts and post-hoc analysis work unchanged. The rollout JSON aggregates per-episode transcripts, turns, and costs. Key changes: - --max-episodes CLI flag (default 1, fully backward compatible) - Per-episode DB files when max_episodes > 1 - Scratchpad read from old DB, written into fresh DB between episodes - RunState tracks episode results with finish_episode/reset_for_new_episode - Agent prompt tells it about the episode number and to read its scratchpad - Plotting script for multi-episode fund curves + scratchpad evolution Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 19:22:32 -07:00
alckasoc	d28ccb1bb2	Merge upstream/main: greedy baseline fix + additive skill boost Resolved conflicts — combined best of both: - bot_runner.py: kept our trust-aware candidate building + upstream's tier-avg rates + no task cap - task_complete.py: upstream's additive skill boost (nerfs greedy snowball) + our configurable cap (wc.skill_rate_max instead of hardcoded 10) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 17:39:58 -07:00
alckasoc	11f4b89144	Add multi-strategy client trust system with tiers, specialties, and idle-turn fix - Hide exact reward_multiplier from agent; show tier (Standard/Premium/Enterprise) and specialty domains instead - Add client domain specialization with 70% bias on task generation toward client specialties - Remove qty_scale by multiplier (leaked info and doubly punished high-mult clients) - Rewrite agent prompt to describe tiers/specialties without exact formulas - Fix critical loop.py bug: provide full state context after sim resume (prevents idle multi-month skips) - Add Streamlit dashboard, watch scripts, and updated plotting/extraction Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 17:37:49 -07:00
Muyu He	ec104d57aa	Fixed greedy baseline and lowered min val of employee skills	2026-03-09 15:18:49 -07:00
alckasoc	86eabf6697	init	2026-03-08 17:40:10 -07:00
Muyu He	8c949db160	Fixed task difficulty with base reward & deadline change	2026-03-06 18:08:11 -08:00
Muyu He	99e69190ec	Calibrated domain prestige bump	2026-03-06 14:40:45 -08:00
Muyu He	5671e0102f	Calibrated task difficulty based on deadlines	2026-03-06 11:18:22 -08:00
Muyu He	eb18c5a90c	Updated backend to calculate employee tier with spiky skill distribution; simplified domain count to 4	2026-03-05 18:12:48 -08:00
adit jain	763ed3d750	Rename Greedy Bot to Human Devised Rule, remove other bot baselines from plots Updated both plot_comparison.py and plot_prestige_radar.py to show only the greedy bot baseline renamed as "Human Devised Rule". Regenerated both plots. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 14:03:04 -08:00
adit jain	e9aa362772	Add prestige radar chart comparing domain specialization across models New radar plot (7 domains × 4 models × 3 configs × 3 seeds) shows final prestige fingerprints. Added plot script and README section. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 12:45:04 -08:00
adit jain	5f31969865	Add Collinear branding, bot runners, and clean up stale plots - Restyle plot_comparison.py with Collinear brand palette and logo - Add collinear_logo.svg and collinear_wordmark.svg - Add bot_runner.py (greedy/random/throughput/prestige strategies) - Add greedy_bot.py shim - Remove old unused plots (funds_curves, notepad gifs, sonnet_results) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 21:12:05 -08:00
adit jain	3643806dce	Added the configs and updated the results.	2026-02-26 13:37:58 -08:00
adit jain	5d2962073d	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results Bug fixes: - CLI --horizon-years defaulted to 3, silently overriding config presets. Now defaults to None so config value (1yr for medium/hard/nightmare) is used. - Runtime passed a single api_key kwarg regardless of provider, breaking Gemini. Now lets LiteLLM resolve keys from provider-specific env vars. - Removed temperature+top_p from LLM calls (Anthropic rejects both together). - DB and result filenames now include config name to prevent cross-config collisions. Benchmark results (1yr horizon, 3 seeds each): Sonnet 4.6: medium 2/3, hard 0/3, nightmare 1/3 Gemini Flash: medium 3/3, hard 1/3, nightmare 1/3 Gemini has higher win rates (93-98% vs 40-83% on medium). Sonnet's ceiling is higher when it survives (nightmare $10.1M vs $478K). New scripts: plot_comparison.py, plot_sonnet_results.py, notepad_gif.py Updated README with detailed comparison tables and failure analysis. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 00:31:00 -08:00
adit jain	3a1c562827	Initial commit	2026-02-25 02:16:35 -08:00

22 commits