yc-bench

mirror of https://github.com/collinear-ai/yc-bench.git synced 2026-04-28 17:29:35 +00:00

Author	SHA1	Message	Date
alckasoc	70ae316f27	improved system design, more intuitive hparams, updated configs, greedy bot updates	2026-03-12 12:12:47 -07:00
alckasoc	d28ccb1bb2	Merge upstream/main: greedy baseline fix + additive skill boost Resolved conflicts — combined best of both: - bot_runner.py: kept our trust-aware candidate building + upstream's tier-avg rates + no task cap - task_complete.py: upstream's additive skill boost (nerfs greedy snowball) + our configurable cap (wc.skill_rate_max instead of hardcoded 10) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 17:39:58 -07:00
alckasoc	11f4b89144	Add multi-strategy client trust system with tiers, specialties, and idle-turn fix - Hide exact reward_multiplier from agent; show tier (Standard/Premium/Enterprise) and specialty domains instead - Add client domain specialization with 70% bias on task generation toward client specialties - Remove qty_scale by multiplier (leaked info and doubly punished high-mult clients) - Rewrite agent prompt to describe tiers/specialties without exact formulas - Fix critical loop.py bug: provide full state context after sim resume (prevents idle multi-month skips) - Add Streamlit dashboard, watch scripts, and updated plotting/extraction Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 17:37:49 -07:00
Muyu He	ec104d57aa	Fixed greedy baseline and lowered min val of employee skills	2026-03-09 15:18:49 -07:00
alckasoc	86eabf6697	init	2026-03-08 17:40:10 -07:00
Muyu He	8c949db160	Fixed task difficulty with base reward & deadline change	2026-03-06 18:08:11 -08:00
Muyu He	99e69190ec	Calibrated domain prestige bump	2026-03-06 14:40:45 -08:00
Muyu He	5671e0102f	Calibrated task difficulty based on deadlines	2026-03-06 11:18:22 -08:00
Muyu He	eb18c5a90c	Updated backend to calculate employee tier with spiky skill distribution; simplified domain count to 4	2026-03-05 18:12:48 -08:00
adit jain	763ed3d750	Rename Greedy Bot to Human Devised Rule, remove other bot baselines from plots Updated both plot_comparison.py and plot_prestige_radar.py to show only the greedy bot baseline renamed as "Human Devised Rule". Regenerated both plots. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 14:03:04 -08:00
adit jain	e9aa362772	Add prestige radar chart comparing domain specialization across models New radar plot (7 domains × 4 models × 3 configs × 3 seeds) shows final prestige fingerprints. Added plot script and README section. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 12:45:04 -08:00
adit jain	5f31969865	Add Collinear branding, bot runners, and clean up stale plots - Restyle plot_comparison.py with Collinear brand palette and logo - Add collinear_logo.svg and collinear_wordmark.svg - Add bot_runner.py (greedy/random/throughput/prestige strategies) - Add greedy_bot.py shim - Remove old unused plots (funds_curves, notepad gifs, sonnet_results) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 21:12:05 -08:00
adit jain	3643806dce	Added the configs and updated the results.	2026-02-26 13:37:58 -08:00
adit jain	5d2962073d	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results Bug fixes: - CLI --horizon-years defaulted to 3, silently overriding config presets. Now defaults to None so config value (1yr for medium/hard/nightmare) is used. - Runtime passed a single api_key kwarg regardless of provider, breaking Gemini. Now lets LiteLLM resolve keys from provider-specific env vars. - Removed temperature+top_p from LLM calls (Anthropic rejects both together). - DB and result filenames now include config name to prevent cross-config collisions. Benchmark results (1yr horizon, 3 seeds each): Sonnet 4.6: medium 2/3, hard 0/3, nightmare 1/3 Gemini Flash: medium 3/3, hard 1/3, nightmare 1/3 Gemini has higher win rates (93-98% vs 40-83% on medium). Sonnet's ceiling is higher when it survives (nightmare $10.1M vs $478K). New scripts: plot_comparison.py, plot_sonnet_results.py, notepad_gif.py Updated README with detailed comparison tables and failure analysis. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 00:31:00 -08:00
adit jain	3a1c562827	Initial commit	2026-02-25 02:16:35 -08:00

15 commits
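
The horizon fix described in commit 5d2962073d (the `--horizon-years` flag defaulted to 3 and silently overrode config presets; it now defaults to None so the config value is used) can be illustrated with a minimal sketch. This is not the repository's actual code: the `--config` flag, the `CONFIG_HORIZON_YEARS` table, and the `resolve_horizon` helper are hypothetical names introduced here for illustration. Only the flag name `--horizon-years`, the None default, and the 1-year horizon for the medium/hard/nightmare presets come from the commit message.

```python
import argparse

# Hypothetical preset table. The commit message only states that the
# medium/hard/nightmare presets use a 1-year horizon.
CONFIG_HORIZON_YEARS = {"medium": 1, "hard": 1, "nightmare": 1}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Hypothetical flag for selecting a preset by name.
    parser.add_argument("--config", default="medium")
    # default=None lets us tell "flag not passed" apart from an explicit value.
    # A numeric default (such as 3) would always win over the preset.
    parser.add_argument("--horizon-years", type=int, default=None)
    return parser


def resolve_horizon(argv: list[str]) -> int:
    """Return the CLI horizon if given, otherwise the preset's horizon."""
    args = build_parser().parse_args(argv)
    if args.horizon_years is not None:
        return args.horizon_years
    return CONFIG_HORIZON_YEARS[args.config]
```

With this shape, `resolve_horizon(["--config", "hard"])` falls back to the preset, while passing `--horizon-years 3` explicitly still overrides it.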