yc-bench

mirror of https://github.com/collinear-ai/yc-bench.git synced 2026-04-19 12:58:03 +00:00

Author	SHA1	Message	Date
alckasoc	f95861aeb9	scope creep bot runner	2026-03-20 19:08:44 -07:00
alckasoc	b043b690c3	fix seeding	2026-03-20 18:43:19 -07:00
alckasoc	f76f5be652	calibrating + bug fix tool_choice="auto" for 5.4 mini/nano	2026-03-20 16:27:30 -07:00
alckasoc	d829b07e60	update prompt	2026-03-20 06:01:04 -07:00
alckasoc	3827464380	logging and plotting	2026-03-20 05:19:56 -07:00
Anand Kumar	e71aac14c2	Merge pull request #14 from alckasoc/results/main remove scratchpad read	2026-03-19 20:28:38 -07:00
Anand Kumar	64941fdc20	Merge pull request #12 from collinear-ai/results/main Results/main	2026-03-19 20:18:13 -07:00
alckasoc	04d945f5d9	remove scratchpad read	2026-03-19 19:24:09 -07:00
Muyu He	ef7c64b5cb	Updated design mds	2026-03-19 18:39:57 -07:00
Muyu He	e049140beb	Updated client loyalty feature	2026-03-19 17:52:49 -07:00
Muyu He	b6f664557c	Removed browse limit from bot runner	2026-03-19 13:44:31 -07:00
Muyu He	4b8641a4c6	Changed default config for reward	2026-03-16 18:32:59 -07:00
Muyu He	140bb58653	Capped skill rate at 10 + removed reward mult from clients	2026-03-16 16:09:17 -07:00
Adit Jain	d976b9cbb4	Merge pull request #11 from collinear-ai/feat/multi-episode Add multi-episode setting with scratchpad carryover	2026-03-13 18:21:37 -07:00
Adit Jain	bc633496fa	Merge pull request #10 from alckasoc/vincent/client_trust Client Trust	2026-03-12 17:07:03 -07:00
alckasoc	ebfce99643	fix sim resume	2026-03-12 12:21:42 -07:00
alckasoc	70ae316f27	improved system design, more intuitive hparams, updated configs, greedy bot updates	2026-03-12 12:12:47 -07:00
adit jain	01535c2042	Add multi-episode setting with scratchpad carryover between bankruptcies When an agent goes bankrupt, the simulation can now restart for another episode while preserving the scratchpad from the previous attempt. This lets us measure whether LLMs can learn from failure via persistent notes. Each episode gets its own SQLite DB (.ep1.db, .ep2.db, ...) so plotting scripts and post-hoc analysis work unchanged. The rollout JSON aggregates per-episode transcripts, turns, and costs. Key changes: - --max-episodes CLI flag (default 1, fully backward compatible) - Per-episode DB files when max_episodes > 1 - Scratchpad read from old DB, written into fresh DB between episodes - RunState tracks episode results with finish_episode/reset_for_new_episode - Agent prompt tells it about the episode number and to read its scratchpad - Plotting script for multi-episode fund curves + scratchpad evolution Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 19:22:32 -07:00
alckasoc	3d20bee609	client trust and system design docs	2026-03-10 14:24:13 -07:00
alckasoc	d28ccb1bb2	Merge upstream/main: greedy baseline fix + additive skill boost Resolved conflicts — combined best of both: - bot_runner.py: kept our trust-aware candidate building + upstream's tier-avg rates + no task cap - task_complete.py: upstream's additive skill boost (nerfs greedy snowball) + our configurable cap (wc.skill_rate_max instead of hardcoded 10) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 17:39:58 -07:00
alckasoc	11f4b89144	Add multi-strategy client trust system with tiers, specialties, and idle-turn fix - Hide exact reward_multiplier from agent; show tier (Standard/Premium/Enterprise) and specialty domains instead - Add client domain specialization with 70% bias on task generation toward client specialties - Remove qty_scale by multiplier (leaked info and doubly punished high-mult clients) - Rewrite agent prompt to describe tiers/specialties without exact formulas - Fix critical loop.py bug: provide full state context after sim resume (prevents idle multi-month skips) - Add Streamlit dashboard, watch scripts, and updated plotting/extraction Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 17:37:49 -07:00
RiddleHe	a38b9f4135	Merge pull request #9 from collinear-ai/feat/fixed_greedy Fixed greedy baseline and lowered min val of employee skills	2026-03-09 17:27:52 -07:00
alckasoc	7daccf003a	update toml and uv lock	2026-03-09 16:40:51 -07:00
Muyu He	ec104d57aa	Fixed greedy baseline and lowered min val of employee skills	2026-03-09 15:18:49 -07:00
alckasoc	27ca13afbc	Merge remote-tracking branch 'upstream/main' into vincent/client_trust	2026-03-09 14:54:38 -07:00
RiddleHe	98aab68b57	Merge pull request #8 from collinear-ai/system-design-docs Add system design documentation for yc-bench	2026-03-09 13:02:25 -07:00
alckasoc	86eabf6697	init	2026-03-08 17:40:10 -07:00
AnandK27	ecd3d9e415	Add system design documentation for yc-bench Comprehensive documentation covering all major subsystems: simulation engine, data models, task system, prestige, finances, employees, agent layer, CLI interface, configuration, and runner. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 13:42:41 -07:00
Adit Jain	b1cd7ebfb2	Merge pull request #7 from collinear-ai/feat/employee_tiers Feat/employee tiers	2026-03-07 22:04:45 -08:00
Muyu He	7f24589793	Light update of readme	2026-03-06 18:56:46 -08:00
Muyu He	a456d9c6ae	Updated initial eval on new backend	2026-03-06 18:49:32 -08:00
Muyu He	8c949db160	Fixed task difficulty with base reward & deadline change	2026-03-06 18:08:11 -08:00
Adit Jain	542d3b9836	Merge pull request #6 from collinear-ai/feat/employee_tiers Updated backend to calculate employee tier	2026-03-06 14:45:33 -08:00
Muyu He	99e69190ec	Calibrated domain prestige bump	2026-03-06 14:40:45 -08:00
Muyu He	5671e0102f	Calibrated task difficulty based on deadlines	2026-03-06 11:18:22 -08:00
Muyu He	eb18c5a90c	Updated backend to calculate employee tier with spiky skill distribution; simplified domain count to 4	2026-03-05 18:12:48 -08:00
adit jain	6d6f0a855d	Rename Greedy Bot to Human Devised Rule in README Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 16:21:32 -08:00
adit jain	763ed3d750	Rename Greedy Bot to Human Devised Rule, remove other bot baselines from plots Updated both plot_comparison.py and plot_prestige_radar.py to show only the greedy bot baseline renamed as "Human Devised Rule". Regenerated both plots. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 14:03:04 -08:00
Adit Jain	89065f3487	Delete Sonnet results section from README Removed Sonnet-only results section and associated image.	2026-02-28 02:39:53 +05:30
Adit Jain	91455bbca2	Merge pull request #5 from collinear-ai/fresh-main Add prestige radar chart comparing domain specialization across models	2026-02-27 12:45:46 -08:00
adit jain	e9aa362772	Add prestige radar chart comparing domain specialization across models New radar plot (7 domains × 4 models × 3 configs × 3 seeds) shows final prestige fingerprints. Added plot script and README section. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 12:45:04 -08:00
Adit Jain	81664f69bb	Merge pull request #4 from collinear-ai/fresh-main Fixing start.sh	2026-02-26 22:20:38 -08:00
Adit Jain	5eebd80b2f	Merge branch 'main' into fresh-main	2026-02-26 22:20:24 -08:00
adit jain	95c6583053	Add live dashboard section to README Document the new Rich terminal dashboard with ASCII mockup, feature list, and --no-live flag usage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 22:16:20 -08:00
adit jain	f25a2be1e4	Add live terminal dashboard with Rich Replace scrolling LiteLLM debug logs with an in-place Rich Live dashboard that shows key metrics after each turn: funds sparkline, task progress bars with colored domain labels, team skill bars, runway urgency, and more. - New: src/yc_bench/runner/dashboard.py (BenchmarkDashboard, DashboardState) - Add on_turn/on_turn_start callbacks to agent loop - Auto-detect TTY, redirect all logging to logs/debug.log when live - Add --no-live flag to disable dashboard and get old log output - Use alternate screen buffer (screen=True) for clean rendering - Fix start.sh: clean up stale temp files before mktemp Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 22:13:32 -08:00
adit jain	d4ce0a1e5a	Fix start.sh: re-download and re-exec when piped via curl When run as `curl ... | bash`, stdin is the pipe so Rich prompts abort immediately. Now detects non-tty stdin, re-downloads the script to a temp file, and exec's it — stdin becomes the terminal again. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 22:13:32 -08:00
adit jain	040e678a76	Fix start.sh: reattach stdin to /dev/tty for curl pipe usage When run via `curl ... | bash`, stdin is the pipe not the terminal, causing interactive prompts to abort immediately. Adding </dev/tty restores terminal input. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 22:13:32 -08:00
AnandK27	a406d2d9f9	readme fixes	2026-02-26 22:13:32 -08:00
Adit Jain	3281eff755	Update README with citation for YC-Bench Added citation information for the YC-Bench project.	2026-02-26 22:13:32 -08:00
Adit Jain	2b528358a7	Fix formatting in README for discrete-event simulation	2026-02-26 22:13:32 -08:00

70 commits