yc-bench

mirror of https://github.com/collinear-ai/yc-bench.git synced 2026-04-19 12:58:03 +00:00

Author	SHA1	Message	Date
Vincent Tu	bfb0c88062	Merge pull request #26 from collinear-ai/vincent/gemma4-31b-results gemma 4 31b results; went bankrupt!	2026-04-04 20:21:23 -07:00
alckasoc	a4a8208022	gemma 4 31b results; went bankrupt!	2026-04-04 20:20:11 -07:00
Vincent Tu	d253c58782	Merge pull request #25 from collinear-ai/vincent/website minor website update!	2026-04-04 17:59:47 -07:00
alckasoc	bce35279cb	minor website update!	2026-04-04 17:58:15 -07:00
Vincent Tu	e1cd26e36e	Merge pull request #24 from collinear-ai/vincent/website Update Website	2026-04-04 17:50:45 -07:00
alckasoc	f54585df5e	update website	2026-04-04 17:33:48 -07:00
Nazneen Rajani	ffd77905ae	Merge pull request #23 from collinear-ai/nazneenrajani-patch-1 Revise citation for YC-Bench in README	2026-04-04 16:55:25 -07:00
Nazneen Rajani	a9e3df8827	Revise citation for YC-Bench in README Updated citation details in the README file.	2026-04-04 16:55:13 -07:00
Anand Kumar	a5cee60c77	Merge pull request #21 from collinear-ai/vincent/readme update readme; clean up unused files; black formatting	2026-04-03 20:11:40 -07:00
alckasoc	faacc5886c	update webpage arxiv link	2026-04-03 15:39:16 -07:00
alckasoc	38eaea7d0c	update readme; clean up unused files; black formatting	2026-04-01 14:44:39 -07:00
Vincent Tu	97b1bdb2e0	Merge pull request #20 from collinear-ai/vincent/webpage GitHub Webpage	2026-04-01 13:56:52 -07:00
alckasoc	556a35363d	update index html	2026-04-01 13:56:42 -07:00
alckasoc	6eba7a9854	Add static docs site for GitHub Pages Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 13:32:58 -07:00
RiddleHe	0c53c98f01	Merge pull request #19 from collinear-ai/results/main Results/main	2026-03-23 21:22:42 -07:00
RiddleHe	5f1a1dd185	Merge branch 'main' into results/main	2026-03-23 19:19:38 -07:00
Muyu He	f1d5f63aaa	Implemented safe rerun; fixed skill division bug	2026-03-23 19:14:47 -07:00
RiddleHe	93b4ff92b7	Merge pull request #17 from alckasoc/main bot runner scope creep	2026-03-20 19:10:38 -07:00
Vincent Tu	97a7fd69e9	Merge branch 'collinear-ai:main' into main	2026-03-20 19:09:17 -07:00
alckasoc	f95861aeb9	scope creep bot runner	2026-03-20 19:08:44 -07:00
Muyu He	2f38babba6	Fixed bot with RAT feature	2026-03-20 19:06:04 -07:00
Muyu He	6a34a1d572	Updated prompt / commands	2026-03-20 18:57:18 -07:00
RiddleHe	35467c050a	Merge pull request #16 from alckasoc/main fix seeding	2026-03-20 18:53:47 -07:00
alckasoc	b043b690c3	fix seeding	2026-03-20 18:43:19 -07:00
RiddleHe	e011030e57	Merge pull request #15 from alckasoc/main logging and plotting code + run sh	2026-03-20 17:33:18 -07:00
alckasoc	f76f5be652	calibrating + bug fix tool_choice="auto" for 5.4 mini/nano	2026-03-20 16:27:30 -07:00
alckasoc	d829b07e60	update prompt	2026-03-20 06:01:04 -07:00
alckasoc	3827464380	logging and plotting	2026-03-20 05:19:56 -07:00
Anand Kumar	e71aac14c2	Merge pull request #14 from alckasoc/results/main remove scratchpad read	2026-03-19 20:28:38 -07:00
Anand Kumar	64941fdc20	Merge pull request #12 from collinear-ai/results/main Results/main	2026-03-19 20:18:13 -07:00
alckasoc	04d945f5d9	remove scratchpad read	2026-03-19 19:24:09 -07:00
Muyu He	ef7c64b5cb	Updated design mds	2026-03-19 18:39:57 -07:00
Muyu He	e049140beb	Updated client loyalty feature	2026-03-19 17:52:49 -07:00
Muyu He	b6f664557c	Removed browse limit from bot runner	2026-03-19 13:44:31 -07:00
Muyu He	4b8641a4c6	Changed default config for reward	2026-03-16 18:32:59 -07:00
Muyu He	140bb58653	Capped skill rate at 10 + removed reward mult from clients	2026-03-16 16:09:17 -07:00
Adit Jain	d976b9cbb4	Merge pull request #11 from collinear-ai/feat/multi-episode Add multi-episode setting with scratchpad carryover	2026-03-13 18:21:37 -07:00
Adit Jain	bc633496fa	Merge pull request #10 from alckasoc/vincent/client_trust Client Trust	2026-03-12 17:07:03 -07:00
alckasoc	ebfce99643	fix sim resume	2026-03-12 12:21:42 -07:00
alckasoc	70ae316f27	improved system design, more intuitive hparams, updated configs, greedy bot updates	2026-03-12 12:12:47 -07:00
adit jain	01535c2042	Add multi-episode setting with scratchpad carryover between bankruptcies When an agent goes bankrupt, the simulation can now restart for another episode while preserving the scratchpad from the previous attempt. This lets us measure whether LLMs can learn from failure via persistent notes. Each episode gets its own SQLite DB (.ep1.db, .ep2.db, ...) so plotting scripts and post-hoc analysis work unchanged. The rollout JSON aggregates per-episode transcripts, turns, and costs. Key changes: - --max-episodes CLI flag (default 1, fully backward compatible) - Per-episode DB files when max_episodes > 1 - Scratchpad read from old DB, written into fresh DB between episodes - RunState tracks episode results with finish_episode/reset_for_new_episode - Agent prompt tells it about the episode number and to read its scratchpad - Plotting script for multi-episode fund curves + scratchpad evolution Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 19:22:32 -07:00
alckasoc	3d20bee609	client trust and system design docs	2026-03-10 14:24:13 -07:00
alckasoc	d28ccb1bb2	Merge upstream/main: greedy baseline fix + additive skill boost Resolved conflicts — combined best of both: - bot_runner.py: kept our trust-aware candidate building + upstream's tier-avg rates + no task cap - task_complete.py: upstream's additive skill boost (nerfs greedy snowball) + our configurable cap (wc.skill_rate_max instead of hardcoded 10) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 17:39:58 -07:00
alckasoc	11f4b89144	Add multi-strategy client trust system with tiers, specialties, and idle-turn fix - Hide exact reward_multiplier from agent; show tier (Standard/Premium/Enterprise) and specialty domains instead - Add client domain specialization with 70% bias on task generation toward client specialties - Remove qty_scale by multiplier (leaked info and doubly punished high-mult clients) - Rewrite agent prompt to describe tiers/specialties without exact formulas - Fix critical loop.py bug: provide full state context after sim resume (prevents idle multi-month skips) - Add Streamlit dashboard, watch scripts, and updated plotting/extraction Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 17:37:49 -07:00
RiddleHe	a38b9f4135	Merge pull request #9 from collinear-ai/feat/fixed_greedy Fixed greedy baseline and lowered min val of employee skills	2026-03-09 17:27:52 -07:00
alckasoc	7daccf003a	update toml and uv lock	2026-03-09 16:40:51 -07:00
Muyu He	ec104d57aa	Fixed greedy baseline and lowered min val of employee skills	2026-03-09 15:18:49 -07:00
alckasoc	27ca13afbc	Merge remote-tracking branch 'upstream/main' into vincent/client_trust	2026-03-09 14:54:38 -07:00
RiddleHe	98aab68b57	Merge pull request #8 from collinear-ai/system-design-docs Add system design documentation for yc-bench	2026-03-09 13:02:25 -07:00
alckasoc	86eabf6697	init	2026-03-08 17:40:10 -07:00


93 commits
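
Commit 01535c2042 above describes the multi-episode setting: each episode gets its own SQLite DB (.ep1.db, .ep2.db, ...) when max_episodes > 1, and the scratchpad is read from the old DB and written into the fresh one between episodes. A minimal sketch of that carryover follows. It is not the repository's code: the function names, the `scratchpad` table, and its `id`/`content` columns are all hypothetical and chosen only for illustration.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def episode_db_path(base: Path, episode: int, max_episodes: int) -> Path:
    """Hypothetical helper: pick the DB file for a given episode."""
    if max_episodes <= 1:
        return base  # single-episode runs keep the original file name
    return base.with_name(f"{base.stem}.ep{episode}.db")


def write_scratchpad(db_path: Path, content: str) -> None:
    """Store the scratchpad text (assumed schema: one row in `scratchpad`)."""
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS scratchpad "
            "(id INTEGER PRIMARY KEY, content TEXT)"
        )
        conn.execute(
            "INSERT OR REPLACE INTO scratchpad (id, content) VALUES (1, ?)",
            (content,),
        )
        conn.commit()


def read_scratchpad(db_path: Path) -> str:
    """Return the stored scratchpad text, or an empty string if no row exists."""
    with closing(sqlite3.connect(db_path)) as conn:
        row = conn.execute(
            "SELECT content FROM scratchpad WHERE id = 1"
        ).fetchone()
    return row[0] if row else ""


def carry_over(base: Path, finished_episode: int, max_episodes: int) -> Path:
    """Copy the scratchpad from the finished episode's DB into the next one."""
    old_db = episode_db_path(base, finished_episode, max_episodes)
    new_db = episode_db_path(base, finished_episode + 1, max_episodes)
    write_scratchpad(new_db, read_scratchpad(old_db))
    return new_db
```

Keeping one file per episode means any tool that reads a single DB can be pointed at an individual episode file without changes, which matches the commit's note that plotting scripts and post-hoc analysis work unchanged.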