Commit graph

70 commits

Author SHA1 Message Date
alckasoc
f95861aeb9 scope creep bot runner 2026-03-20 19:08:44 -07:00
alckasoc
b043b690c3 fix seeding 2026-03-20 18:43:19 -07:00
alckasoc
f76f5be652 calibrating + bug fix tool_choice="auto" for 5.4 mini/nano 2026-03-20 16:27:30 -07:00
alckasoc
d829b07e60 update prompt 2026-03-20 06:01:04 -07:00
alckasoc
3827464380 logging and plotting 2026-03-20 05:19:56 -07:00
Anand Kumar
e71aac14c2
Merge pull request #14 from alckasoc/results/main
remove scratchpad read
2026-03-19 20:28:38 -07:00
Anand Kumar
64941fdc20
Merge pull request #12 from collinear-ai/results/main
Results/main
2026-03-19 20:18:13 -07:00
alckasoc
04d945f5d9 remove scratchpad read 2026-03-19 19:24:09 -07:00
Muyu He
ef7c64b5cb Updated design mds 2026-03-19 18:39:57 -07:00
Muyu He
e049140beb Updated client loyalty feature 2026-03-19 17:52:49 -07:00
Muyu He
b6f664557c Removed browse limit from bot runner 2026-03-19 13:44:31 -07:00
Muyu He
4b8641a4c6 Changed default config for reward 2026-03-16 18:32:59 -07:00
Muyu He
140bb58653 Capped skill rate at 10 + removed reward mult from clients 2026-03-16 16:09:17 -07:00
Adit Jain
d976b9cbb4
Merge pull request #11 from collinear-ai/feat/multi-episode
Add multi-episode setting with scratchpad carryover
2026-03-13 18:21:37 -07:00
Adit Jain
bc633496fa
Merge pull request #10 from alckasoc/vincent/client_trust
Client Trust
2026-03-12 17:07:03 -07:00
alckasoc
ebfce99643 fix sim resume 2026-03-12 12:21:42 -07:00
alckasoc
70ae316f27 improved system design, more intuitive hparams, updated configs, greedy bot updates 2026-03-12 12:12:47 -07:00
adit jain
01535c2042 Add multi-episode setting with scratchpad carryover between bankruptcies
When an agent goes bankrupt, the simulation can now restart for another
episode while preserving the scratchpad from the previous attempt. This
lets us measure whether LLMs can learn from failure via persistent notes.

Each episode gets its own SQLite DB (*.ep1.db, *.ep2.db, ...) so plotting
scripts and post-hoc analysis work unchanged. The rollout JSON aggregates
per-episode transcripts, turns, and costs.

Key changes:
- --max-episodes CLI flag (default 1, fully backward compatible)
- Per-episode DB files when max_episodes > 1
- Scratchpad read from old DB, written into fresh DB between episodes
- RunState tracks episode results with finish_episode/reset_for_new_episode
- Agent prompt tells it about the episode number and to read its scratchpad
- Plotting script for multi-episode fund curves + scratchpad evolution

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 19:22:32 -07:00
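
Note on 01535c2042: the per-episode DB naming and scratchpad carryover described in this commit message could look roughly like the sketch below. It is not the repository's code. The `scratchpad` table layout and the helpers `episode_db_path`, `read_scratchpad`, `write_scratchpad`, and `carry_over` are hypothetical names chosen for illustration; only the `*.ep1.db`, `*.ep2.db` naming pattern and the "read from old DB, write into fresh DB" step come from the message.

```python
import sqlite3
from contextlib import closing
from pathlib import Path

# Hypothetical single-row table; the real schema is not shown in the commit log.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS scratchpad ("
    "id INTEGER PRIMARY KEY CHECK (id = 1), content TEXT NOT NULL)"
)


def episode_db_path(base: Path, episode: int) -> Path:
    """Map run.db to run.ep1.db, run.ep2.db, ... (naming pattern from the commit)."""
    return base.with_name(f"{base.stem}.ep{episode}{base.suffix}")


def write_scratchpad(db_path: Path, content: str) -> None:
    """Store the scratchpad text, creating the DB file and table if needed."""
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute(SCHEMA)
        conn.execute(
            "INSERT OR REPLACE INTO scratchpad (id, content) VALUES (1, ?)",
            (content,),
        )
        conn.commit()


def read_scratchpad(db_path: Path) -> str:
    """Return the stored scratchpad text, or an empty string if none exists."""
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute(SCHEMA)
        row = conn.execute("SELECT content FROM scratchpad WHERE id = 1").fetchone()
    return row[0] if row else ""


def carry_over(base: Path, finished_episode: int) -> Path:
    """Copy the scratchpad from the finished episode's DB into the next one."""
    old_db = episode_db_path(base, finished_episode)
    new_db = episode_db_path(base, finished_episode + 1)
    write_scratchpad(new_db, read_scratchpad(old_db))
    return new_db
```

Keeping one SQLite file per episode means any tool that reads a single DB keeps working unchanged, which matches the stated goal for the plotting scripts.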
alckasoc
3d20bee609 client trust and system design docs 2026-03-10 14:24:13 -07:00
alckasoc
d28ccb1bb2 Merge upstream/main: greedy baseline fix + additive skill boost
Resolved conflicts — combined best of both:
- bot_runner.py: kept our trust-aware candidate building + upstream's tier-avg rates + no task cap
- task_complete.py: upstream's additive skill boost (nerfs greedy snowball) + our configurable cap (wc.skill_rate_max instead of hardcoded 10)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 17:39:58 -07:00
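
Note on d28ccb1bb2: a minimal sketch of an additive skill boost with a configurable cap, assuming the cap is applied as a simple `min`. The commit message only names `wc.skill_rate_max` and the previously hardcoded value 10; the `WorldConfig` class and the `boosted_skill_rate` function are hypothetical, and the actual boost formula in task_complete.py is not shown in this log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorldConfig:
    # Hypothetical config holder; 10 is the formerly hardcoded cap named in the commit.
    skill_rate_max: float = 10.0


def boosted_skill_rate(current: float, boost: float, wc: WorldConfig) -> float:
    """Add a fixed boost to the current rate, then clamp to the configured cap."""
    return min(current + boost, wc.skill_rate_max)
```

An additive boost grows linearly with completed tasks, whereas a multiplicative one compounds, which is presumably why the message describes it as nerfing the greedy snowball.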
alckasoc
11f4b89144 Add multi-strategy client trust system with tiers, specialties, and idle-turn fix
- Hide exact reward_multiplier from agent; show tier (Standard/Premium/Enterprise) and specialty domains instead
- Add client domain specialization with 70% bias on task generation toward client specialties
- Remove qty_scale by multiplier (leaked info and doubly punished high-mult clients)
- Rewrite agent prompt to describe tiers/specialties without exact formulas
- Fix critical loop.py bug: provide full state context after sim resume (prevents idle multi-month skips)
- Add Streamlit dashboard, watch scripts, and updated plotting/extraction

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 17:37:49 -07:00
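
Note on 11f4b89144: one way to read "70% bias on task generation toward client specialties" is sketched below. This is an interpretation, not the repository's implementation: the domain names, the `pick_task_domain` function, and the uniform fallback are all assumptions made for illustration.

```python
import random

# Hypothetical domain names; the log only says the domain count was simplified to 4.
DOMAINS = ["research", "inference", "data", "training"]
SPECIALTY_BIAS = 0.7  # the 70% figure from the commit message


def pick_task_domain(specialties, rng, domains=DOMAINS, bias=SPECIALTY_BIAS):
    """With probability `bias`, draw from the client's specialties;
    otherwise draw uniformly from all domains."""
    if specialties and rng.random() < bias:
        return rng.choice(list(specialties))
    return rng.choice(list(domains))


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [pick_task_domain(["data"], rng) for _ in range(10_000)]
    print(draws.count("data") / len(draws))
```

Because the fallback branch can also land on a specialty, the observed share of specialty tasks under this reading is somewhat above 70%. A variant that excludes specialties from the fallback would hit 70% exactly.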
RiddleHe
a38b9f4135
Merge pull request #9 from collinear-ai/feat/fixed_greedy
Fixed greedy baseline and lowered min val of employee skills
2026-03-09 17:27:52 -07:00
alckasoc
7daccf003a update toml and uv lock 2026-03-09 16:40:51 -07:00
Muyu He
ec104d57aa Fixed greedy baseline and lowered min val of employee skills 2026-03-09 15:18:49 -07:00
alckasoc
27ca13afbc Merge remote-tracking branch 'upstream/main' into vincent/client_trust 2026-03-09 14:54:38 -07:00
RiddleHe
98aab68b57
Merge pull request #8 from collinear-ai/system-design-docs
Add system design documentation for yc-bench
2026-03-09 13:02:25 -07:00
alckasoc
86eabf6697 init 2026-03-08 17:40:10 -07:00
AnandK27
ecd3d9e415 Add system design documentation for yc-bench
Comprehensive documentation covering all major subsystems:
simulation engine, data models, task system, prestige, finances,
employees, agent layer, CLI interface, configuration, and runner.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 13:42:41 -07:00
Adit Jain
b1cd7ebfb2
Merge pull request #7 from collinear-ai/feat/employee_tiers
Feat/employee tiers
2026-03-07 22:04:45 -08:00
Muyu He
7f24589793 Light update of readme 2026-03-06 18:56:46 -08:00
Muyu He
a456d9c6ae Updated initial eval on new backend 2026-03-06 18:49:32 -08:00
Muyu He
8c949db160 Fixed task difficulty with base reward & deadline change 2026-03-06 18:08:11 -08:00
Adit Jain
542d3b9836
Merge pull request #6 from collinear-ai/feat/employee_tiers
Updated backend to calculate employee tier
2026-03-06 14:45:33 -08:00
Muyu He
99e69190ec Calibrated domain prestige bump 2026-03-06 14:40:45 -08:00
Muyu He
5671e0102f Calibrated task difficulty based on deadlines 2026-03-06 11:18:22 -08:00
Muyu He
eb18c5a90c Updated backend to calculate employee tier with spiky skill distribution; simplified domain count to 4 2026-03-05 18:12:48 -08:00
adit jain
6d6f0a855d Rename Greedy Bot to Human Devised Rule in README
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 16:21:32 -08:00
adit jain
763ed3d750 Rename Greedy Bot to Human Devised Rule, remove other bot baselines from plots
Updated both plot_comparison.py and plot_prestige_radar.py to show only
the greedy bot baseline renamed as "Human Devised Rule". Regenerated
both plots.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 14:03:04 -08:00
Adit Jain
89065f3487
Delete Sonnet results section from README
Removed Sonnet-only results section and associated image.
2026-02-28 02:39:53 +05:30
Adit Jain
91455bbca2
Merge pull request #5 from collinear-ai/fresh-main
Add prestige radar chart comparing domain specialization across models
2026-02-27 12:45:46 -08:00
adit jain
e9aa362772 Add prestige radar chart comparing domain specialization across models
New radar plot (7 domains × 4 models × 3 configs × 3 seeds) shows final
prestige fingerprints. Added plot script and README section.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 12:45:04 -08:00
Adit Jain
81664f69bb
Merge pull request #4 from collinear-ai/fresh-main
Fixing start.sh
2026-02-26 22:20:38 -08:00
Adit Jain
5eebd80b2f
Merge branch 'main' into fresh-main 2026-02-26 22:20:24 -08:00
adit jain
95c6583053 Add live dashboard section to README
Document the new Rich terminal dashboard with ASCII mockup,
feature list, and --no-live flag usage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 22:16:20 -08:00
adit jain
f25a2be1e4 Add live terminal dashboard with Rich
Replace scrolling LiteLLM debug logs with an in-place Rich Live dashboard
that shows key metrics after each turn: funds sparkline, task progress bars
with colored domain labels, team skill bars, runway urgency, and more.

- New: src/yc_bench/runner/dashboard.py (BenchmarkDashboard, DashboardState)
- Add on_turn/on_turn_start callbacks to agent loop
- Auto-detect TTY, redirect all logging to logs/debug.log when live
- Add --no-live flag to disable dashboard and get old log output
- Use alternate screen buffer (screen=True) for clean rendering
- Fix start.sh: clean up stale temp files before mktemp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 22:13:32 -08:00
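
Note on f25a2be1e4: the TTY auto-detection and log redirection listed in this commit might be wired up along these lines. This is a sketch under assumptions, not the code in dashboard.py: the helper names `use_live_dashboard` and `configure_logging` are hypothetical, and checking `sys.stdout` (rather than another stream) is a guess. Only the `--no-live` flag and the `logs/debug.log` path come from the message.

```python
import logging
import sys
from pathlib import Path


def use_live_dashboard(no_live: bool, stream=None) -> bool:
    """Enable the dashboard only when --no-live is absent and output is a terminal."""
    stream = sys.stdout if stream is None else stream
    return (not no_live) and stream.isatty()


def configure_logging(live: bool, log_path: Path = Path("logs/debug.log")) -> logging.Handler:
    """Send logs to a file while the dashboard owns the screen, else to stderr."""
    if live:
        log_path.parent.mkdir(parents=True, exist_ok=True)
        handler: logging.Handler = logging.FileHandler(log_path)
    else:
        handler = logging.StreamHandler()
    root = logging.getLogger()
    root.handlers = [handler]  # replace existing handlers so nothing else writes to the screen
    root.setLevel(logging.DEBUG)
    return handler
```

Routing logs to a file while live matters because any stray write to the terminal would corrupt an in-place display.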
adit jain
d4ce0a1e5a Fix start.sh: re-download and re-exec when piped via curl
When run as `curl ... | bash`, stdin is the pipe so Rich prompts
abort immediately. Now detects non-tty stdin, re-downloads the script
to a temp file, and exec's it — stdin becomes the terminal again.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 22:13:32 -08:00
adit jain
040e678a76 Fix start.sh: reattach stdin to /dev/tty for curl pipe usage
When run via `curl ... | bash`, stdin is the pipe not the terminal,
causing interactive prompts to abort immediately. Adding </dev/tty
restores terminal input.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 22:13:32 -08:00
AnandK27
a406d2d9f9 readme fixes 2026-02-26 22:13:32 -08:00
Adit Jain
3281eff755 Update README with citation for YC-Bench
Added citation information for the YC-Bench project.
2026-02-26 22:13:32 -08:00
Adit Jain
2b528358a7 Fix formatting in README for discrete-event simulation 2026-02-26 22:13:32 -08:00