Commit graph

20 commits

Author SHA1 Message Date
Adit Jain
d976b9cbb4
Merge pull request #11 from collinear-ai/feat/multi-episode
Add multi-episode setting with scratchpad carryover
2026-03-13 18:21:37 -07:00
alckasoc
ebfce99643 fix sim resume 2026-03-12 12:21:42 -07:00
alckasoc
70ae316f27 improved system design, more intuitive hparams, updated configs, greedy bot updates 2026-03-12 12:12:47 -07:00
adit jain
01535c2042 Add multi-episode setting with scratchpad carryover between bankruptcies
When an agent goes bankrupt, the simulation can now restart for another
episode while preserving the scratchpad from the previous attempt. This
lets us measure whether LLMs can learn from failure via persistent notes.

Each episode gets its own SQLite DB (*.ep1.db, *.ep2.db, ...) so plotting
scripts and post-hoc analysis work unchanged. The rollout JSON aggregates
per-episode transcripts, turns, and costs.

Key changes:
- --max-episodes CLI flag (default 1, fully backward compatible)
- Per-episode DB files when max_episodes > 1
- Scratchpad read from old DB, written into fresh DB between episodes
- RunState tracks episode results with finish_episode/reset_for_new_episode
- Agent prompt tells it about the episode number and to read its scratchpad
- Plotting script for multi-episode fund curves + scratchpad evolution

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 19:22:32 -07:00
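The scratchpad-carryover and per-episode DB naming described in this commit can be sketched as follows. This is a minimal illustration, not the repo's actual code: the `scratchpad` table name, column, and function names are assumptions for the sake of the example.

```python
import sqlite3
from pathlib import Path

def carry_over_scratchpad(old_db: Path, new_db: Path) -> None:
    """Copy the agent's scratchpad from the previous episode's DB into a
    fresh DB for the next episode (table/column names are illustrative)."""
    with sqlite3.connect(old_db) as src:
        rows = src.execute("SELECT content FROM scratchpad").fetchall()
    with sqlite3.connect(new_db) as dst:
        dst.execute("CREATE TABLE IF NOT EXISTS scratchpad (content TEXT)")
        dst.executemany("INSERT INTO scratchpad (content) VALUES (?)", rows)

def episode_db_path(base: Path, episode: int, max_episodes: int) -> Path:
    """Per-episode DB files (*.ep1.db, *.ep2.db, ...) only when
    max_episodes > 1, keeping the single-episode default unchanged."""
    if max_episodes <= 1:
        return base
    return base.with_suffix(f".ep{episode}.db")
```

Keeping each episode in its own DB file is what lets the existing plotting and post-hoc analysis scripts run unchanged against any single episode.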
alckasoc
3d20bee609 client trust and system design docs 2026-03-10 14:24:13 -07:00
alckasoc
d28ccb1bb2 Merge upstream/main: greedy baseline fix + additive skill boost
Resolved conflicts — combined best of both:
- bot_runner.py: kept our trust-aware candidate building + upstream's tier-avg rates + no task cap
- task_complete.py: upstream's additive skill boost (nerfs greedy snowball) + our configurable cap (wc.skill_rate_max instead of hardcoded 10)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 17:39:58 -07:00
alckasoc
11f4b89144 Add multi-strategy client trust system with tiers, specialties, and idle-turn fix
- Hide exact reward_multiplier from agent; show tier (Standard/Premium/Enterprise) and specialty domains instead
- Add client domain specialization with 70% bias on task generation toward client specialties
- Remove qty_scale by multiplier (leaked info and doubly punished high-mult clients)
- Rewrite agent prompt to describe tiers/specialties without exact formulas
- Fix critical loop.py bug: provide full state context after sim resume (prevents idle multi-month skips)
- Add Streamlit dashboard, watch scripts, and updated plotting/extraction

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 17:37:49 -07:00
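The 70% specialty bias on task generation mentioned above could look something like this sketch. Function and parameter names here are hypothetical, not taken from the repo.

```python
import random

def sample_task_domain(all_domains, client_specialties, bias=0.7, rng=random):
    """Pick a task domain for a client: with probability `bias`, draw from
    the client's specialty domains; otherwise draw uniformly from all domains.
    (Illustrative sketch; names are not from the actual codebase.)"""
    if client_specialties and rng.random() < bias:
        return rng.choice(list(client_specialties))
    return rng.choice(list(all_domains))
```

Because the agent only sees tiers and specialties (not the exact multiplier), a bias like this rewards routing the right employees to a client's specialty domains without leaking the underlying formula.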
Muyu He
ec104d57aa Fixed greedy baseline and lowered min val of employee skills 2026-03-09 15:18:49 -07:00
alckasoc
86eabf6697 init 2026-03-08 17:40:10 -07:00
Muyu He
8c949db160 Fixed task difficulty with base reward & deadline change 2026-03-06 18:08:11 -08:00
Muyu He
99e69190ec Calibrated domain prestige bump 2026-03-06 14:40:45 -08:00
Muyu He
5671e0102f Calibrated task difficulty based on deadlines 2026-03-06 11:18:22 -08:00
Muyu He
eb18c5a90c Updated backend to calculate employee tier with spiky skill distribution; simplified domain count to 4 2026-03-05 18:12:48 -08:00
adit jain
f25a2be1e4 Add live terminal dashboard with Rich
Replace scrolling LiteLLM debug logs with an in-place Rich Live dashboard
that shows key metrics after each turn: funds sparkline, task progress bars
with colored domain labels, team skill bars, runway urgency, and more.

- New: src/yc_bench/runner/dashboard.py (BenchmarkDashboard, DashboardState)
- Add on_turn/on_turn_start callbacks to agent loop
- Auto-detect TTY, redirect all logging to logs/debug.log when live
- Add --no-live flag to disable dashboard and get old log output
- Use alternate screen buffer (screen=True) for clean rendering
- Fix start.sh: clean up stale temp files before mktemp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 22:13:32 -08:00
adit jain
db7d9f218a Add db/ source files that were blocked by overly broad gitignore
The old `db/` pattern in .gitignore matched src/yc_bench/db/ too,
preventing all ORM models and session.py from being committed.
Previous commit fixed .gitignore to `/db/`; this adds the 10 missing files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 21:19:45 -08:00
adit jain
a11b2828a9 Fix fresh install: add missing __init__.py and fix .gitignore
Fresh clones failed with ModuleNotFoundError because agent/, db/,
runner/, and services/ subpackages had no __init__.py. Also anchor
/db/ and /logs/ in .gitignore so they don't match src/yc_bench/db/.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 21:15:17 -08:00
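The anchoring fix from the two commits above comes down to one character: an unanchored `db/` pattern in `.gitignore` matches a directory named `db` at any depth, while a leading slash restricts the match to the repository root. A sketch of the before/after:

```gitignore
# Before: matches any db/ or logs/, including src/yc_bench/db/
db/
logs/

# After: anchored, so only the top-level db/ and logs/ are ignored
/db/
/logs/
```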
adit jain
75a53de25c Add interactive quickstart: yc-bench start and one-line start.sh
3-step interactive flow: pick difficulty (with custom preset builder),
choose model from curated list (Claude, GPT, Gemini, DeepSeek, etc.),
enter API key (auto-detected by prefix). Single curl command to get started:
curl -sSL https://raw.githubusercontent.com/.../start.sh | bash

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 21:10:56 -08:00
adit jain
5d2962073d Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results
Bug fixes:
- CLI --horizon-years defaulted to 3, silently overriding config presets.
  Now defaults to None so config value (1yr for medium/hard/nightmare) is used.
- Runtime passed a single api_key kwarg regardless of provider, breaking
  Gemini. Now lets LiteLLM resolve keys from provider-specific env vars.
- Removed temperature+top_p from LLM calls (Anthropic rejects both together).
- DB and result filenames now include config name to prevent cross-config collisions.

Benchmark results (1yr horizon, 3 seeds each):
  Sonnet 4.6: medium 2/3, hard 0/3, nightmare 1/3
  Gemini Flash: medium 3/3, hard 1/3, nightmare 1/3
  Gemini has higher win rates (93-98% vs 40-83% on medium).
  Sonnet's ceiling is higher when it survives (nightmare $10.1M vs $478K).

New scripts: plot_comparison.py, plot_sonnet_results.py, notepad_gif.py
Updated README with detailed comparison tables and failure analysis.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 00:31:00 -08:00
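The `--horizon-years` fix above is the standard argparse pattern of defaulting to `None` so that "flag not given" is distinguishable from any real value. A minimal sketch, with names assumed rather than taken from the repo:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Default to None so an unset flag never overrides the config preset.
    parser.add_argument("--horizon-years", type=int, default=None)
    return parser

def resolve_horizon(cli_value, config_value):
    """The CLI wins only when explicitly provided; otherwise the
    config preset (e.g. 1yr for medium/hard/nightmare) applies."""
    return cli_value if cli_value is not None else config_value
```

With a concrete default like `3`, argparse cannot tell an explicit `--horizon-years 3` from an omitted flag, which is exactly how the preset was being silently overridden.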
adit jain
d1d7bc97b5 Add 5-level difficulty gradient: tutorial → easy → medium → hard → nightmare
Each config is 1-year, no turn limit, testing progressively deeper
understanding of the simulation dynamics:

- tutorial: basic loop (accept→assign→dispatch→resume)
- easy: throughput awareness (rate/N dilution kills parallelism)
- medium: prestige strategy (must specialise 2-3 domains to unlock market)
- hard: ETA computation (one bad accept degrades in-flight tasks)
- nightmare: sustained perfection (5.4mo runway, must reach prestige 5 or die)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 19:33:55 -08:00
adit jain
3a1c562827 Initial commit 2026-02-25 02:16:35 -08:00