When an agent goes bankrupt, the simulation can now restart for another
episode while preserving the scratchpad from the previous attempt. This
lets us measure whether LLMs can learn from failure via persistent notes.
Each episode gets its own SQLite DB (*.ep1.db, *.ep2.db, ...) so plotting
scripts and post-hoc analysis work unchanged. The rollout JSON aggregates
per-episode transcripts, turns, and costs.
Key changes:
- --max-episodes CLI flag (default 1, fully backward compatible)
- Per-episode DB files when max_episodes > 1
- Scratchpad read from old DB, written into fresh DB between episodes
- RunState tracks episode results with finish_episode/reset_for_new_episode
- Agent prompt states the episode number and tells the agent to read its scratchpad
- Plotting script for multi-episode fund curves + scratchpad evolution
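
A minimal sketch of the per-episode DB naming and scratchpad carry-over
described above. The function names and the `scratchpad` table with a single
`content` column are assumptions for illustration, not the repo's actual schema.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def episode_db_path(base: Path, episode: int, max_episodes: int) -> Path:
    """Keep the original path for single-episode runs; else add .epN before the suffix."""
    if max_episodes <= 1:
        return base
    return base.with_name(f"{base.stem}.ep{episode}{base.suffix}")


def carry_scratchpad(old_db: Path, new_db: Path) -> str:
    """Copy scratchpad text from the previous episode's DB into a fresh DB.

    Assumes a hypothetical table `scratchpad(content TEXT)` in the old DB.
    """
    with closing(sqlite3.connect(old_db)) as src:
        row = src.execute("SELECT content FROM scratchpad LIMIT 1").fetchone()
    content = row[0] if row else ""

    with closing(sqlite3.connect(new_db)) as dst:
        dst.execute("CREATE TABLE IF NOT EXISTS scratchpad (content TEXT NOT NULL)")
        dst.execute("DELETE FROM scratchpad")
        dst.execute("INSERT INTO scratchpad (content) VALUES (?)", (content,))
        dst.commit()
    return content
```

Returning the base path untouched when max_episodes is 1 is what would keep
single-episode runs backward compatible in this sketch.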
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolved conflicts by combining the best of both sides:
- bot_runner.py: kept our trust-aware candidate building + upstream's tier-avg rates + no task cap
- task_complete.py: upstream's additive skill boost (nerfs greedy snowball) + our configurable cap (wc.skill_rate_max instead of hardcoded 10)
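
A rough sketch of an additive skill boost with a configurable ceiling. The
function name and signature are hypothetical; only the additive rule and the
`skill_rate_max` cap come from the change above.

```python
def apply_skill_boost(rate: float, boost: float, skill_rate_max: float) -> float:
    """Add a fixed increment to the skill rate, clamped to a configurable cap.

    `skill_rate_max` stands in for wc.skill_rate_max; the caller is assumed
    to pass it in from config instead of hardcoding 10.
    """
    return min(rate + boost, skill_rate_max)


# Repeated completions grow the rate linearly until the cap is reached.
rate = 8.0
for _ in range(5):
    rate = apply_skill_boost(rate, boost=0.75, skill_rate_max=10.0)
```

Because each completion adds a fixed amount, growth stays linear, which is the
general reason an additive rule limits runaway compounding.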
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Hide exact reward_multiplier from agent; show tier (Standard/Premium/Enterprise) and specialty domains instead
- Add client domain specialization with 70% bias on task generation toward client specialties
- Remove qty_scale by multiplier (it leaked the hidden multiplier and doubly penalized high-multiplier clients)
- Rewrite agent prompt to describe tiers/specialties without exact formulas
- Fix critical loop.py bug: provide full state context after sim resume (prevents idle multi-month skips)
- Add Streamlit dashboard, watch scripts, and updated plotting/extraction
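
One way the 70% specialty bias could look, as a sketch only. The function
name, the domain names in the usage lines, and the fallback of sampling
uniformly from all domains are assumptions, not the actual generator.

```python
import random
from typing import Sequence


def pick_task_domain(
    rng: random.Random,
    specialties: Sequence[str],
    all_domains: Sequence[str],
    bias: float = 0.7,
) -> str:
    """With probability `bias`, draw from the client's specialties.

    Otherwise (or when the client has no specialties) draw uniformly from
    every domain, which may still land on a specialty by chance.
    """
    if specialties and rng.random() < bias:
        return rng.choice(list(specialties))
    return rng.choice(list(all_domains))


# Hypothetical domain names, for illustration only.
domains = ["frontend", "backend", "ml", "infra"]
rng = random.Random(0)
sample = [pick_task_domain(rng, ["ml"], domains) for _ in range(10)]
```

Under this fallback the observed share of specialty tasks ends up somewhat
above 70%, since the uniform draw can also pick a specialty domain.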
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comprehensive documentation covering all major subsystems:
simulation engine, data models, task system, prestige, finances,
employees, agent layer, CLI interface, configuration, and runner.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Updated both plot_comparison.py and plot_prestige_radar.py to show only
the greedy bot baseline, now labeled "Human Devised Rule". Regenerated
both plots.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document the new Rich terminal dashboard with ASCII mockup,
feature list, and --no-live flag usage.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace scrolling LiteLLM debug logs with an in-place Rich Live dashboard
that shows key metrics after each turn: funds sparkline, task progress bars
with colored domain labels, team skill bars, runway urgency, and more.
- New: src/yc_bench/runner/dashboard.py (BenchmarkDashboard, DashboardState)
- Add on_turn/on_turn_start callbacks to agent loop
- Auto-detect TTY, redirect all logging to logs/debug.log when live
- Add --no-live flag to disable dashboard and get old log output
- Use alternate screen buffer (screen=True) for clean rendering
- Fix start.sh: clean up stale temp files before mktemp
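
A stdlib-only sketch of the TTY auto-detection and log redirection described
above. The helper names are hypothetical, and checking stdout (instead of some
other stream) is an assumption; the Rich Live rendering itself is not shown.

```python
import logging
import sys
from pathlib import Path
from typing import Optional, TextIO


def should_use_live(no_live: bool, stream: Optional[TextIO] = None) -> bool:
    """Enable the live dashboard only on a TTY and when --no-live is not set."""
    stream = sys.stdout if stream is None else stream
    return not no_live and stream.isatty()


def configure_logging(
    live: bool,
    log_path: Path = Path("logs/debug.log"),
    logger: Optional[logging.Logger] = None,
) -> logging.Handler:
    """Send logs to a file while the dashboard owns the screen, else to stderr."""
    logger = logging.getLogger() if logger is None else logger
    for old in list(logger.handlers):
        logger.removeHandler(old)
    if live:
        log_path.parent.mkdir(parents=True, exist_ok=True)
        handler: logging.Handler = logging.FileHandler(log_path)
    else:
        handler = logging.StreamHandler()
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    return handler
```

Routing every handler to the file in live mode is what keeps debug output from
corrupting an in-place display in this sketch.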
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When run as `curl ... | bash`, stdin is the pipe, so Rich prompts
abort immediately. The script now detects a non-TTY stdin, re-downloads
itself to a temp file, and exec's that copy so stdin becomes the terminal again.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When run via `curl ... | bash`, stdin is the pipe, not the terminal,
which causes interactive prompts to abort immediately. Redirecting
stdin with </dev/tty restores terminal input.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>