yc-bench/scripts
adit jain 01535c2042 Add multi-episode setting with scratchpad carryover between bankruptcies
When an agent goes bankrupt, the simulation can now restart for another
episode while preserving the scratchpad from the previous attempt. This
lets us measure whether LLMs can learn from failure via persistent notes.

Each episode gets its own SQLite DB (*.ep1.db, *.ep2.db, ...) so plotting
scripts and post-hoc analysis work unchanged. The rollout JSON aggregates
per-episode transcripts, turns, and costs.
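
A minimal sketch of the per-episode naming described above, assuming a hypothetical helper episode_db_path (the repository's actual function name and signature are not shown here). It also assumes that a single-episode run keeps the original DB path, which matches the "fully backward compatible" note but is not confirmed by the source.

```python
from pathlib import Path


def episode_db_path(base_db: Path, episode: int, max_episodes: int) -> Path:
    """Return the SQLite path for a given 1-based episode number.

    Hypothetical helper: with max_episodes <= 1 the base path is returned
    unchanged (assumed backward-compatible behavior); otherwise ".epN" is
    inserted before the ".db" extension, e.g. run.db -> run.ep2.db.
    """
    if max_episodes <= 1:
        return base_db
    return base_db.with_name(f"{base_db.stem}.ep{episode}.db")


if __name__ == "__main__":
    for ep in (1, 2, 3):
        print(episode_db_path(Path("results/run.db"), ep, max_episodes=3))
```

Keeping one file per episode means any tool that accepts a single DB path can be pointed at an individual episode without modification.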

Key changes:
- --max-episodes CLI flag (default 1, fully backward compatible)
- Per-episode DB files when max_episodes > 1
- Scratchpad read from old DB, written into fresh DB between episodes
- RunState tracks episode results with finish_episode/reset_for_new_episode
- Agent prompt tells it about the episode number and to read its scratchpad
- Plotting script for multi-episode fund curves + scratchpad evolution
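
The scratchpad carryover in the list above could look roughly like the following sketch. The table name scratchpad, its single-row layout, and the function names are assumptions made for illustration; the real schema and the RunState methods in yc-bench may differ.

```python
import sqlite3
from pathlib import Path

# Hypothetical schema: one row holding the agent's free-form notes.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS scratchpad ("
    "id INTEGER PRIMARY KEY CHECK (id = 1), content TEXT NOT NULL)"
)


def read_scratchpad(db_path: Path) -> str:
    """Return the stored notes, or an empty string if none exist yet."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:
            conn.execute(SCHEMA)
        row = conn.execute(
            "SELECT content FROM scratchpad WHERE id = 1"
        ).fetchone()
        return row[0] if row else ""
    finally:
        conn.close()


def write_scratchpad(db_path: Path, content: str) -> None:
    """Create or overwrite the single scratchpad row."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(SCHEMA)
            conn.execute(
                "INSERT OR REPLACE INTO scratchpad (id, content) VALUES (1, ?)",
                (content,),
            )
    finally:
        conn.close()


def carry_over_scratchpad(old_db: Path, new_db: Path) -> str:
    """Copy notes from the finished episode's DB into the fresh one."""
    notes = read_scratchpad(old_db)
    write_scratchpad(new_db, notes)
    return notes
```

Only the notes cross the episode boundary in this sketch; every other table in the fresh DB starts empty, so the new episode begins from a clean simulation state.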

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 19:22:32 -07:00
bot_runner.py Fixed task difficulty with base reward & deadline change 2026-03-06 18:08:11 -08:00
greedy_bot.py Add Collinear branding, bot runners, and clean up stale plots 2026-02-26 21:12:05 -08:00
notepad_gif.py Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results 2026-02-26 00:31:00 -08:00
plot_comparison.py Rename Greedy Bot to Human Devised Rule, remove other bot baselines from plots 2026-02-27 14:03:04 -08:00
plot_multi_episode.py Add multi-episode setting with scratchpad carryover between bankruptcies 2026-03-11 19:22:32 -07:00
plot_multi_model.py Fixed task difficulty with base reward & deadline change 2026-03-06 18:08:11 -08:00
plot_prestige_radar.py Updated backend to calculate employee tier with spiky skill distribution; simplified domain count to 4 2026-03-05 18:12:48 -08:00
plot_run.py Updated backend to calculate employee tier with spiky skill distribution; simplified domain count to 4 2026-03-05 18:12:48 -08:00
plot_sonnet_results.py Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results 2026-02-26 00:31:00 -08:00
run_benchmark.sh Calibrated domain prestige bump 2026-03-06 14:40:45 -08:00