Replace scrolling LiteLLM debug logs with an in-place Rich Live dashboard
that shows key metrics after each turn: funds sparkline, task progress bars
with colored domain labels, team skill bars, runway urgency, and more.
- New: src/yc_bench/runner/dashboard.py (BenchmarkDashboard, DashboardState)
- Add on_turn/on_turn_start callbacks to agent loop
- Auto-detect TTY, redirect all logging to logs/debug.log when live
- Add --no-live flag to disable dashboard and get old log output
- Use alternate screen buffer (screen=True) for clean rendering (see the sketch after this list)
- Fix start.sh: clean up stale temp files before mktemp
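A minimal sketch of the live-rendering path, assuming a dict-shaped turn
state and hypothetical render()/run_dashboard() helpers (the real code is
BenchmarkDashboard/DashboardState in dashboard.py); the TTY check is the
same path --no-live forces:

    import logging
    import sys

    from rich.console import Console
    from rich.live import Live
    from rich.table import Table

    def render(state: dict) -> Table:
        # One frame of the dashboard, rebuilt from scratch each turn.
        table = Table(title=f"Turn {state['turn']}")
        table.add_column("Metric")
        table.add_column("Value")
        table.add_row("Funds", f"${state['funds']:,.0f}")
        table.add_row("Runway", f"{state['runway_months']} months")
        return table

    def run_dashboard(turns):
        console = Console()
        if not sys.stdout.isatty():
            # No TTY (piped output, CI): keep the plain scrolling output.
            for state in turns:
                console.print(render(state))
            return
        # Route all logging to a file so nothing scrolls over the dashboard.
        logging.basicConfig(filename="logs/debug.log",
                            level=logging.DEBUG, force=True)
        # screen=True draws on the alternate screen buffer and restores the
        # terminal on exit, leaving scrollback untouched.
        with Live(console=console, screen=True, refresh_per_second=4) as live:
            for state in turns:  # one state per turn, via the on_turn callback
                live.update(render(state))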
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a 3-step interactive setup flow: pick a difficulty (with a custom preset
builder), choose a model from a curated list (Claude, GPT, Gemini, DeepSeek,
etc.), and enter an API key (provider auto-detected from the key prefix).
Single curl command to get started:
curl -sSL https://raw.githubusercontent.com/.../start.sh | bash
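The prefix detection is a small ordered lookup; a sketch with an
illustrative prefix table and a hypothetical detect_provider helper (the
real script may map more providers):

    # Order matters: the more specific "sk-ant-" must match before "sk-".
    PREFIXES = {
        "sk-ant-": "anthropic",
        "AIza": "gemini",
        "sk-": "openai",
    }

    def detect_provider(api_key: str) -> str | None:
        for prefix, provider in PREFIXES.items():  # dicts keep insertion order
            if api_key.startswith(prefix):
                return provider
        return None  # unknown prefix: ask the user instead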
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bug fixes:
- CLI --horizon-years defaulted to 3, silently overriding config presets.
Now defaults to None so config value (1yr for medium/hard/nightmare) is used.
- Runtime passed a single api_key kwarg regardless of provider, breaking
Gemini. Now lets LiteLLM resolve keys from provider-specific env vars
(see the sketch after this list).
- Removed the temperature+top_p combination from LLM calls (Anthropic
rejects requests that set both).
- DB and result filenames now include config name to prevent cross-config collisions.
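A sketch of the corrected call path (assumed shape, not the exact runtime
code): no api_key kwarg, so LiteLLM resolves the provider-specific env var
itself, and top_p is omitted entirely:

    import litellm

    def complete(model: str, messages: list[dict], temperature: float = 0.7):
        # No api_key kwarg: LiteLLM reads ANTHROPIC_API_KEY, GEMINI_API_KEY,
        # OPENAI_API_KEY, etc. based on the model's provider prefix.
        return litellm.completion(
            model=model,  # e.g. "gemini/gemini-2.5-flash"
            messages=messages,
            temperature=temperature,
            # top_p intentionally left out: Anthropic rejects requests
            # that set both temperature and top_p.
        )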
Benchmark results (1yr horizon, 3 seeds each):

                medium  hard  nightmare
  Sonnet 4.6      2/3   0/3      1/3
  Gemini Flash    3/3   1/3      1/3
Gemini has higher win rates (93-98% vs 40-83% on medium).
Sonnet's ceiling is higher when it survives (nightmare $10.1M vs $478K).
New scripts: plot_comparison.py, plot_sonnet_results.py, notepad_gif.py
Updated README with detailed comparison tables and failure analysis.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>