mirror of
https://github.com/collinear-ai/yc-bench.git
synced 2026-04-27 17:23:15 +00:00
Bug fixes:
- CLI --horizon-years defaulted to 3, silently overriding config presets. Now defaults to None so config value (1yr for medium/hard/nightmare) is used.
- Runtime passed a single api_key kwarg regardless of provider, breaking Gemini. Now lets LiteLLM resolve keys from provider-specific env vars.
- Removed temperature+top_p from LLM calls (Anthropic rejects both together).
- DB and result filenames now include config name to prevent cross-config collisions.

Benchmark results (1yr horizon, 3 seeds each):
  Sonnet 4.6: medium 2/3, hard 0/3, nightmare 1/3
  Gemini Flash: medium 3/3, hard 1/3, nightmare 1/3

Gemini has higher win rates (93-98% vs 40-83% on medium). Sonnet's ceiling is higher when it survives (nightmare $10.1M vs $478K).

New scripts: plot_comparison.py, plot_sonnet_results.py, notepad_gif.py
Updated README with detailed comparison tables and failure analysis.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
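The --horizon-years fix described in the commit message can be illustrated with a minimal sketch. This is not the repository's actual code: the preset table, the --config flag, and the function names build_parser and resolve_horizon_years are all hypothetical, chosen only to show how a None default lets a config preset take effect unless the flag is passed explicitly.

```python
import argparse

# Hypothetical preset table (assumption). The commit message only states that
# medium/hard/nightmare use a 1-year horizon; the real values live in the repo.
PRESET_HORIZON_YEARS = {"medium": 1, "hard": 1, "nightmare": 1}


def build_parser():
    """Build a parser whose --horizon-years flag has no hard-coded default."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="medium")  # hypothetical flag name
    # default=None means "not given", so the preset is not silently overridden.
    parser.add_argument("--horizon-years", type=int, default=None)
    return parser


def resolve_horizon_years(args):
    """Prefer an explicit CLI value, otherwise fall back to the config preset."""
    if args.horizon_years is not None:
        return args.horizon_years
    return PRESET_HORIZON_YEARS[args.config]
```

With a default of 3 instead of None, the fallback branch would never run, which matches the behaviour the commit message reports as a bug.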
14 lines
No EOL
443 B
JSON
{
  "session_id": "run-1-openrouter/google/gemini-2.5-flash-preview",
  "model": "openrouter/google/gemini-2.5-flash-preview",
  "seed": 1,
  "horizon_years": 3,
  "turns_completed": 0,
  "terminal": true,
  "terminal_reason": "error",
  "terminal_detail": "Failed to run turn after 3 attempts",
  "total_cost_usd": 0.0,
  "started_at": "2026-02-25T08:41:49.559479+00:00",
  "ended_at": "2026-02-25T08:41:53.002014+00:00",
  "transcript": []
}
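A result record like the one above can be checked programmatically. The following is a rough sketch, assuming only the field names visible in this file; the summarize helper and its output keys are hypothetical and not part of yc-bench.

```python
import json
from datetime import datetime

# The record is embedded here so the sketch is self-contained; the values are
# copied from the result file above.
RECORD_JSON = """
{
  "session_id": "run-1-openrouter/google/gemini-2.5-flash-preview",
  "model": "openrouter/google/gemini-2.5-flash-preview",
  "seed": 1,
  "horizon_years": 3,
  "turns_completed": 0,
  "terminal": true,
  "terminal_reason": "error",
  "terminal_detail": "Failed to run turn after 3 attempts",
  "total_cost_usd": 0.0,
  "started_at": "2026-02-25T08:41:49.559479+00:00",
  "ended_at": "2026-02-25T08:41:53.002014+00:00",
  "transcript": []
}
"""


def summarize(record):
    """Return a small summary of one run record (hypothetical helper)."""
    started = datetime.fromisoformat(record["started_at"])
    ended = datetime.fromisoformat(record["ended_at"])
    return {
        # True when the run errored out before completing a single turn.
        "failed_before_first_turn": (
            record["terminal_reason"] == "error"
            and record["turns_completed"] == 0
        ),
        "duration_seconds": (ended - started).total_seconds(),
    }


summary = summarize(json.loads(RECORD_JSON))
```

For this record the summary flags a run that ended in an error with no completed turns and an empty transcript, so it carries no benchmark signal.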