yc-bench/results/yc_bench_result_1_openrouter_liquid_lfm-2.5-1.2b-thinking:free.json
adit jain 5d2962073d Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results
Bug fixes:
- CLI --horizon-years defaulted to 3, silently overriding config presets.
  Now defaults to None so config value (1yr for medium/hard/nightmare) is used.
- Runtime passed a single api_key kwarg regardless of provider, breaking
  Gemini. Now lets LiteLLM resolve keys from provider-specific env vars.
- Removed temperature+top_p from LLM calls (Anthropic rejects both together).
- DB and result filenames now include config name to prevent cross-config collisions.
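
The horizon fix above can be sketched as follows. This is a minimal illustration, not the project's actual code: the preset table CONFIG_HORIZON_YEARS, the --config flag, and the helper names are assumptions. Only the --horizon-years flag, its new default of None, and the 1-year preset value for medium/hard/nightmare come from the commit message.

```python
import argparse

# Hypothetical preset table. The 1-year values for medium/hard/nightmare
# are taken from the commit message; the dict itself is an assumption.
CONFIG_HORIZON_YEARS = {"medium": 1, "hard": 1, "nightmare": 1}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Hypothetical flag used here only to select a preset.
    parser.add_argument("--config", default="medium")
    # Defaulting to None (instead of 3) lets us tell "flag not given"
    # apart from "user explicitly asked for 3 years".
    parser.add_argument("--horizon-years", type=int, default=None)
    return parser


def resolve_horizon(args: argparse.Namespace) -> int:
    """Return the explicit CLI value if given, else the config preset."""
    if args.horizon_years is not None:
        return args.horizon_years
    return CONFIG_HORIZON_YEARS[args.config]
```

With this shape, resolve_horizon(build_parser().parse_args(["--config", "hard"])) falls back to the preset, while passing --horizon-years 3 still overrides it.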

Benchmark results (1yr horizon, 3 seeds each):
  Sonnet 4.6: medium 2/3, hard 0/3, nightmare 1/3
  Gemini Flash: medium 3/3, hard 1/3, nightmare 1/3
  Gemini Flash has higher win rates on medium (93-98% vs 40-83% for Sonnet).
  Sonnet's ceiling is higher when it survives (nightmare: $10.1M vs $478K for Gemini).

New scripts: plot_comparison.py, plot_sonnet_results.py, notepad_gif.py
Updated README with detailed comparison tables and failure analysis.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 00:31:00 -08:00


{
  "session_id": "run-1-openrouter/liquid/lfm-2.5-1.2b-thinking:free",
  "model": "openrouter/liquid/lfm-2.5-1.2b-thinking:free",
  "seed": 1,
  "horizon_years": 3,
  "turns_completed": 0,
  "terminal": true,
  "terminal_reason": "error",
  "terminal_detail": "Failed to run turn after 3 attempts",
  "started_at": "2026-02-24T22:33:23.091285+00:00",
  "ended_at": "2026-02-24T22:33:26.459601+00:00",
  "transcript": []
}
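
A result file like this one can be screened for runs that failed before completing a turn. The sketch below is a hedged example, not part of the repository: the function name summarize_result and the shape of the returned summary are assumptions, and it relies only on the field names visible in the JSON above.

```python
import json
from datetime import datetime


def summarize_result(raw: str) -> dict:
    """Summarize one result file (hypothetical helper, not from the repo)."""
    result = json.loads(raw)
    # Timestamps in the file are ISO 8601 with a UTC offset.
    started = datetime.fromisoformat(result["started_at"])
    ended = datetime.fromisoformat(result["ended_at"])
    failed = result["terminal"] and result["terminal_reason"] == "error"
    return {
        "model": result["model"],
        "seed": result["seed"],
        # True when the run errored out without completing any turn.
        "failed_before_first_turn": failed and result["turns_completed"] == 0,
        "duration_seconds": (ended - started).total_seconds(),
    }


# Sample input: a subset of the fields from the result file shown above.
SAMPLE = """{
  "model": "openrouter/liquid/lfm-2.5-1.2b-thinking:free",
  "seed": 1,
  "turns_completed": 0,
  "terminal": true,
  "terminal_reason": "error",
  "started_at": "2026-02-24T22:33:23.091285+00:00",
  "ended_at": "2026-02-24T22:33:26.459601+00:00"
}"""

summary = summarize_result(SAMPLE)
```

For this file the summary flags the run as failed before its first turn, with a wall-clock duration of a little over three seconds.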