yc-bench

mirror of https://github.com/collinear-ai/yc-bench.git synced 2026-04-19 12:58:03 +00:00

adit jain 01535c2042		2026-03-11 19:22:32 -07:00

Add multi-episode setting with scratchpad carryover between bankruptcies

When an agent goes bankrupt, the simulation can now restart for another episode while preserving the scratchpad from the previous attempt. This lets us measure whether LLMs can learn from failure via persistent notes.

Each episode gets its own SQLite DB (.ep1.db, .ep2.db, ...) so plotting scripts and post-hoc analysis work unchanged. The rollout JSON aggregates per-episode transcripts, turns, and costs.

Key changes:
- --max-episodes CLI flag (default 1, fully backward compatible)
- Per-episode DB files when max_episodes > 1
- Scratchpad read from old DB, written into fresh DB between episodes
- RunState tracks episode results with finish_episode/reset_for_new_episode
- Agent prompt tells it about the episode number and to read its scratchpad
- Plotting script for multi-episode fund curves + scratchpad evolution

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
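
The scratchpad carryover described in this commit can be illustrated with a minimal sketch. It is not the repository's code: the `scratchpad` table layout, the helper names (`episode_db_path`, `open_db`, `carry_over_scratchpad`), and the `run.db` base filename are all assumptions made for illustration. Only the `.ep1.db`, `.ep2.db` suffix pattern, the single-DB behaviour when `max_episodes` is 1, and the read-from-old, write-into-fresh step come from the commit message.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def episode_db_path(base: Path, episode: int, max_episodes: int) -> Path:
    """Return the DB path for an episode (hypothetical naming helper).

    With a single episode the base path is used unchanged, so existing
    tooling keeps working; otherwise each episode gets `.epN.db`.
    """
    if max_episodes <= 1:
        return base
    return base.with_name(f"{base.stem}.ep{episode}.db")


def open_db(path: Path) -> sqlite3.Connection:
    """Open a DB and make sure the (assumed) one-row scratchpad table exists."""
    conn = sqlite3.connect(str(path))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scratchpad ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), content TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def read_scratchpad(conn: sqlite3.Connection) -> str:
    """Return the stored notes, or an empty string if none were written."""
    row = conn.execute("SELECT content FROM scratchpad WHERE id = 1").fetchone()
    return row[0] if row else ""


def write_scratchpad(conn: sqlite3.Connection, content: str) -> None:
    """Overwrite the single scratchpad row."""
    conn.execute(
        "INSERT OR REPLACE INTO scratchpad (id, content) VALUES (1, ?)",
        (content,),
    )
    conn.commit()


def carry_over_scratchpad(old_db: Path, new_db: Path) -> str:
    """Copy the scratchpad from a finished episode's DB into a fresh one."""
    with closing(open_db(old_db)) as old_conn:
        notes = read_scratchpad(old_conn)
    with closing(open_db(new_db)) as new_conn:
        write_scratchpad(new_conn, notes)
    return notes
```

In this sketch only the scratchpad row crosses the episode boundary, which matches the commit's stated goal of measuring whether persistent notes alone help after a bankruptcy.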
yc_bench_result_1_openrouter_google_gemini-2.5-flash-preview.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_1_openrouter_google_gemini-3-flash-preview.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_1_openrouter_google_gemini-flash-1.5.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_1_openrouter_liquid_lfm-2.5-1.2b-thinking:free.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_1_openrouter_minimax_minimax-m2.5.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_1_openrouter_moonshotai_kimi-k2.5.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_1_openrouter_nvidia_nemotron-3-nano-30b-a3b:free.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_1_openrouter_openai_gpt-4o-mini.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_1_openrouter_openai_gpt-5.2-pro.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_1_openrouter_x-ai_grok-4.1-fast.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_1_openrouter_z-ai_glm-5.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_fast_test_1_openai_gpt-4.1-mini.json	Calibrated task difficulty based on deadlines	2026-03-06 11:18:22 -08:00
yc_bench_result_hard_1_anthropic_claude-sonnet-4-6.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_hard_1_gemini_gemini-3-flash-preview.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_hard_1_openai_gpt-5.2.json	Added the configs and updated the results.	2026-02-26 13:37:58 -08:00
yc_bench_result_hard_1_openai_gpt-5.4.json	Updated initial eval on new backend	2026-03-06 18:49:32 -08:00
yc_bench_result_hard_1_openrouter_anthropic_claude-haiku-4-5.json	Add multi-episode setting with scratchpad carryover between bankruptcies	2026-03-11 19:22:32 -07:00
yc_bench_result_hard_2_anthropic_claude-sonnet-4-6.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_hard_2_gemini_gemini-3-flash-preview.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_hard_2_openai_gpt-5.2.json	Added the configs and updated the results.	2026-02-26 13:37:58 -08:00
yc_bench_result_hard_3_anthropic_claude-sonnet-4-6.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_hard_3_gemini_gemini-3-flash-preview.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_medium_1_gemini_gemini-3-flash-preview.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_medium_1_openai_gpt-5.2.json	Added the configs and updated the results.	2026-02-26 13:37:58 -08:00
yc_bench_result_medium_1_openai_gpt-5.4.json	Updated initial eval on new backend	2026-03-06 18:49:32 -08:00
yc_bench_result_medium_2_gemini_gemini-3-flash-preview.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_medium_2_openai_gpt-5.2.json	Added the configs and updated the results.	2026-02-26 13:37:58 -08:00
yc_bench_result_medium_3_anthropic_claude-sonnet-4-6.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_medium_3_gemini_gemini-3-flash-preview.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_medium_3_openai_gpt-5.2.json	Added the configs and updated the results.	2026-02-26 13:37:58 -08:00
yc_bench_result_nightmare_1_anthropic_claude-sonnet-4-6.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_nightmare_1_gemini_gemini-3-flash-preview.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_nightmare_1_openai_gpt-5.2.json	Added the configs and updated the results.	2026-02-26 13:37:58 -08:00
yc_bench_result_nightmare_2_gemini_gemini-3-flash-preview.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_nightmare_2_openai_gpt-5.2.json	Added the configs and updated the results.	2026-02-26 13:37:58 -08:00
yc_bench_result_nightmare_3_anthropic_claude-sonnet-4-6.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_nightmare_3_gemini_gemini-3-flash-preview.json	Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results	2026-02-26 00:31:00 -08:00
yc_bench_result_nightmare_3_openai_gpt-5.2.json	Added the configs and updated the results.	2026-02-26 13:37:58 -08:00