Commit graph

15 commits

Author SHA1 Message Date
Muyu He
7f24589793 Light update of readme 2026-03-06 18:56:46 -08:00
Muyu He
8c949db160 Fixed task difficulty with base reward & deadline change 2026-03-06 18:08:11 -08:00
Muyu He
99e69190ec Calibrated domain prestige bump 2026-03-06 14:40:45 -08:00
Muyu He
eb18c5a90c Updated backend to calculate employee tier with spiky skill distribution; simplified domain count to 4 2026-03-05 18:12:48 -08:00
adit jain
6d6f0a855d Rename Greedy Bot to Human Devised Rule in README
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 16:21:32 -08:00
Adit Jain
89065f3487 Delete Sonnet results section from README
Removed Sonnet-only results section and associated image.
2026-02-28 02:39:53 +05:30
adit jain
e9aa362772 Add prestige radar chart comparing domain specialization across models
New radar plot (7 domains × 4 models × 3 configs × 3 seeds) shows final
prestige fingerprints. Added plot script and README section.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 12:45:04 -08:00
adit jain
95c6583053 Add live dashboard section to README
Document the new Rich terminal dashboard with ASCII mockup,
feature list, and --no-live flag usage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 22:16:20 -08:00
AnandK27
a406d2d9f9 readme fixes 2026-02-26 22:13:32 -08:00
Adit Jain
3281eff755 Update README with citation for YC-Bench
Added citation information for the YC-Bench project.
2026-02-26 22:13:32 -08:00
Adit Jain
2b528358a7 Fix formatting in README for discrete-event simulation 2026-02-26 22:13:32 -08:00
Adit Jain
14b0179f41 Fix formatting issues in README.md 2026-02-26 22:13:32 -08:00
adit jain
3643806dce Added the configs and updated the results. 2026-02-26 13:37:58 -08:00
adit jain
5d2962073d Fix horizon bug, multi-provider support, add Sonnet vs Gemini benchmark results
Bug fixes:
- CLI --horizon-years defaulted to 3, silently overriding config presets.
  Now defaults to None so config value (1yr for medium/hard/nightmare) is used.
- Runtime passed a single api_key kwarg regardless of provider, breaking
  Gemini. Now lets LiteLLM resolve keys from provider-specific env vars.
- Removed temperature+top_p from LLM calls (Anthropic rejects both together).
- DB and result filenames now include config name to prevent cross-config collisions.

Benchmark results (1yr horizon, 3 seeds each):
  Sonnet 4.6: medium 2/3, hard 0/3, nightmare 1/3
  Gemini Flash: medium 3/3, hard 1/3, nightmare 1/3
  Gemini has higher win rates (93-98% vs 40-83% on medium).
  Sonnet's ceiling is higher when it survives (nightmare $10.1M vs $478K).

New scripts: plot_comparison.py, plot_sonnet_results.py, notepad_gif.py
Updated README with detailed comparison tables and failure analysis.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 00:31:00 -08:00
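A minimal sketch of the `--horizon-years` fix described in 5d2962073d, assuming an argparse-based CLI. Only the flag name, the old default of 3, the new default of None, and the 1-year preset for medium/hard/nightmare come from the commit message. The `--config` flag, `PRESET_HORIZON_YEARS`, `build_parser`, and `resolve_horizon_years` are hypothetical names used for illustration, not the repository's actual code.

```python
import argparse

# Hypothetical preset table. The 1-year horizon for these three configs
# is taken from the commit message; the dict itself is an assumption.
PRESET_HORIZON_YEARS = {"medium": 1, "hard": 1, "nightmare": 1}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Hypothetical flag for selecting a config preset.
    parser.add_argument("--config", default="medium")
    # default=None (rather than 3) lets us tell "flag not given"
    # apart from an explicit value.
    parser.add_argument("--horizon-years", type=int, default=None)
    return parser


def resolve_horizon_years(args: argparse.Namespace) -> int:
    # An explicit CLI value wins; otherwise fall back to the preset.
    if args.horizon_years is not None:
        return args.horizon_years
    return PRESET_HORIZON_YEARS[args.config]
```

With a default of 3, the CLI value would always be set and the preset would never be consulted, which matches the silent override the commit message reports.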
adit jain
3a1c562827 Initial commit 2026-02-25 02:16:35 -08:00