Merge pull request #7 from collinear-ai/feat/employee_tiers

Feat/employee tiers
2026-04-19 12:58:03 +00:00 · 2026-03-07 22:04:45 -08:00 · b1cd7ebfb2
commit b1cd7ebfb2
parent 542d3b9836 7f24589793
19 changed files with 7244 additions and 5336 deletions
--- a/.DS_Store
+++ b/.DS_Store
--- a/README.md
+++ b/README.md
@@ -56,8 +56,8 @@ bash scripts/run_benchmark.sh --seed 1 --config hard

 ### Core loop

-1. Agent calls `yc-bench sim resume` to advance time to the next event.
-2. The engine flushes task progress, fires due events, applies payroll.
+1. Agent calls `yc-bench sim resume` to advance time to the next event or monthly payroll.
+2. The engine flushes task progress, applies prestige decay, fires due events, applies payroll.
 3. Agent reads wake events and decides: accept tasks, assign employees, dispatch, cancel.
 4. Repeat until bankruptcy or horizon end.

@@ -65,12 +65,14 @@ The simulation ends on **bankruptcy** (funds < 0 after payroll), **horizon end**

 ### Key mechanics

- **Funds**: start at $250K. Monthly payroll is deducted automatically. Task rewards scale with prestige (`base × (1 + 0.55 × (prestige − 1))`).
+- **Funds**: starting capital varies by preset ($80K–$250K). Monthly payroll is deducted automatically. Task rewards scale with prestige (`base × (1 + scale × (prestige − 1))`).
 - **4 domains**: `research · inference · data/environment · training`. Each domain tracks prestige independently in [1.0, 10.0].
- **Prestige gating**: tasks require a minimum prestige level. Most tasks need prestige 3–5, so the agent must climb from 1.0 by completing easier tasks first. First 10 market tasks are stratified `[1,1,1,1,2,2,2,3,3,4]` to bootstrap progression.
+- **Per-domain prestige gating**: a task's required prestige is checked against **each** of its required domains. The agent must climb prestige broadly, not just in one domain.
+- **Prestige decay**: every domain loses prestige daily. Neglected domains decay back toward 1.0. The agent must stay active across domains to maintain market access.
+- **Prestige-scaled work volume**: higher-prestige tasks require proportionally more work. Higher prestige pays more but demands more capacity.
 - **Employees**: 10 employees across 3 tiers (junior/mid/senior). The agent sees only each employee's tier and salary — not their per-domain skill rates. A junior can secretly be a superstar in one domain, so the agent must infer productivity from task progress observations.
 - **Throughput splitting**: an employee assigned to N active tasks has `effective_rate = base_rate / N`. Focus beats breadth.
- **Task success**: on-time completion awards funds + prestige + skill boosts + 1% salary bump (compounding payroll pressure). Late completion penalises prestige (1.4×). Cancellation penalises harder (2.0×).
+- **Task success**: on-time completion awards funds + prestige + skill boosts + 1% salary bump (compounding payroll pressure). Late completion penalises prestige. Cancellation penalises harder.
 - **Progress checkpoints**: the agent is woken at 25%, 50%, 75%, and 100% completion — providing data points to estimate employee productivity.
 - **Scratchpad**: persistent notes in the DB that survive context truncation (only last 20 conversation rounds are kept).

@@ -92,7 +94,7 @@ yc-bench report monthly                          # P&L per month
 yc-bench task accept --task-id UUID              # pull from market
 yc-bench task assign --task-id UUID --employee-id UUID
 yc-bench task dispatch --task-id UUID            # start work
-yc-bench task cancel --task-id UUID --reason ""  # cancel (2× prestige penalty)
+yc-bench task cancel --task-id UUID --reason ""  # cancel (prestige penalty)
 yc-bench sim resume                              # advance time
 yc-bench scratchpad write/append/clear           # persistent memory
 ```
@@ -103,13 +105,15 @@ yc-bench scratchpad write/append/clear           # persistent memory

 Experiment presets live in `src/yc_bench/config/presets/` as TOML files. Pass the preset name via `--config`.

-| Config | Employees | Tasks | Tests |
-|--------|-----------|-------|-------|
-| **tutorial** | 3 | 50 | Basic accept→assign→dispatch loop |
-| **easy** | 5 | 100 | Throughput awareness |
-| **medium** | 5 | 150 | Prestige climbing + domain specialization |
-| **hard** | 7 | 200 | Precise ETA reasoning |
-| **nightmare** | 8 | 300 | Sustained perfection under compounding payroll |
+All presets use 10 employees and 200 market tasks. Difficulty comes from deadline pressure, penalty severity, prestige distribution, and task size.
+
+| Config | Deadline pressure | Prestige mode | What it tests |
+|--------|------------------|---------------|---------------|
+| **tutorial** | Very relaxed | 1 | Basic accept→assign→dispatch loop |
+| **easy** | Relaxed | 1 | Throughput awareness |
+| **medium** | Moderate | 3 | Prestige climbing + domain specialization |
+| **hard** | Tight | 4 | Precise ETA reasoning + capacity planning |
+| **nightmare** | Razor-thin | 5 | Sustained perfection under compounding payroll |

 See `default.toml` for the full list of tunable parameters.

@@ -117,44 +121,7 @@ See `default.toml` for the full list of tunable parameters.

 ## Benchmark results

-### Sonnet 4.6 vs Gemini 3 Flash vs GPT-5.2 — 1-year horizon, 3 seeds per config
-
-![3-model comparison](plots/sonnet_vs_gemini.png)
-
-#### Survival rates
-
-| Config | Sonnet 4.6 | Gemini 3 Flash | GPT-5.2 |
-|--------|-----------|----------------|---------|
-| **medium** | 3/3 | 3/3 | 3/3 |
-| **hard** | 1/3 | 2/3 | 2/3 |
-| **nightmare** | 1/3 | 3/3 | 2/3 |
-
-#### Final funds (bankrupt = funds < 0)
-
-| Config | Seed | Sonnet 4.6 | Gemini 3 Flash | GPT-5.2 |
-|--------|------|-----------|----------------|---------|
-| medium | 1 | **$9.1M** | **$9.5M** | **$1.8M** |
-| medium | 2 | **$6.1M** | **$11.0M** | **$321K** |
-| medium | 3 | **$107K** | **$15.8M** | **$28K** |
-| hard | 1 | bankrupt | bankrupt | bankrupt |
-| hard | 2 | **$63K** | **$412K** | **$15.7M** |
-| hard | 3 | bankrupt | **$21.9M** | **$43.5M** |
-| nightmare | 1 | bankrupt | **$2.1M** | bankrupt |
-| nightmare | 2 | **$10.1M** | **$214K** | **$2.2M** |
-| nightmare | 3 | bankrupt | **$805K** | **$23.6M** |
-
-**Overall: Gemini 8/9 · GPT-5.2 7/9 · Sonnet 5/9**
-
-#### Key findings
-
- **Gemini leads on consistency** (8/9 survival). The only model to sweep all 3 nightmare seeds.
- **GPT-5.2 has the highest ceiling.** Hard seed 3: $43.5M vs Gemini's $21.9M. When it survives, it tends to outperform by a wide margin.
- **Sonnet is high-variance.** Nightmare seed 2: $10.1M (best nightmare result), but 4/9 bankruptcies overall.
- **Win rate predicts survival.** Every run with >58% task win rate survived. Every run below 40% went bankrupt.
-
-#### Prestige specialization
-
-![Prestige radar](plots/prestige_radar.png)
+*Results pending — re-running benchmarks with updated economics.*

 ---
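
Note on the scaling formulas: the README now writes the reward formula with a generic `scale`, and `default.toml` uses the same shape for work volume (`qty *= 1 + scale * (prestige - 1)`). The sketch below evaluates both, assuming the default values visible in this diff (`reward_prestige_scale = 0.55`, `prestige_qty_scale = 0.3`). The function names and the rounding step are illustrative assumptions, not the engine's actual API.

```python
def prestige_multiplier(prestige: float, scale: float) -> float:
    """Shared shape of both formulas: 1 + scale * (prestige - 1)."""
    return 1.0 + scale * (prestige - 1.0)


def scaled_reward_cents(base_cents: int, prestige: float, scale: float = 0.55) -> int:
    # Rounding to whole cents is an assumption; the engine may truncate instead.
    return round(base_cents * prestige_multiplier(prestige, scale))


def scaled_required_qty(base_qty: int, prestige: float, scale: float = 0.3) -> int:
    # Same caveat: the rounding here is illustrative.
    return round(base_qty * prestige_multiplier(prestige, scale))
```

Under these assumptions a prestige-5 task needs 2.2x the work of a prestige-1 task (matching the comment in `default.toml`) while its reward grows 3.2x, so higher prestige still pays more per unit of work.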

--- a/plots/hard_1_gpt-5.4_funds.png
+++ b/plots/hard_1_gpt-5.4_funds.png
--- a/plots/hard_1_gpt-5.4_prestige.png
+++ b/plots/hard_1_gpt-5.4_prestige.png
--- a/results/yc_bench_result_hard_1_openai_gpt-5.4.json
+++ b/results/yc_bench_result_hard_1_openai_gpt-5.4.json
--- a/results/yc_bench_result_medium_1_openai_gpt-5.4.json
+++ b/results/yc_bench_result_medium_1_openai_gpt-5.4.json
--- a/scripts/bot_runner.py
+++ b/scripts/bot_runner.py
@@ -346,6 +346,7 @@ def run_bot(config_name: str, seed: int, bot_slug: str, strategy_fn: StrategyFn)
            replacement = generate_replacement_task(
                run_seed=sim_state.run_seed,
                replenish_counter=counter,
+                replaced_prestige=best_task.required_prestige,
                cfg=world_cfg,
            )
            replacement_row = Task(
--- a/scripts/plot_multi_model.py
+++ b/scripts/plot_multi_model.py
@@ -26,7 +26,7 @@ DEFAULT_RUNS = [
    {"label": "kimi-k2.5",    "model_slug": "openrouter_moonshotai_kimi-k2.5",           "color": "#2ecc71"},
 ]

-INITIAL_FUNDS_CENTS = 25_000_000  # $250K
+INITIAL_FUNDS_CENTS = 15_000_000  # $150K (default; presets may override)


 def parse_args():
@@ -129,7 +129,7 @@ def make_plot(run_data, seed, config_name, budget_usd, out_path: Path):

    # ── Funds curves ─────────────────────────────────────────────────────────
    ax_funds.axhline(0, color="#e74c3c", linewidth=0.9, linestyle="--", alpha=0.4, zorder=1)
-    ax_funds.axhline(250_000, color="#555577", linewidth=0.7, linestyle=":", alpha=0.6, zorder=1)
+    ax_funds.axhline(INITIAL_FUNDS_CENTS / 100, color="#555577", linewidth=0.7, linestyle=":", alpha=0.6, zorder=1)

    for r in run_data:
        if not r["times"]:
--- a/src/yc_bench/cli/task_commands.py
+++ b/src/yc_bench/cli/task_commands.py
@@ -88,6 +88,7 @@ def task_accept(
        replacement = generate_replacement_task(
            run_seed=sim_state.run_seed,
            replenish_counter=counter,
+            replaced_prestige=task.required_prestige,
            cfg=_get_world_cfg(),
        )

--- a/src/yc_bench/config/presets/default.toml
+++ b/src/yc_bench/config/presets/default.toml
@@ -53,7 +53,7 @@ company_name  = "BenchCo"

 [world]
 num_employees            = 10
-initial_funds_cents      = 25_000_000    # $250,000
+initial_funds_cents      = 15_000_000    # $150,000
 initial_prestige_level   = 1.0
 work_hours_per_day       = 9.0

@@ -78,9 +78,9 @@ penalty_cancel_multiplier  = 2.0    # hardened: was 1.2
 reward_prestige_scale = 0.55    # hardened: was 0.3

 # Daily prestige decay per domain. Domains not exercised lose prestige
-# over time: -0.01/day → -0.3/month. Untouched domain drops ~1 level
-# every ~3 months. Prevents single-domain hyper-specialization.
-prestige_decay_per_day = 0.01
+# over time: -0.005/day → -0.15/month. Untouched domain drops ~1 level
+# every ~6 months. Prevents single-domain hyper-specialization.
+prestige_decay_per_day = 0.005

 # Required qty scaling by prestige: qty *= 1 + scale * (prestige - 1).
 # At 0.3: prestige-5 tasks need 2.2x the work of prestige-1 tasks.
@@ -90,7 +90,7 @@ prestige_qty_scale = 0.3
 # --- Deadline ---
 # Deadline = max(deadline_min_biz_days, max_domain_qty / deadline_qty_per_day).
 # Domains are worked in parallel, so deadline scales with heaviest domain, not sum.
-deadline_qty_per_day   = 150.0
+deadline_qty_per_day   = 200.0
 deadline_min_biz_days  = 7

 # --- Progress milestones (checkpoint events at these completion fractions) ---
@@ -120,12 +120,12 @@ high = 10
 mode = 4            # hardened: base default is mode=1

 # Base reward paid on task completion, in cents (scaled further by prestige).
-# Higher-prestige tasks automatically pay more via reward_prestige_scale.
+# Mode $14K: prestige-1 tasks burn cash, prestige-3 breaks even, prestige-4+ profits.
 [world.dist.reward_funds_cents]
 type = "triangular"
-low  = 500_000      # $5,000
-high = 10_000_000   # $100,000
-mode = 3_000_000    # $30,000
+low  = 300_000      # $3,000
+high = 4_000_000    # $40,000
+mode = 1_400_000    # $14,000

 # Number of domains each task requires work in (cast to int after sampling).
 # mode=2: most tasks need 2 domains — single-specialist dominance gone.
@@ -139,9 +139,9 @@ mode = 2            # hardened: base default is mode=1
 # No trivially-small tasks: every task requires sustained employee-hours.
 [world.dist.required_qty]
 type = "triangular"
-low  = 500          # hardened: base default is 200
-high = 3000
-mode = 1400         # hardened: base default is 800
+low  = 800          # hardened: base default is 200
+high = 4000
+mode = 2000         # hardened: base default is 800

 # Prestige delta awarded per domain on task success.
 # Mean ~0.1: climbing prestige 1→5 takes ~40 tasks.
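
Note on the decay change: halving `prestige_decay_per_day` to 0.005 doubles the time an untouched domain takes to lose one level. The sketch below assumes decay is linear per day and floored at the minimum prestige of 1.0, as the comments in this diff describe. The function names are hypothetical, not taken from the engine. The unchanged comment in `schema.py` further down in this diff still quotes the old -0.01/day figures.

```python
def decayed_prestige(prestige: float, days: int,
                     decay_per_day: float = 0.005,
                     prestige_min: float = 1.0) -> float:
    """Linear daily decay, floored at prestige_min (assumed to be 1.0)."""
    return max(prestige_min, prestige - decay_per_day * days)


def days_to_lose_one_level(decay_per_day: float = 0.005) -> float:
    """Days of neglect needed to drop a full prestige level."""
    return 1.0 / decay_per_day
```

At 0.005/day a full level takes 200 days of neglect, compared with 100 days at the previous 0.01/day. With 30-day months that is closer to 6.7 months than the "~6 months" in the comment.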
--- a/src/yc_bench/config/presets/easy.toml
+++ b/src/yc_bench/config/presets/easy.toml
@@ -28,10 +28,11 @@ horizon_years = 1
 auto_advance_after_turns = 8

 [world]
+initial_funds_cents = 20_000_000    # $200,000
 # Inherits num_employees=10, num_market_tasks=200 from default.

-# Moderate deadlines: 60 qty/day → ~12 day deadline. Comfortable with 3–4 tasks.
-deadline_qty_per_day = 60.0
+# Moderate deadlines: 100 qty/day → 10-day deadline for mode task.
+deadline_qty_per_day = 100.0

 # Original (un-hardened) penalties — costly but not catastrophic.
 penalty_fail_multiplier   = 0.8
@@ -55,6 +56,6 @@ value = 1        # Single-domain — the test is about throughput, not assignmen

 [world.dist.required_qty]
 type = "triangular"
-low  = 300
-high = 1500
-mode = 700       # Moderate size — a few days of focused work each.
+low  = 500
+high = 2000
+mode = 1000      # Larger tasks — must stay focused, no excessive parallelism.
--- a/src/yc_bench/config/presets/hard.toml
+++ b/src/yc_bench/config/presets/hard.toml
@@ -40,12 +40,13 @@ horizon_years = 1
 auto_advance_after_turns = 10

 [world]
+initial_funds_cents = 10_000_000    # $100,000 — must reach prestige 3 by month 5
 # Inherits num_employees=10, num_market_tasks=200 from default.

-# Tight deadlines: 1200/150 = 8 days.
-# 1 task with 5 per domain → 5.8 days. OK.
-# 2 concurrent tasks → 11.6 days. Miss.
-deadline_qty_per_day = 150.0
+# Tight deadlines: 2000/220 = 9.1 days.
+# 1 task with 5 per domain → 8.7 days. Just fits.
+# 2 concurrent tasks → 17.4 days. Guaranteed miss.
+deadline_qty_per_day = 220.0

 # Stiff penalties — mistakes cost real prestige.
 penalty_fail_multiplier   = 1.4
@@ -71,6 +72,6 @@ mode = 2        # Most tasks need 2 domains.

 [world.dist.required_qty]
 type = "triangular"
-low  = 500
-high = 2500
-mode = 1200     # Large tasks — require sustained focus.
+low  = 1000
+high = 4000
+mode = 2000     # Large tasks — each takes ~9 days with full team. No parallelism.
--- a/src/yc_bench/config/presets/medium.toml
+++ b/src/yc_bench/config/presets/medium.toml
@@ -38,10 +38,10 @@ auto_advance_after_turns = 8
 [world]
 # Inherits num_employees=10, num_market_tasks=200 from default.

-# Deadline uses max per-domain qty. 900/100 = 9 days.
-# 2 concurrent tasks: 5 per task → 4.3 days each. Manageable.
-# 3 concurrent tasks: 3.3 per task → 6.6 days. Risky.
-deadline_qty_per_day = 100.0
+# Deadline uses max per-domain qty. 1500/150 = 10 days.
+# 1 task with 5 per domain → 6.5 days. Comfortable.
+# 2 concurrent tasks → 13 days. Miss.
+deadline_qty_per_day = 150.0

 # Real penalties — failing costs prestige, cancelling costs more.
 penalty_fail_multiplier   = 1.0
@@ -67,6 +67,6 @@ mode = 2        # Most tasks need 2 domains.

 [world.dist.required_qty]
 type = "triangular"
-low  = 400
-high = 2000
-mode = 900      # Moderate work — completable in 7–12 days with focus.
+low  = 700
+high = 3000
+mode = 1500     # Larger tasks — ~6.5 days with full team, no parallelism.
--- a/src/yc_bench/config/presets/nightmare.toml
+++ b/src/yc_bench/config/presets/nightmare.toml
@@ -49,12 +49,13 @@ horizon_years = 1
 auto_advance_after_turns = 10

 [world]
+initial_funds_cents = 8_000_000     # $80,000 — razor-thin runway
 # Inherits num_employees=10, num_market_tasks=200 from default.

-# Razor deadlines: 1600/200 = 8 days.
-# 1 task with 5 per domain → 7.7 days. Barely makes it.
-# 2 concurrent tasks → guaranteed miss.
-deadline_qty_per_day = 200.0
+# Razor deadlines: 2500/220 = 11.4 days.
+# 1 task with 5 per domain → 10.9 days. Barely fits.
+# 2 concurrent tasks → 21.8 days. Guaranteed miss.
+deadline_qty_per_day = 220.0

 # Catastrophic penalties — there is no good exit from a bad accept.
 penalty_fail_multiplier   = 2.0
@@ -81,9 +82,9 @@ mode = 2        # Mostly 2-domain, some 3-domain.

 [world.dist.required_qty]
 type = "triangular"
-low  = 600
-high = 3000
-mode = 1600     # Large work volumes — no quick wins.
+low  = 1200
+high = 5000
+mode = 2500     # Massive work volumes — each task consumes the full team.

 # Slightly larger prestige gains than default (~0.13 avg) to make
 # climbing feasible despite the steep penalty. But one blown task
--- a/src/yc_bench/config/presets/tutorial.toml
+++ b/src/yc_bench/config/presets/tutorial.toml
@@ -28,10 +28,11 @@ horizon_years = 1
 auto_advance_after_turns = 5

 [world]
+initial_funds_cents = 25_000_000    # $250,000 — very forgiving buffer
 # Inherits num_employees=10, num_market_tasks=200 from default.

-# Very generous deadlines: 30 qty/day → most tasks get 13+ day deadline.
-deadline_qty_per_day = 30.0
+# Generous deadlines: 50 qty/day → mode task gets 12-day deadline.
+deadline_qty_per_day = 50.0

 # Negligible penalties — mistakes barely hurt.
 penalty_fail_multiplier   = 0.3
@@ -53,6 +54,6 @@ value = 1        # ALL tasks single-domain — trivial assignment.

 [world.dist.required_qty]
 type = "triangular"
-low  = 200
-high = 800
-mode = 400       # Small tasks, quick completions.
+low  = 300
+high = 1200
+mode = 600       # Moderate tasks, comfortable with focused execution.
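
Note on the preset deadlines: the deadline arithmetic quoted in the preset comments can be reproduced from the formula in `default.toml`, `max(deadline_min_biz_days, max_domain_qty / deadline_qty_per_day)`. The sketch below plugs in each preset's new `required_qty` mode and `deadline_qty_per_day` from this diff. It assumes that no preset overrides `deadline_min_biz_days = 7` and that the result is not rounded by the engine, and it ignores prestige scaling of quantity. The names are illustrative.

```python
DEADLINE_MIN_BIZ_DAYS = 7  # default.toml value; assumed not overridden by presets

# preset -> (required_qty mode, deadline_qty_per_day), copied from this diff
PRESETS = {
    "default": (2000, 200.0),
    "tutorial": (600, 50.0),
    "easy": (1000, 100.0),
    "medium": (1500, 150.0),
    "hard": (2000, 220.0),
    "nightmare": (2500, 220.0),
}


def deadline_biz_days(max_domain_qty: float, qty_per_day: float,
                      min_biz_days: int = DEADLINE_MIN_BIZ_DAYS) -> float:
    """Deadline scales with the heaviest domain, with a floor in business days."""
    return max(float(min_biz_days), max_domain_qty / qty_per_day)


# Deadline in business days for a mode-sized task under each preset.
MODE_DEADLINES = {
    name: round(deadline_biz_days(qty, rate), 1)
    for name, (qty, rate) in PRESETS.items()
}
```

The computed values agree with the comments: 12 days for tutorial, 10 for easy and medium, 9.1 for hard, and 11.4 for nightmare. Nightmare's mode-task deadline is longer than hard's in absolute terms, so its extra pressure comes from larger tasks relative to team throughput, not from a shorter clock.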
--- a/src/yc_bench/config/schema.py
+++ b/src/yc_bench/config/schema.py
@@ -39,7 +39,7 @@ class WorldDists(BaseModel):
    )
    # Base reward paid on task completion, in cents (result cast to int).
    reward_funds_cents: DistSpec = Field(
-        default_factory=lambda: TriangularDist(low=500_000, high=10_000_000, mode=3_000_000)
+        default_factory=lambda: TriangularDist(low=300_000, high=4_000_000, mode=1_400_000)
    )
    # Number of domains required per task (result cast to int).
    domain_count: DistSpec = Field(
@@ -105,7 +105,7 @@ class SimConfig(BaseModel):
 class WorldConfig(BaseModel):
    # --- Workforce ---
    num_employees: int = 10
-    initial_funds_cents: int = 25_000_000    # $250,000
+    initial_funds_cents: int = 15_000_000    # $150,000
    initial_prestige_level: float = 1.0
    work_hours_per_day: float = 9.0

@@ -128,7 +128,7 @@ class WorldConfig(BaseModel):
    # Daily prestige decay per domain. Domains not exercised lose prestige
    # over time: -0.01/day → -0.3/month → untouched domain drops ~1 level
    # every ~3 months. Floored at prestige_min.
-    prestige_decay_per_day: float = 0.01
+    prestige_decay_per_day: float = 0.005

    # Required qty scaling by prestige: qty *= 1 + prestige_qty_scale * (prestige - 1).
    # At 0.3: prestige-5 tasks need 2.2× the work of prestige-1 tasks.
--- a/src/yc_bench/core/engine.py
+++ b/src/yc_bench/core/engine.py
@@ -178,6 +178,13 @@ def advance_time(
            result.payrolls_applied += 1
            payroll_idx += 1

+            # Report payroll as a wake event so the agent gets control back
+            company = db.query(Company).filter(Company.id == company_id).one()
+            result.wake_events.append({
+                "type": "monthly_payroll",
+                "funds_after": company.funds_cents,
+            })
+
            if bankrupt:
                # Insert bankruptcy event at this time
                insert_event(
@@ -188,7 +195,9 @@ def advance_time(
                    dedupe_key=f"bankruptcy:{current_time.isoformat()}",
                )
                result.bankrupt = True
-                break
+
+            # Always stop at payroll — gives the agent a chance to act
+            break

        elif action_type == "event":
            event_result = dispatch_event(db, next_event, current_time, company_id)
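
Note on the new wake event: with this change `advance_time` always stops at payroll and reports a `monthly_payroll` event carrying `funds_after` in cents. The sketch below shows how a consumer of `wake_events` might summarize those entries. The helper name and the `"other"` event type in the usage are hypothetical; only the `type` and `funds_after` keys come from this diff.

```python
from typing import Any


def payroll_notes(wake_events: list[dict[str, Any]]) -> list[str]:
    """Build one human-readable line per monthly_payroll wake event."""
    notes = []
    for event in wake_events:
        if event.get("type") != "monthly_payroll":
            continue  # other event types are out of scope for this sketch
        funds_usd = event["funds_after"] / 100  # cents -> dollars
        # The README defines bankruptcy as funds < 0 after payroll.
        status = "BANKRUPT" if funds_usd < 0 else "ok"
        notes.append(f"payroll applied: funds ${funds_usd:,.2f} ({status})")
    return notes
```

For example, `payroll_notes([{"type": "monthly_payroll", "funds_after": 8_000_000}, {"type": "other"}])` yields a single line reporting $80,000.00.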
--- a/src/yc_bench/db/models/task.py
+++ b/src/yc_bench/db/models/task.py
@@ -87,7 +87,7 @@ class Task(Base):
 class TaskRequirement(Base):
    __tablename__ = "task_requirements"
    __table_args__ = (
-        CheckConstraint("required_qty >= 200 AND required_qty <= 3000", name="ck_task_requirements_required_qty_range"),
+        CheckConstraint("required_qty >= 200 AND required_qty <= 25000", name="ck_task_requirements_required_qty_range"),
        CheckConstraint("completed_qty >= 0", name="ck_task_requirements_completed_qty_gte_0"),
        CheckConstraint("completed_qty <= required_qty", name="ck_task_requirements_completed_qty_lte_required_qty"),
    )
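
Note on the widened constraint: the old upper bound of 3000 is already below the new base `required_qty` highs in several presets, before any prestige scaling is applied. The sketch below estimates the worst-case per-domain quantity under three assumptions: prestige scaling is applied before the row is stored, prestige tops out at 10, and `prestige_qty_scale` stays at the default 0.3 in every preset. The names are illustrative.

```python
PRESTIGE_MAX = 10          # README: prestige lives in [1.0, 10.0]
PRESTIGE_QTY_SCALE = 0.3   # default.toml; assumed unchanged by presets
OLD_MAX, NEW_MAX = 3000, 25000

# New `high` of the required_qty distribution per preset, from this diff.
BASE_QTY_HIGH = {
    "default": 4000,
    "tutorial": 1200,
    "easy": 2000,
    "medium": 3000,
    "hard": 4000,
    "nightmare": 5000,
}


def worst_case_qty(base_high: int, prestige: float = PRESTIGE_MAX,
                   scale: float = PRESTIGE_QTY_SCALE) -> int:
    """Largest quantity the qty-scaling formula can produce for a preset."""
    return round(base_high * (1 + scale * (prestige - 1)))


WORST = {name: worst_case_qty(high) for name, high in BASE_QTY_HIGH.items()}
```

Under these assumptions the largest value is 18,500 (nightmare at prestige 10), which leaves headroom below the new limit of 25,000.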
--- a/src/yc_bench/services/generate_tasks.py
+++ b/src/yc_bench/services/generate_tasks.py
@@ -27,10 +27,9 @@ class GeneratedTask:
    requirements: dict[str, int]


-# First 10 market tasks are given explicit prestige values to guarantee a
-# climbable ladder from the start (avoids runs where all early tasks need
-# prestige 4+ before any are completable).
-_STRATIFIED_PRESTIGE = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
+# First 10 market tasks are forced to prestige 1 to guarantee a
+# bootstrapping path regardless of the prestige distribution.
+_STRATIFIED_PRESTIGE = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

 _ALL_DOMAINS = list(Domain)

@@ -134,14 +133,14 @@ def build_task_rows(*, run_seed, count, cfg=None):
    return task_rows, requirement_rows


-def generate_replacement_task(*, run_seed, replenish_counter, cfg=None):
+def generate_replacement_task(*, run_seed, replenish_counter, replaced_prestige, cfg=None):
+    """Generate a replacement task with the same prestige as the accepted task."""
    if cfg is None:
        cfg = WorldConfig()
    streams = RngStreams(run_seed)
    rng = streams.stream(f"replenish_{replenish_counter}")
-    prestige = _sample_required_prestige(rng, cfg)
-    requirements = _sample_requirements(rng, cfg, prestige=prestige)
-    return _make_task(rng, cfg, prestige, serial=replenish_counter, requirements=requirements)
+    requirements = _sample_requirements(rng, cfg, prestige=replaced_prestige)
+    return _make_task(rng, cfg, replaced_prestige, serial=replenish_counter, requirements=requirements)


 __all__ = [
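
Note on `replaced_prestige`: passing the accepted task's prestige into `generate_replacement_task` keeps the market's prestige histogram fixed as tasks are accepted. Before this change the replacement's prestige was resampled. The toy model below illustrates the difference by representing the market as a bare list of required-prestige integers. Both helper functions are hypothetical, and the triangular parameters (low 1, high 10, mode 4) are an assumption based on the `high`/`mode` values visible in `default.toml`.

```python
import random
from collections import Counter


def replace_resampled(market: list[int], index: int, rng: random.Random) -> list[int]:
    """Simplified pre-change behaviour: draw the replacement's prestige afresh."""
    updated = list(market)
    updated[index] = int(rng.triangular(1, 10, 4))  # assumed distribution
    return updated


def replace_same_prestige(market: list[int], index: int) -> list[int]:
    """Post-change behaviour: the replacement inherits the accepted task's prestige."""
    updated = list(market)
    updated[index] = market[index]
    return updated


def accept_all_prestige_one(market: list[int], same_prestige: bool, seed: int = 0) -> list[int]:
    """Accept every prestige-1 task once, replacing each under the chosen policy."""
    rng = random.Random(seed)
    current = list(market)
    for i, prestige in enumerate(market):
        if prestige == 1:
            current = (replace_same_prestige(current, i) if same_prestige
                       else replace_resampled(current, i, rng))
    return current
```

With the same-prestige policy, `Counter(accept_all_prestige_one(m, True))` always equals `Counter(m)`, so an agent that keeps accepting prestige-1 tasks never drains the bottom rung of the market. With resampling, the supply of low-prestige tasks can shrink over time.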