Updated backend to calculate employee tier with spiky skill distribution; simplified domain count to 4

This commit is contained in:
Muyu He 2026-03-05 18:12:48 -08:00
parent 6d6f0a855d
commit eb18c5a90c
22 changed files with 226 additions and 864 deletions

README.md
View file

@ -2,180 +2,7 @@
A long-horizon deterministic benchmark for LLM agents. The agent plays CEO of an AI startup over a simulated 13-year run, operating exclusively through a CLI tool against a SQLite-backed discrete-event simulation.
The benchmark tests whether agents can manage compounding decisions: prestige specialisation, employee allocation, cash flow, and deadline risk - sustained over hundreds of turns.
---
## Simulation Dynamics
![YC Bench Architecture](imgs/arch.png "Architecture YC-Bench")
<!-- ```
┌─────────────────────────────────────────────────────────────────────────┐
│ AGENT (LLM) │
│ │
│ Observes: company status · employee skills · market tasks · ledger │
│ Acts via: run_command("yc-bench <cmd>") · scratchpad (persistent) │
└───────────────────────┬─────────────────────────────────────────────────┘
│ CLI commands (JSON responses)
┌─────────────────────────────────────────────────────────────────────────┐
│ DISCRETE-EVENT SIMULATION │
│ │
│ ┌─────────────┐ accept ┌──────────┐ assign+dispatch │
│ │ MARKET │ ──────────► │ PLANNED │ ──────────────────► │
│ │ 100 tasks │ └──────────┘ │
│ └─────────────┘ │
│ ▲ replenish ┌──────────────────────┐ │
│ │ │ ACTIVE │ │
│ │ ┌────────────────────────── │ progress flushes │ │
│ │ │ │ every sim-advance │ │
│ │ │ └──────────┬───────────┘ │
│ │ │ ┌───────────────────────────────────┘ │
│ │ │ │ ETA solver fires TASK_COMPLETED event │
│ │ │ ▼ │
│ │ │ ┌────────────────────────────────────────────────────┐ │
│ │ │ │ TASK_COMPLETED handler │ │
│ │ │ │ │ │
│ │ │ │ on_time? YES → +reward_funds +prestige_delta │ │
│ │ │ │ +skill_boost +salary_bump │ │
│ │ │ │ NO → -1.4× prestige_delta (penalty) │ │
│ └───┘ └─────────────────────┬───────────────────────────── ┘ │
│ │ │
│ ┌──────────────────────────────────┘ │
│ │ Monthly payroll (1st biz day) Bankruptcy check (funds < 0)
│ │ Horizon end (13 years) Context truncation (last 20 rounds)│
└──┴──────────────────────────────────────────────────────────────────────┘
``` -->
### Core loop
1. Agent calls `yc-bench sim resume` to advance time to the next event.
2. The engine flushes task progress, fires due events, applies payroll.
3. Agent reads wake events and decides: accept tasks, assign employees, dispatch, cancel.
4. Repeat until bankruptcy or horizon end.
If the agent doesn't call `sim resume` for N consecutive turns (default 10), the loop forces one automatically.
---
## Economy
### Funds
- Start: **$250,000** (`initial_funds_cents = 25_000_000`)
- Payroll deducted on the **first business day of each month**
- Task reward formula: `base × (1 + reward_prestige_scale × (prestige_req - 1))`
- Base: triangular sample in [$5K, $100K], mode $30K
- `reward_prestige_scale = 0.55` (default): a prestige-8 task pays ~4.85× as much as a prestige-1 task
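The scaling above can be sanity-checked in a few lines (a sketch; `task_reward_cents` is a hypothetical helper, not part of the package's API):

```python
def task_reward_cents(base_cents: int, prestige_req: int,
                      reward_prestige_scale: float = 0.55) -> int:
    """Reward formula from this README: base * (1 + scale * (prestige_req - 1))."""
    return round(base_cents * (1 + reward_prestige_scale * (prestige_req - 1)))

# At the $30K mode, a prestige-8 task pays 4.85x the prestige-1 amount:
print(task_reward_cents(3_000_000, 1))  # 3000000 cents  ($30,000)
print(task_reward_cents(3_000_000, 8))  # 14550000 cents ($145,500)
```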
### Monthly payroll (5 employees, fast_test)
| Tier | Share | Salary/month | Skill rate |
|------|-------|-------------|------------|
| Junior | 50% | $2K-$4K | 1.0-6.5 units/hr |
| Mid | 35% | $6K-$8K | 3.5-8.5 units/hr |
| Senior | 15% | $10K-$15K | 5.5-10.0 units/hr |
Monthly payroll ≈ **$32K** (5 employees). Starting runway ≈ **7.8 months**.
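The runway figure follows directly from those numbers (a back-of-envelope sketch; the engine itself stores money as integer cents):

```python
initial_funds_cents = 25_000_000   # $250,000 starting cash
monthly_payroll_cents = 3_200_000  # ~$32,000/month for 5 employees

runway_months = initial_funds_cents / monthly_payroll_cents
print(f"{runway_months:.1f} months")  # 7.8 months
```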
### Task completion rewards
On success:
- Funds += `reward_funds_cents`
- Prestige += `reward_prestige_delta` (beta-distributed, typically 0.1-1.5) per required domain
- Skill rate += `skill_boost_pct × current_rate` per assigned employee per domain
- Salary += `1% × current_salary` per assigned employee (compounding payroll pressure)
On failure (past deadline):
- Prestige -= `1.4 × reward_prestige_delta` per domain
On cancel:
- Prestige -= `2.0 × reward_prestige_delta` per domain
---
## Prestige
7 domains: `system · research · data · frontend · backend · training · hardware`
- Range: **[1.0, 10.0]** per domain, starts at 1.0
- Tasks require a minimum prestige level. Agent can only accept tasks where `max(company_prestige) >= required_prestige`.
- Default distribution: mode=4, so most tasks need prestige 3-5.
- First 10 market tasks are stratified `[1,1,1,1,2,2,2,3,3,4]` to bootstrap progression.
Specialising in 2-3 domains unlocks progressively higher-reward tasks. Spreading thin keeps you locked at low prestige everywhere.
---
## Employee throughput
Each employee has a skill rate (units/hr) per domain.
When an employee is assigned to N active tasks simultaneously:
```
effective_rate_per_task = base_rate / N
```
Assigning one senior (rate 8.0) to 4 tasks gives 2.0 units/hr each — often worse than a junior focused on one.
Task completion time = `max(remaining[d] / effective_rate[d])` across all required domains.
Deadline = `max(7, total_required_qty / deadline_qty_per_day)` business days.
`deadline_qty_per_day = 200` in both `challenge` and `fast_test`. With 10 employees and 5 focused per domain, team throughput ≈ 230 units/domain/day — achievable for up to ~4 simultaneous tasks.
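The throughput and deadline math above can be sketched as follows (helper names are assumed for illustration, not the package's internals):

```python
def effective_rate(base_rate: float, active_tasks: int) -> float:
    """Skill rate is split evenly across an employee's active tasks."""
    return base_rate / active_tasks

def completion_hours(remaining: dict, rates: dict) -> float:
    """A task finishes when its slowest required domain finishes."""
    return max(remaining[d] / rates[d] for d in remaining)

def deadline_biz_days(total_required_qty: float, qty_per_day: float = 200.0,
                      min_days: int = 7) -> float:
    """Deadline = max(7, total_required_qty / deadline_qty_per_day) business days."""
    return max(min_days, total_required_qty / qty_per_day)

# A senior at rate 8.0 split across 4 tasks works each task at 2.0 units/hr:
r = effective_rate(8.0, 4)
print(completion_hours({"backend": 400.0, "data": 90.0},
                       {"backend": r, "data": r}))  # 200.0 hours (backend is the bottleneck)
print(deadline_biz_days(1900))  # 9.5 business days
```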
---
## Agent interface
All commands return JSON to stdout.
### Observe
```bash
yc-bench company status # funds, prestige, runway, payroll
yc-bench employee list # skills, salary, active tasks
yc-bench market browse # available tasks (--limit N --offset N)
yc-bench task list [--status X] # planned|active|completed_*|cancelled
yc-bench task inspect --task-id UUID # progress %, deadline, assignments
yc-bench finance ledger # full transaction history
yc-bench report monthly # P&L per month
yc-bench scratchpad read # persistent notes (survives context truncation)
```
### Act
```bash
yc-bench task accept --task-id UUID # pull from market, set deadline
yc-bench task assign --task-id UUID --employee-id UUID
yc-bench task dispatch --task-id UUID # start work (≥1 assignment required)
yc-bench task cancel --task-id UUID --reason "" # 2× prestige penalty
yc-bench sim resume # advance to next event
yc-bench scratchpad write/append/clear # persistent memory
```
---
## Context management
- **Proactive truncation**: keeps the last 20 conversation rounds before each API call. Older rounds are dropped.
- **Scratchpad**: per-company persistent text in DB. Survives truncation. Use it to store strategy, deadlines, and employee assignments.
---
## Repository layout
```
YC_Bench/
├── src/ # Python package (yc_bench)
├── scripts/ # plot_multi_model.py, run_benchmark.sh
├── logs/ # per-model stdout/stderr logs
├── db/ # SQLite databases (one per model run)
├── results/ # JSON rollout files
├── plots/ # generated PNG charts
├── pyproject.toml
└── README.md
```
The benchmark tests whether agents can manage compounding decisions: prestige specialisation, employee allocation, cash flow, and deadline risk — sustained over hundreds of turns.
---
@ -194,8 +21,6 @@ cd YC_Bench
uv sync
```
No database setup required — the runner auto-creates `db/<config>_<seed>_<model>.db` on first run.
### API key
```bash
@ -206,7 +31,7 @@ OPENROUTER_API_KEY="sk-or-v1-..." # for openrouter/*
OPENAI_API_KEY="sk-..." # for openai/*
```
### Run a single model
### Run
```bash
uv run yc-bench run \
@ -215,65 +40,61 @@ uv run yc-bench run \
--config medium
```
Outputs:
- `db/medium_1_gemini_gemini-3-flash-preview.db` — SQLite simulation state
- `results/yc_bench_result_medium_1_gemini_gemini-3-flash-preview.json` — full rollout + transcript
Outputs a SQLite DB in `db/` and a JSON rollout in `results/`.
### Live dashboard
When running in a terminal, YC-Bench displays an interactive dashboard that updates in-place after each turn:
```
╭──────────────────────────── YC-Bench ────────────────────────────╮
│ Model claude-haiku-4-5-20251001 seed=1 medium │
│ Turn 8 │
│ Sim Date 2025-03-06 -> 2026-01-01 │
│ Elapsed 0h 02m 34s │
│ Funds $186,271.66 -$63,728 ██▇▃▁ │
│ Runway 5.8mo │
│ Tasks 3 active / 3 queued 2 done 1 fail │
│ Team 5 people $31,864.17/mo │
│ Cost $0.0212 (3.7s/turn) │
│ Action yc-bench task dispatch 7 │
│ Status >> Turn 9: waiting for LLM... │
╰──────────────────────────────────────────────────────────────────╯
╭──────────────────────────── Tasks ───────────────────────────────╮
│ >> Build GPU Cluster $64,152 2025-02-03 Research ==== Training ====== │
│ >> Deploy Observability $27,908 2025-01-22 Data ===... │
│ .. Blue-Green Deploy $30,780 2025-03-18 Backend ...... Data ...... │
╰──────────────────────────────────────────────────────────────────╯
╭──────────────────────────── Team ────────────────────────────────╮
│ Alice Chen $2,564 Training===. Frontend==.. Research=... │
│ Bob Martinez $14,947 Backend===. Research==.. Data==.. │
╰──────────────────────────────────────────────────────────────────╯
```
The dashboard shows:
- **Funds sparkline** — visual trend of your cash position over time
- **Color-coded progress bars** per domain on each task (green = done, yellow = partial, red = low)
- **Employee skill bars** — top 3 skills per team member with strength indicators
- **Runway urgency** — green (safe), yellow (low), red blinking (critical)
- **Salary heat** — expensive employees highlighted in red
To disable the dashboard and see raw log output instead:
```bash
uv run yc-bench run --model ... --seed 1 --config medium --no-live
```
When `--no-live` is set (or stdout is not a terminal, e.g. piped to a file), the original logging output is used. Debug logs from LiteLLM/httpx are written to `logs/debug.log` when the dashboard is active.
### Run 5 models in parallel
### Run multiple models in parallel
```bash
bash scripts/run_benchmark.sh --seed 1 --config challenge
```
### Generate the comparison plot
---
## How it works
![YC Bench Architecture](imgs/arch.png "Architecture YC-Bench")
### Core loop
1. Agent calls `yc-bench sim resume` to advance time to the next event.
2. The engine flushes task progress, fires due events, applies payroll.
3. Agent reads wake events and decides: accept tasks, assign employees, dispatch, cancel.
4. Repeat until bankruptcy or horizon end.
The simulation ends on **bankruptcy** (funds < 0 after payroll), **horizon end** (13 years), or **max turns** (if configured). If the agent doesn't call `sim resume` for 10 consecutive turns, the loop forces one automatically.
### Key mechanics
- **Funds**: start at $250K. Monthly payroll is deducted automatically. Task rewards scale with prestige (`base × (1 + 0.55 × (prestige - 1))`).
- **4 domains**: `research · inference · data/environment · training`. Each domain tracks prestige independently in [1.0, 10.0].
- **Prestige gating**: tasks require a minimum prestige level. Most tasks need prestige 3-5, so the agent must climb from 1.0 by completing easier tasks first. First 10 market tasks are stratified `[1,1,1,1,2,2,2,3,3,4]` to bootstrap progression.
- **Employees**: 10 employees across 3 tiers (junior/mid/senior). The agent sees only each employee's tier and salary — not their per-domain skill rates. A junior can secretly be a superstar in one domain, so the agent must infer productivity from task progress observations.
- **Throughput splitting**: an employee assigned to N active tasks has `effective_rate = base_rate / N`. Focus beats breadth.
- **Task success**: on-time completion awards funds + prestige + skill boosts + 1% salary bump (compounding payroll pressure). Late completion penalises prestige (1.4×). Cancellation penalises harder (2.0×).
- **Progress checkpoints**: the agent is woken at 25%, 50%, 75%, and 100% completion — providing data points to estimate employee productivity.
- **Scratchpad**: persistent notes in the DB that survive context truncation (only last 20 conversation rounds are kept).
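Since per-domain skill rates are hidden, the 25/50/75% checkpoints are the agent's main signal for productivity. A minimal estimation sketch (`infer_rate` is a hypothetical helper, assuming one employee working a single domain at a constant rate):

```python
def infer_rate(qty_required: float, pct_a: float, pct_b: float,
               work_hours_between: float) -> float:
    """Units/hr implied by moving from pct_a% to pct_b% completion."""
    return qty_required * (pct_b - pct_a) / 100 / work_hours_between

# Woken at 25% and again at 50% of a 1,000-unit requirement, 45 work-hours apart:
print(round(infer_rate(1000, 25, 50, 45), 2))  # 5.56 units/hr
```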
### Agent CLI
All commands return JSON. The agent interacts via `run_command("yc-bench <cmd>")`.
```bash
uv run python scripts/plot_multi_model.py --seed 1 --config challenge --budget 30
# → plots/funds_curves.png
# Observe
yc-bench company status # funds, prestige, runway
yc-bench employee list # tier, salary, active tasks
yc-bench market browse [--domain X] [--limit N] # available tasks
yc-bench task list [--status X] # your tasks
yc-bench task inspect --task-id UUID # progress, deadline, assignments
yc-bench finance ledger # transaction history
yc-bench report monthly # P&L per month
# Act
yc-bench task accept --task-id UUID # pull from market
yc-bench task assign --task-id UUID --employee-id UUID
yc-bench task dispatch --task-id UUID # start work
yc-bench task cancel --task-id UUID --reason "" # cancel (2× prestige penalty)
yc-bench sim resume # advance time
yc-bench scratchpad write/append/clear # persistent memory
```
---
@ -282,90 +103,15 @@ uv run python scripts/plot_multi_model.py --seed 1 --config challenge --budget 3
Experiment presets live in `src/yc_bench/config/presets/` as TOML files. Pass the preset name via `--config`.
```
src/yc_bench/config/presets/
├── default.toml # 3yr, 10 employees, 500 tasks — base config
├── tutorial.toml # 1yr, 3 employees, 50 tasks — learn the loop
├── easy.toml # 1yr, 5 employees, 100 tasks — throughput awareness
├── medium.toml # 1yr, 5 employees, 150 tasks — prestige strategy
├── hard.toml # 1yr, 7 employees, 200 tasks — precise ETA reasoning
├── nightmare.toml # 1yr, 8 employees, 300 tasks — sustained perfection
├── challenge.toml # 3yr, 5 employees, 200 tasks — long-horizon endurance
└── fast_test.toml # 1yr, 5 employees, 100 tasks — quick iteration
```
| Config | Employees | Tasks | Tests |
|--------|-----------|-------|-------|
| **tutorial** | 3 | 50 | Basic accept→assign→dispatch loop |
| **easy** | 5 | 100 | Throughput awareness |
| **medium** | 5 | 150 | Prestige climbing + domain specialization |
| **hard** | 7 | 200 | Precise ETA reasoning |
| **nightmare** | 8 | 300 | Sustained perfection under compounding payroll |
Each difficulty level tests one additional concept:
| Config | Tests | Key constraint |
|--------|-------|---------------|
| **tutorial** | Basic accept→assign→dispatch loop | All prestige-1, single domain |
| **easy** | Throughput awareness | Don't over-parallelize |
| **medium** | Prestige climbing + domain specialization | 2-domain tasks, prestige mode=3 |
| **hard** | Precise ETA computation | One bad accept degrades in-flight tasks |
| **nightmare** | Sustained perfection under compounding payroll | One failure ≈ fatal, salary bumps 2%/task |
### Key WorldConfig parameters
| Parameter | Default | Controls |
|-----------|---------|---------|
| `initial_funds_cents` | 25_000_000 | Starting cash ($250K) |
| `num_employees` | 5 | Workforce size |
| `num_market_tasks` | 100 | Market pool size |
| `required_prestige_mode` | 4 | Peak of prestige-req distribution |
| `domain_count_mode` | 2 | Most tasks require 2 domains |
| `required_qty_low/mode` | 500 / 1400 | Task work volume (units) |
| `deadline_qty_per_day` | 200 | Units completable per biz day (lower = easier) |
| `deadline_min_biz_days` | 7 | Minimum deadline |
| `penalty_fail_multiplier` | 1.4 | Prestige × this on deadline miss |
| `penalty_cancel_multiplier` | 2.0 | Prestige × this on cancel |
| `reward_prestige_scale` | 0.55 | Extra reward fraction per prestige level above 1 |
| `salary_bump_pct` | 0.01 | Salary raise per employee per completed task |
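The `salary_bump_pct` raise compounds per completed task, which is what drives long-run payroll creep. A sketch of the effect (assumed helper name; the engine applies the bump to each assigned employee on every success):

```python
def salary_after(salary_cents: int, completed_tasks: int,
                 salary_bump_pct: float = 0.01) -> int:
    """Each completed task raises an assigned employee's salary by 1%, compounding."""
    return round(salary_cents * (1 + salary_bump_pct) ** completed_tasks)

# A $10K/month senior who completes 40 assigned tasks ends up near $14.9K/month:
print(salary_after(1_000_000, 40))  # 1488864 cents ≈ $14,889/month
```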
### AgentConfig
| Parameter | Default | Controls |
|-----------|---------|---------|
| `model` | openrouter/openai/gpt-4o-mini | LLM model string |
| `temperature` | 0.0 | Sampling temperature |
| `history_keep_rounds` | 20 | Conversation rounds kept in context |
### LoopConfig
| Parameter | Default | Controls |
|-----------|---------|---------|
| `auto_advance_after_turns` | 5 | Force sim resume after N turns without one |
| `max_turns` | 50 | Hard cap on agent turns (null = unlimited) |
### Environment overrides
```bash
YC_BENCH_EXPERIMENT=fast_test # select preset
DATABASE_URL=sqlite:///custom.db # SQLite path
```
---
## Terminal conditions
| Condition | Trigger |
|-----------|---------|
| Horizon end | `sim_time >= start_date + horizon_years` |
| Bankruptcy | `funds_cents < 0` after any payroll |
| Error | Agent runtime exception (API failure, exhausted retries) |
| Max turns | `turn_count >= max_turns` (if set) |
---
## What makes it hard
The hardened default is designed so that the obvious strategies fail:
- **Prestige-1 farming** is unprofitable. Most replacement tasks need prestige 3-5 and pay much more. Farming the bottom locks you out.
- **Single-specialist dominance** is gone. Most tasks need 2 domains. You must allocate across skill combinations.
- **Speculative accepting** is punished. Cancel penalty (2×) exceeds fail penalty (1.4×) so you can't accept everything and drop the losers.
- **Ignoring payroll** causes bankruptcy. ~$32K/month burns your $250K in 7.8 months — but task complexity means you must also pace your accepts.
- **Parallel dispatch** dilutes throughput. Splitting employees across too many tasks extends every deadline — focus beats breadth.
- **Salary bumps compound**. Every task completion raises assigned employee salaries 1%. Payroll creep accelerates over time.
See `default.toml` for the full list of tunable parameters.
---
@ -375,15 +121,15 @@ The hardened default is designed so that the obvious strategies fail:
![3-model comparison](plots/sonnet_vs_gemini.png)
#### Survival rates (at end of year 1)
#### Survival rates
| Config | Sonnet 4.6 | Gemini 3 Flash | GPT-5.2 |
|--------|-----------|----------------|---------|
| **medium** | 3/3 survived | 3/3 survived | 3/3 survived |
| **hard** | 1/3 survived | 2/3 survived | 2/3 survived |
| **nightmare** | 1/3 survived | 3/3 survived | 2/3 survived |
| **medium** | 3/3 | 3/3 | 3/3 |
| **hard** | 1/3 | 2/3 | 2/3 |
| **nightmare** | 1/3 | 3/3 | 2/3 |
#### Final funds at 1-year mark (bankrupt = funds < 0)
#### Final funds (bankrupt = funds < 0)
| Config | Seed | Sonnet 4.6 | Gemini 3 Flash | GPT-5.2 |
|--------|------|-----------|----------------|---------|
@ -399,82 +145,21 @@ The hardened default is designed so that the obvious strategies fail:
**Overall: Gemini 8/9 · GPT-5.2 7/9 · Sonnet 5/9**
### Key findings
#### Key findings
**Gemini leads on consistency (8/9).** Near-perfect win rates on medium (93-98%), and the only model to sweep all 3 nightmare seeds. Achieves this without using the scratchpad — purely reactive, high-frequency decision-making.
- **Gemini leads on consistency** (8/9 survival). The only model to sweep all 3 nightmare seeds.
- **GPT-5.2 has the highest ceiling.** Hard seed 3: $43.5M vs Gemini's $21.9M. When it survives, it tends to outperform by a wide margin.
- **Sonnet is high-variance.** Nightmare seed 2: $10.1M (best nightmare result), but 4/9 bankruptcies overall.
- **Win rate predicts survival.** Every run with >58% task win rate survived. Every run below 40% went bankrupt.
**GPT-5.2 excels at hard (2/3, matching Gemini) with the highest absolute returns.** Hard seed 3: $43.5M vs Gemini's $21.9M. Nightmare seed 3: $23.6M vs Gemini's $805K. When GPT-5.2 survives, it tends to outperform by a significant margin.
**Sonnet has the highest ceiling when it works but the lowest floor.** Nightmare seed 2: $10.1M (best nightmare result). But 4/9 bankruptcies — Sonnet fails harder than the others on adverse seeds.
**Hard is the differentiator config.** On easy configs all three survive. On hard/nightmare the strategies diverge sharply. Gemini plays safe and consistent; GPT-5.2 swings big; Sonnet is high-variance.
**Win rate predicts survival.** Every run with >58% task win rate survived. Every run with <40% went bankrupt. Below that threshold, prestige losses from failures outpace gains and lock the agent out of profitable tasks.
### Prestige specialization
#### Prestige specialization
![Prestige radar](plots/prestige_radar.png)
Each radar shows final prestige across 7 domains (1 = center, 10 = edge). Large polygons = the model climbed prestige broadly. Tiny dots near center = bankrupt before gaining any prestige. Pointy shapes = domain specialization.
**Human Devised Rule** (navy dashed) consistently fills the full radar — it methodically maxes prestige everywhere. Among LLMs, **Gemini** builds the most balanced prestige profiles. **GPT-5.2** shows clear specialization on medium (backend/data/frontend high, training untouched). **Sonnet** is bimodal: either maxes everything (medium seed 1) or collapses entirely (nightmare seeds 1 & 3).
### Why models fail
The scratchpad evolution of Sonnet on hard seed 2 tells the full story:
![Sonnet hard seed 2 scratchpad evolution](plots/notepad_hard_2_claude-sonnet-4-6.gif)
Common failure patterns across all bankrupt runs:
1. **Over-parallelization.** Accepting 3-5 tasks at once, splitting employees across them. Effective rate per task drops below deadline requirements. Sonnet nightmare seed 3 ran 5 tasks simultaneously with 8 employees on turn 13.
2. **No prestige gating.** Accepting prestige-2 tasks when company prestige is 1.0. The task completes late, triggers a 1.4× prestige penalty, and the agent ends up worse than before.
3. **Late adaptation.** Sonnet correctly identifies problems in its scratchpad ("PRESTIGE CRISIS — MARKET LOCK") but only after payroll has consumed the runway. By turn 137 of hard seed 2, all tasks require prestige ≥ 2 but the company is stuck at 1.0 in 6 of 7 domains.
4. **Inconsistent ETA reasoning.** Sonnet's medium seed 2 has a 49% win rate — essentially a coin flip. It understands throughput math in its scratchpad but doesn't consistently apply it when selecting tasks.
---
## Simulation rules
Please cite our work if you find it useful!
- **Business time**: weekdays only, 09:00-18:00. No leap years.
- **Money**: stored as integer cents (`BIGINT`). No floating point.
- **Payroll**: fired on the first business day of each month.
- **Event ordering**: deterministic — `(scheduled_at, priority, id)`.
- **Determinism**: all task generation and employee seeding is reproducible given `--seed`.
- **Prestige**: `NUMERIC(6,3)`, hard clamped to `[1.0, 10.0]`.
- **DB reuse**: if a simulation is terminal (bankrupt or horizon reached), re-running with the same DB wipes and reseeds cleanly.
---
## Output format
`results/yc_bench_result_<config>_<seed>_<model>.json`:
```json
{
"session_id": "run-1-openrouter/openai/gpt-4o-mini",
"model": "openrouter/openai/gpt-4o-mini",
"seed": 1,
"horizon_years": 1,
"turns_completed": 46,
"terminal": true,
"terminal_reason": "bankruptcy",
"total_cost_usd": 0.100008,
"started_at": "...",
"ended_at": "...",
"transcript": [
{
"turn": 1,
"timestamp": "...",
"user_input": "## Simulation Start ...",
"agent_output": "Executed 3 tool call(s): ...",
"commands_executed": ["yc-bench company status -> {...}", ...]
}
]
}
```
Please cite our work if you find it useful and interesting!
```bibtex
@misc{collinear-ai2025ycbench,
author = {{Collinear AI}},

View file

@ -344,13 +344,12 @@ def run_bot(config_name: str, seed: int, bot_slug: str, strategy_fn: StrategyFn)
company_id=None,
status=TaskStatus.MARKET,
title=replacement.title,
description=replacement.description,
required_prestige=replacement.required_prestige,
reward_funds_cents=replacement.reward_funds_cents,
reward_prestige_delta=replacement.reward_prestige_delta,
skill_boost_pct=replacement.skill_boost_pct,
accepted_at=None, deadline=None, completed_at=None,
success=None, halfway_event_emitted=False,
success=None, progress_milestone_pct=0,
)
db.add(replacement_row)
for domain, qty in replacement.requirements.items():
@ -375,7 +374,7 @@ def run_bot(config_name: str, seed: int, bot_slug: str, strategy_fn: StrategyFn)
recalculate_etas(db, company_id, sim_state.sim_time,
impacted_task_ids={best_task.id},
half_threshold=world_cfg.task_half_threshold)
milestones=world_cfg.task_progress_milestones)
task_cycles_used += 1

View file

@ -50,8 +50,8 @@ CONFIGS = ["medium", "hard", "nightmare"]
SEEDS = [1, 2, 3]
DIFF_COLORS = {"medium": BLUE, "hard": ORANGE, "nightmare": "#DC2626"}
DOMAINS = ["system", "research", "data", "frontend", "backend", "training", "hardware"]
DOMAIN_LABELS = ["SYS", "RES", "DATA", "FE", "BE", "TRAIN", "HW"]
DOMAINS = ["research", "inference", "data_environment", "training"]
DOMAIN_LABELS = ["RES", "INF", "DATA/ENV", "TRAIN"]
def load_logo_image(height_px=80):

View file

@ -23,13 +23,10 @@ engine = build_engine()
factory = build_session_factory(engine)
DOMAIN_COLORS = {
"training": "#e67e22",
"research": "#3498db",
"backend": "#2ecc71",
"hardware": "#9b59b6",
"data": "#1abc9c",
"frontend": "#e74c3c",
"system": "#95a5a6",
"research": "#3498db",
"inference": "#9b59b6",
"data_environment": "#1abc9c",
"training": "#e67e22",
}
with session_scope(factory) as db:

View file

@ -18,7 +18,7 @@ Your goal is to maximize company prestige and funds over the simulation horizon
### Observe
- `yc-bench company status` funds, prestige, employee count, payroll, bankruptcy risk
- `yc-bench employee list` list all employees with IDs, salaries, skill rates, and current assignments
- `yc-bench employee list` list all employees with IDs, tier (junior/mid/senior), salaries, and current assignments
- `yc-bench market browse [--domain X] [--required-prestige-lte N] [--reward-min-cents N] [--limit N] [--offset N]` browse available tasks (default limit 50; the response includes a `total` field if total > 50, paginate with --offset to see more)
- `yc-bench task list [--status X]` list your tasks (planned, active, completed, cancelled)
- `yc-bench task inspect --task-id <UUID>` detailed task info (requirements, assignments, progress)
@ -106,7 +106,8 @@ def build_turn_context(
tid = ev.get("task_id", "?")
parts.append(f"- Task {tid}: {'SUCCESS' if success else 'FAILED'}")
elif ev_type == "task_half":
parts.append(f"- Task {ev.get('task_id', '?')}: 50% progress reached")
pct = ev.get("milestone_pct", "?")
parts.append(f"- Task {ev.get('task_id', '?')}: {pct}% progress reached")
elif ev_type == "horizon_end":
parts.append("- **Horizon end reached. Simulation complete.**")
elif ev_type == "bankruptcy":

View file

@ -9,7 +9,7 @@ from uuid import UUID
import typer
from ..db.session import build_engine, build_session_factory, session_scope
from ..db.session import build_engine, build_session_factory, init_db, session_scope
app = typer.Typer(name="yc-bench", add_completion=False)
@ -22,6 +22,7 @@ app = typer.Typer(name="yc-bench", add_completion=False)
def get_db():
"""Yield a transactional SQLAlchemy session, commit on success."""
engine = build_engine()
init_db(engine)
factory = build_session_factory(engine)
with session_scope(factory) as session:
yield session

View file

@ -3,7 +3,7 @@ from __future__ import annotations
import typer
from sqlalchemy import func
from ..db.models.employee import Employee, EmployeeSkillRate
from ..db.models.employee import Employee
from ..db.models.task import Task, TaskAssignment, TaskStatus
from ..db.models.sim_state import SimState
from . import get_db, json_output, error_output
@ -25,15 +25,6 @@ def employee_list():
results = []
for emp in employees:
# Skills
skills = db.query(EmployeeSkillRate).filter(
EmployeeSkillRate.employee_id == emp.id
).all()
skill_map = {
s.domain.value: float(s.rate_domain_per_hour)
for s in skills
}
# Current active assignments
active_assignments = (
db.query(TaskAssignment.task_id)
@ -49,9 +40,9 @@ def employee_list():
results.append({
"employee_id": str(emp.id),
"name": emp.name,
"tier": emp.tier,
"salary_cents": emp.salary_cents,
"work_hours_per_day": float(emp.work_hours_per_day),
"skills": skill_map,
"active_task_count": len(active_task_ids),
"active_task_ids": active_task_ids,
})

View file

@ -58,7 +58,6 @@ def market_browse(
results.append({
"task_id": str(task.id),
"title": task.title,
"description": task.description,
"required_prestige": task.required_prestige,
"reward_funds_cents": task.reward_funds_cents,
"reward_prestige_delta": float(task.reward_prestige_delta),

View file

@ -95,7 +95,6 @@ def task_accept(
company_id=None,
status=TaskStatus.MARKET,
title=replacement.title,
description=replacement.description,
required_prestige=replacement.required_prestige,
reward_funds_cents=replacement.reward_funds_cents,
reward_prestige_delta=replacement.reward_prestige_delta,
@ -104,7 +103,7 @@ def task_accept(
deadline=None,
completed_at=None,
success=None,
halfway_event_emitted=False,
progress_milestone_pct=0,
)
db.add(replacement_row)
@ -185,7 +184,7 @@ def task_assign(
if t and t.status == TaskStatus.ACTIVE:
impacted.add(t.id)
if impacted:
recalculate_etas(db, sim_state.company_id, sim_state.sim_time, impacted, half_threshold=_get_world_cfg().task_half_threshold)
recalculate_etas(db, sim_state.company_id, sim_state.sim_time, impacted, milestones=_get_world_cfg().task_progress_milestones)
# Return current assignment list
assignments = db.query(TaskAssignment).filter(TaskAssignment.task_id == tid).all()
@ -251,7 +250,7 @@ def task_dispatch(
peer_task = db.query(Task).filter(Task.id == pa.task_id).one_or_none()
if peer_task and peer_task.status == TaskStatus.ACTIVE:
impacted.add(peer_task.id)
recalculate_etas(db, sim_state.company_id, sim_state.sim_time, impacted, half_threshold=_get_world_cfg().task_half_threshold)
recalculate_etas(db, sim_state.company_id, sim_state.sim_time, impacted, milestones=_get_world_cfg().task_progress_milestones)
json_output({
"task_id": str(task.id),
@ -353,7 +352,6 @@ def task_inspect(
json_output({
"task_id": str(task.id),
"title": task.title,
"description": task.description,
"status": task.status.value,
"required_prestige": task.required_prestige,
"reward_funds_cents": task.reward_funds_cents,
@ -442,7 +440,7 @@ def task_cancel(
if t and t.status == TaskStatus.ACTIVE:
impacted.add(t.id)
if impacted:
recalculate_etas(db, sim_state.company_id, sim_state.sim_time, impacted, half_threshold=_get_world_cfg().task_half_threshold)
recalculate_etas(db, sim_state.company_id, sim_state.sim_time, impacted, milestones=_get_world_cfg().task_progress_milestones)
# Bankruptcy check
company = db.query(Company).filter(Company.id == sim_state.company_id).one()

View file

@ -82,8 +82,8 @@ reward_prestige_scale = 0.55 # hardened: was 0.3
deadline_qty_per_day = 320.0 # hardened: was 200.0
deadline_min_biz_days = 7
# --- Progress milestone ---
task_half_threshold = 0.5
# --- Progress milestones (checkpoint events at these completion fractions) ---
task_progress_milestones = [0.25, 0.5, 0.75]
# --- Business hours ---
workday_start_hour = 9
@ -161,20 +161,20 @@ share = 0.50
min_cents = 200_000 # $2,000/month
max_cents = 400_000 # $4,000/month
rate_min = 1.0 # units/hour
rate_max = 6.5
rate_max = 4.0
[world.salary_mid]
name = "mid"
share = 0.35
min_cents = 600_000 # $6,000/month
max_cents = 800_000 # $8,000/month
rate_min = 3.5
rate_max = 8.5
rate_min = 4.0
rate_max = 7.0
[world.salary_senior]
name = "senior"
share = 0.15
min_cents = 1_000_000 # $10,000/month
max_cents = 1_500_000 # $15,000/month
rate_min = 5.5
rate_min = 7.0
rate_max = 10.0

View file

@ -128,8 +128,8 @@ class WorldConfig(BaseModel):
deadline_qty_per_day: float = 200.0 # work units assumed completable per business day
deadline_min_biz_days: int = 7
# --- Progress milestone ---
task_half_threshold: float = 0.5
# --- Progress milestones (fraction thresholds that trigger checkpoint events) ---
task_progress_milestones: list[float] = Field(default_factory=lambda: [0.25, 0.5, 0.75])
# --- Business hours ---
workday_start_hour: int = 9
@ -143,21 +143,21 @@ class WorldConfig(BaseModel):
default_factory=lambda: SalaryTierConfig(
name="junior", share=0.50,
min_cents=200_000, max_cents=400_000,
rate_min=1.0, rate_max=6.5,
rate_min=1.0, rate_max=4.0,
)
)
salary_mid: SalaryTierConfig = Field(
default_factory=lambda: SalaryTierConfig(
name="mid", share=0.35,
min_cents=600_000, max_cents=800_000,
rate_min=3.5, rate_max=8.5,
rate_min=4.0, rate_max=7.0,
)
)
salary_senior: SalaryTierConfig = Field(
default_factory=lambda: SalaryTierConfig(
name="senior", share=0.15,
min_cents=1_000_000, max_cents=1_500_000,
rate_min=5.5, rate_max=10.0,
rate_min=7.0, rate_max=10.0,
)
)

View file

@ -74,13 +74,16 @@ def dispatch_event(db: Session, event: SimEvent, sim_time: datetime, company_id:
"""Route event to appropriate handler. Returns result dict."""
if event.event_type == EventType.TASK_HALF_PROGRESS:
result = handle_task_half(db, event)
return {"type": "task_half", "task_id": str(result.task_id), "handled": result.handled}
# Recalculate ETAs so the next milestone is scheduled
from ..config import get_world_config
recalculate_etas(db, company_id, sim_time, milestones=get_world_config().task_progress_milestones)
return {"type": "task_half", "task_id": str(result.task_id), "milestone_pct": result.milestone_pct, "handled": result.handled}
elif event.event_type == EventType.TASK_COMPLETED:
result = handle_task_complete(db, event, sim_time)
# Recalculate ETAs — freed employees change topology
from ..config import get_world_config
recalculate_etas(db, company_id, sim_time, half_threshold=get_world_config().task_half_threshold)
recalculate_etas(db, company_id, sim_time, milestones=get_world_config().task_progress_milestones)
return {
"type": "task_completed",
"task_id": str(result.task_id),


@@ -185,15 +185,20 @@ def recalculate_etas(
company_id: UUID,
now: datetime,
impacted_task_ids: Optional[Set[UUID]] = None,
milestones: Optional[List[float]] = None,
# Legacy single-threshold parameter — ignored if milestones is provided.
half_threshold: float = 0.5,
) -> None:
"""Recalculate projection events for active tasks.
1. Delete stale projection events for impacted tasks (or all if None).
2. Compute effective rates.
3. For each active task, solve completion and halfway times.
3. For each active task, solve completion and milestone times.
4. Insert new projection events.
"""
if milestones is None:
milestones = [half_threshold]
# Determine which tasks to recalculate
if impacted_task_ids is None:
active_tasks = db.query(Task).filter(
@@ -240,18 +245,26 @@ def recalculate_etas(
dedupe_key=f"task:{tid}:completed",
)
# Halfway ETA (only if not already emitted)
if not task.halfway_event_emitted:
halfway_time = solve_task_halfway_time(db, tid, now, rates, half_threshold=half_threshold)
if halfway_time is not None:
# Progress milestone ETAs — skip milestones already emitted
emitted_pct = task.progress_milestone_pct or 0
for milestone in sorted(milestones):
milestone_pct = int(milestone * 100)
if milestone_pct <= emitted_pct:
continue
milestone_time = solve_task_halfway_time(db, tid, now, rates, half_threshold=milestone)
if milestone_time is not None:
insert_event(
db,
company_id=company_id,
event_type=EventType.TASK_HALF_PROGRESS,
scheduled_at=halfway_time,
payload={"task_id": str(tid)},
dedupe_key=f"task:{tid}:half",
scheduled_at=milestone_time,
payload={"task_id": str(tid), "milestone_pct": milestone_pct},
dedupe_key=f"task:{tid}:milestone:{milestone_pct}",
)
# Only insert the next upcoming milestone — it will be the
# earliest event; once consumed, recalculate_etas runs again
# and inserts the following one.
break
db.flush()


@@ -1,4 +1,4 @@
"""Handler for task_half_progress events."""
"""Handler for task progress milestone events."""
from __future__ import annotations
from dataclasses import dataclass
@@ -14,17 +14,19 @@ from ...db.models.task import Task
class TaskHalfResult:
task_id: UUID
handled: bool
milestone_pct: int
def handle_task_half(db: Session, event: SimEvent) -> TaskHalfResult:
"""Mark the task's halfway_event_emitted flag as True."""
"""Record the progress milestone on the task."""
task_id = UUID(event.payload["task_id"])
milestone_pct = event.payload.get("milestone_pct", 50)
task = db.query(Task).filter(Task.id == task_id).one_or_none()
if task is None:
return TaskHalfResult(task_id=task_id, handled=False)
return TaskHalfResult(task_id=task_id, handled=False, milestone_pct=milestone_pct)
task.halfway_event_emitted = True
task.progress_milestone_pct = max(task.progress_milestone_pct or 0, milestone_pct)
db.flush()
return TaskHalfResult(task_id=task_id, handled=True)
return TaskHalfResult(task_id=task_id, handled=True, milestone_pct=milestone_pct)


@@ -10,13 +10,10 @@ from sqlalchemy.orm import mapped_column
from ..base import Base
class Domain(str, Enum):
SYSTEM = "system"
RESEARCH = "research"
DATA = "data"
FRONTEND = "frontend"
BACKEND = "backend"
INFERENCE = "inference"
DATA_ENVIRONMENT = "data_environment"
TRAINING = "training"
HARDWARE = "hardware"
class Company(Base):
__tablename__ = "companies"


@@ -30,6 +30,11 @@ class Employee(Base):
String(255),
nullable=False,
)
tier = mapped_column(
String(20),
nullable=False,
default="junior",
)
work_hours_per_day = mapped_column(
Numeric(5, 2),
nullable=False,


@@ -45,10 +45,6 @@ class Task(Base):
String(255),
nullable=False,
)
description = mapped_column(
String,
nullable=False,
)
required_prestige = mapped_column(
Integer,
nullable=False,
@@ -81,11 +77,11 @@
Boolean,
nullable=True,
)
halfway_event_emitted = mapped_column(
Boolean,
progress_milestone_pct = mapped_column(
Integer,
nullable=False,
default=False,
server_default=text("false"),
default=0,
server_default=text("0"),
)
class TaskRequirement(Base):


@@ -18,13 +18,10 @@ SPARK_CHARS = "▁▂▃▄▅▆▇█"
# Domain → (display name, color) for styled inline display
DOMAIN_STYLE = {
"system": ("System", "bright_cyan"),
"research": ("Research", "bright_magenta"),
"data": ("Data", "bright_blue"),
"frontend": ("Frontend", "bright_yellow"),
"backend": ("Backend", "bright_green"),
"training": ("Training", "red"),
"hardware": ("Hardware", "white"),
"research": ("Research", "bright_magenta"),
"inference": ("Inference", "bright_cyan"),
"data_environment": ("Data/Env", "bright_blue"),
"training": ("Training", "red"),
}
@@ -132,7 +129,7 @@ def _query_detailed_snapshot(db_factory, company_id) -> dict[str, Any]:
]
deadline_str = t.deadline.strftime("%Y-%m-%d") if t.deadline else "-"
tasks_detail.append(TaskInfo(
title=t.title,
title=t.title[:20],
status=status.value,
prestige=t.required_prestige,
reward_dollars=t.reward_funds_cents / 100.0,


@@ -1,5 +1,6 @@
from __future__ import annotations
import math
from dataclasses import dataclass
from ..config.schema import WorldConfig
@@ -7,6 +8,18 @@ from ..db.models.company import Domain
from .rng import RngStreams, sample_right_skew_triangular_int
_ALL_DOMAINS = list(Domain)
_NUM_DOMAINS = len(_ALL_DOMAINS)
# Fixed tier composition for a 10-person startup.
# Repeated to cover any employee count via modular indexing.
_TIER_SEQUENCE = [
"junior", "junior", "junior", "junior", "junior",
"mid", "mid", "mid",
"senior", "senior",
]
_MIN_RATE = 1.0
_MAX_RATE = 10.0
@dataclass(frozen=True)
@@ -22,16 +35,6 @@ def _salary_tiers(cfg):
return (cfg.salary_junior, cfg.salary_mid, cfg.salary_senior)
def _pick_tier_name(rng, cfg):
x = rng.random()
acc = 0.0
for tier in _salary_tiers(cfg):
acc += tier.share
if acc >= x:
return tier.name
return _salary_tiers(cfg)[-1].name
def _tier_by_name(cfg, tier_name):
for tier in _salary_tiers(cfg):
if tier.name == tier_name:
@@ -44,10 +47,49 @@ def _sample_salary_cents(rng, cfg, tier_name):
return sample_right_skew_triangular_int(rng, tier.min_cents, tier.max_cents)
def _sample_rates_by_domain(rng, cfg, tier_name):
tier = _tier_by_name(cfg, tier_name)
lo, hi = tier.rate_min, tier.rate_max
return {domain: round(rng.uniform(lo, hi), 4) for domain in _ALL_DOMAINS}
def _dirichlet_sample(rng, alpha, k):
"""Sample from Dirichlet(alpha, ..., alpha) with k components."""
raw = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
total = sum(raw)
if total == 0:
return [1.0 / k] * k
return [x / total for x in raw]
def _distribute_rates(rng, avg_rate, dirichlet_alpha=0.3):
"""Distribute a rate budget across domains with spiky concentration.
Each domain gets at least _MIN_RATE. The extra budget is split via
Dirichlet(alpha) so that one or two domains can be dramatically higher
than the rest: a junior can secretly be a superstar in one domain.
Individual rates are capped at _MAX_RATE.
"""
total_budget = avg_rate * _NUM_DOMAINS
extra = total_budget - _NUM_DOMAINS * _MIN_RATE
if extra <= 0:
return [_MIN_RATE] * _NUM_DOMAINS
proportions = _dirichlet_sample(rng, dirichlet_alpha, _NUM_DOMAINS)
rates = [_MIN_RATE + extra * p for p in proportions]
# Cap at _MAX_RATE and redistribute excess iteratively.
for _ in range(5):
overflow = 0.0
uncapped = []
for i in range(_NUM_DOMAINS):
if rates[i] > _MAX_RATE:
overflow += rates[i] - _MAX_RATE
rates[i] = _MAX_RATE
else:
uncapped.append(i)
if overflow <= 0 or not uncapped:
break
share = overflow / len(uncapped)
for i in uncapped:
rates[i] += share
return [round(r, 4) for r in rates]
def generate_employees(*, run_seed, count, cfg=None):
@@ -56,12 +98,25 @@ def generate_employees(*, run_seed, count, cfg=None):
if count <= 0:
return []
employees = []
streams = RngStreams(run_seed)
# Build and shuffle tier assignments.
tier_rng = streams.stream("tier_assignment")
seq_len = len(_TIER_SEQUENCE)
tiers = [_TIER_SEQUENCE[i % seq_len] for i in range(count)]
tier_rng.shuffle(tiers)
employees = []
for idx in range(1, count + 1):
rng = streams.stream(f"employee_{idx}")
tier_name = _pick_tier_name(rng, cfg)
tier_name = tiers[idx - 1]
tier_cfg = _tier_by_name(cfg, tier_name)
# Sample average rate uniformly within the tier's range.
avg_rate = rng.uniform(tier_cfg.rate_min, tier_cfg.rate_max)
domain_rates = _distribute_rates(rng, avg_rate)
rates = dict(zip(_ALL_DOMAINS, domain_rates))
employees.append(
GeneratedEmployee(
@@ -69,7 +124,7 @@
work_hours_per_day=cfg.work_hours_per_day,
salary_cents=_sample_salary_cents(rng, cfg, tier_name),
tier=tier_name,
rates_by_domain=_sample_rates_by_domain(rng, cfg, tier_name),
rates_by_domain=rates,
)
)
return employees
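The two generation changes above (fixed tier composition via modular indexing plus shuffle, and a Dirichlet split of the rate budget) can be re-sketched standalone. The constants and helper names below are abbreviations for illustration, not the module's real identifiers, and Domain keys are elided:

```python
import random

MIN_RATE, MAX_RATE, NUM_DOMAINS = 1.0, 10.0, 4
TIER_SEQUENCE = ["junior"] * 5 + ["mid"] * 3 + ["senior"] * 2  # fixed 10-person mix

def assign_tiers(rng: random.Random, count: int) -> list[str]:
    """Repeat the fixed composition via modular indexing, then shuffle in place."""
    tiers = [TIER_SEQUENCE[i % len(TIER_SEQUENCE)] for i in range(count)]
    rng.shuffle(tiers)  # order is random, but the tier counts are fixed
    return tiers

def distribute_rates(rng: random.Random, avg_rate: float, alpha: float = 0.3) -> list[float]:
    """Split avg_rate * NUM_DOMAINS across domains: every domain keeps the
    MIN_RATE floor, and the remaining budget is divided by a Dirichlet(alpha)
    draw; a low alpha concentrates most of it in one or two domains."""
    extra = avg_rate * NUM_DOMAINS - NUM_DOMAINS * MIN_RATE
    if extra <= 0:
        return [MIN_RATE] * NUM_DOMAINS
    raw = [rng.gammavariate(alpha, 1.0) for _ in range(NUM_DOMAINS)]
    total = sum(raw) or 1.0  # guard against an all-zero draw
    rates = [MIN_RATE + extra * (x / total) for x in raw]
    for _ in range(5):  # cap at MAX_RATE and redistribute the overflow
        overflow = sum(r - MAX_RATE for r in rates if r > MAX_RATE)
        uncapped = [i for i, r in enumerate(rates) if r <= MAX_RATE]
        rates = [min(r, MAX_RATE) for r in rates]
        if overflow <= 0 or not uncapped:
            break
        for i in uncapped:
            rates[i] += overflow / len(uncapped)
    return [round(r, 4) for r in rates]
```

For `count=20` the shuffle preserves a 10/6/4 junior/mid/senior split, and for an `avg_rate` of 3.0 the four rates still sum to the 12-unit budget while one domain usually captures most of the extra.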


@@ -8,13 +8,11 @@ from ..config.sampling import sample_from_spec
from ..config.schema import WorldConfig
from ..db.models.company import Domain
from .rng import RngStreams, sample_without_replacement
from .task_catalog import pick_task_text
@dataclass(frozen=True)
class GeneratedTask:
title: str
description: str
required_prestige: int
reward_funds_cents: int
reward_prestige_delta: float
@@ -25,7 +23,7 @@ class GeneratedTask:
deadline: datetime | None
completed_at: datetime | None
success: bool | None
halfway_event_emitted: bool
progress_milestone_pct: int
requirements: dict[str, int]
@@ -71,18 +69,9 @@ def _sample_requirements(rng, cfg):
return {domain: _sample_required_qty(rng, cfg) for domain in picked_domains}
def _pick_title_desc(rng, primary_domain, serial):
title, description = pick_task_text(rng, primary_domain)
domain_str = primary_domain.value if hasattr(primary_domain, "value") else str(primary_domain)
title = f"{title} [{domain_str.upper()}-{serial}]"
return title, description
def _make_task(rng, cfg, prestige, serial, requirements):
title, description = _pick_title_desc(rng, next(iter(requirements)), serial)
return GeneratedTask(
title=title,
description=description,
title=f"Task-{serial}",
required_prestige=prestige,
reward_funds_cents=_sample_reward_funds_cents(rng, cfg, prestige=prestige),
reward_prestige_delta=_sample_reward_prestige_delta(rng, cfg),
@@ -93,7 +82,7 @@ def _make_task(rng, cfg, prestige, serial, requirements):
deadline=None,
completed_at=None,
success=None,
halfway_event_emitted=False,
progress_milestone_pct=0,
requirements=requirements,
)
@@ -122,7 +111,6 @@ def build_task_rows(*, run_seed, count, cfg=None):
for task in generated:
task_rows.append({
"title": task.title,
"description": task.description,
"required_prestige": task.required_prestige,
"reward_funds_cents": task.reward_funds_cents,
"reward_prestige_delta": task.reward_prestige_delta,
@@ -133,7 +121,7 @@
"deadline": task.deadline,
"completed_at": task.completed_at,
"success": task.success,
"halfway_event_emitted": task.halfway_event_emitted,
"progress_milestone_pct": task.progress_milestone_pct,
})
for domain, qty in task.requirements.items():
requirement_rows.append({


@@ -63,6 +63,7 @@ def _seed_employees(db, company, req):
id=uuid4(),
company_id=company.id,
name=emp.name,
tier=emp.tier,
work_hours_per_day=emp.work_hours_per_day,
salary_cents=emp.salary_cents,
)
@@ -86,7 +87,6 @@ def _seed_market_tasks(db, company, req):
company_id=None,
status=TaskStatus.MARKET,
title=task.title,
description=task.description,
required_prestige=task.required_prestige,
reward_funds_cents=task.reward_funds_cents,
reward_prestige_delta=task.reward_prestige_delta,
@@ -95,7 +95,7 @@
deadline=None,
completed_at=None,
success=None,
halfway_event_emitted=False,
progress_milestone_pct=0,
)
db.add(task_row)


@@ -1,365 +0,0 @@
"""Realistic AI-startup task titles and descriptions, keyed by domain.
Each domain has a pool of (title, description) tuples. The generator picks
from these deterministically using the seeded RNG, cycling if the pool is
exhausted.
"""
from __future__ import annotations
from ..db.models.company import Domain
TASK_POOL: dict[Domain, list[tuple[str, str]]] = {
Domain.SYSTEM: [
(
"Set Up GPU-Aware K8s Cluster with Auto-Scaling",
"Deploy a Kubernetes cluster with NVIDIA GPU operator, node auto-scaling based on inference queue depth, and spot instance fallback for training workloads.",
),
(
"Build CI/CD Pipeline for ML Model Registry",
"Create a CI pipeline that runs training validation, pushes versioned model artifacts to a registry, and auto-deploys to a staging inference endpoint.",
),
(
"Implement Blue-Green Deployment for LLM Serving",
"Set up zero-downtime model swaps for a vLLM serving cluster with automated rollback triggered by latency and error-rate thresholds.",
),
(
"Deploy Observability Stack for AI Workloads",
"Stand up Grafana, Prometheus, and OpenTelemetry with custom dashboards tracking GPU utilization, token throughput, time-to-first-token, and per-request cost.",
),
(
"Terraform Multi-Region Inference Infrastructure",
"Write IaC modules to provision inference endpoints across 3+ regions with global load balancing, failover routing, and centralized logging.",
),
(
"Container Image Optimization for ML Serving",
"Reduce Docker image sizes for PyTorch/CUDA serving containers from 15 GB to under 4 GB using multi-stage builds and distroless bases to cut cold-start times.",
),
(
"Implement Secret Rotation and API Key Management",
"Build an automated secret rotation system for API keys, database credentials, and model provider tokens across staging and production environments.",
),
(
"Set Up Cost Monitoring and GPU Budget Alerts",
"Integrate cloud billing APIs with a dashboard showing per-team GPU spend, cost-per-inference breakdowns, and automated alerts when daily spend exceeds thresholds.",
),
(
"Build Canary Release Pipeline for Embedding Models",
"Implement a canary deployment system that gradually shifts traffic to new embedding model versions, comparing retrieval quality metrics in real time.",
),
(
"Migrate Inference Workloads to Serverless GPU",
"Evaluate and migrate bursty inference workloads to serverless GPU providers, benchmarking cold-start latency against always-on instances.",
),
(
"Implement Disaster Recovery for Training Checkpoints",
"Design a cross-region checkpoint backup system with automated integrity verification, ensuring training runs can resume within 15 minutes of any single-region failure.",
),
(
"Build Internal Developer Platform for ML Engineers",
"Create a self-service portal where ML engineers can request GPU instances, spin up Jupyter environments, and launch training jobs without touching infrastructure.",
),
],
Domain.RESEARCH: [
(
"Design Benchmark for Legal Document QA",
"Create a benchmark suite of 2,000+ annotated legal questions across contract law and compliance, with human-expert baselines and an automated evaluation harness.",
),
(
"Investigate MoE Routing for Multilingual Models",
"Research and prototype alternative Mixture-of-Experts routing strategies that improve expert utilization for low-resource languages without degrading high-resource performance.",
),
(
"Reproduce and Extend Speculative Decoding Results",
"Replicate speculative decoding paper results on Llama-3 class models, then test novel draft model architectures that improve acceptance rates on code generation.",
),
(
"Develop RAG Hallucination Detection Framework",
"Build a systematic evaluation pipeline measuring faithfulness, relevance, and attribution accuracy for retrieval-augmented generation systems.",
),
(
"Prototype LoRA Merging for Multi-Tenant Serving",
"Research methods for dynamically composing multiple LoRA adapters at inference time, measuring quality degradation versus serving separate fine-tuned models.",
),
(
"Benchmark Long-Context Retrieval Across 128K Models",
"Systematically evaluate needle-in-a-haystack and multi-hop reasoning performance across frontier models at various context lengths with reproducible results.",
),
(
"Investigate Synthetic Data Quality for Code Generation",
"Develop automated quality scoring methods for synthetically generated code training data, correlating filter thresholds with downstream model performance.",
),
(
"Research KV-Cache Compression Techniques",
"Prototype and benchmark KV-cache eviction and quantization strategies for long-running conversational agents under fixed memory budgets.",
),
(
"Build Ablation Study Framework for Prompt Engineering",
"Create an experimentation harness for testing prompt variations across multiple models and tasks with statistical significance testing and cost tracking.",
),
(
"Explore Constitutional AI for Domain-Specific Safety",
"Adapt constitutional AI methods to create a self-improving safety filter for a healthcare chatbot, defining domain-specific principles and measuring accuracy.",
),
(
"Develop Novel Chunking Strategies for Technical RAG",
"Research and benchmark alternative document chunking methods—semantic, AST-aware, sliding window—specifically for API documentation and code repositories.",
),
(
"Prototype Test-Time Compute Scaling for Math Reasoning",
"Implement best-of-N sampling, tree search, and self-verification approaches for math reasoning, measuring the compute-accuracy Pareto frontier.",
),
],
Domain.DATA: [
(
"Build Web Scraping Pipeline for Industry News Corpus",
"Design a pipeline that crawls 50+ AI/tech news sources daily, deduplicates articles, extracts structured metadata, and loads clean text into a vector store.",
),
(
"Create Annotation Platform for Dialogue Quality",
"Build an annotation workflow where human raters score LLM conversation logs on helpfulness, accuracy, and safety, with inter-rater agreement tracking.",
),
(
"Implement PII Detection and Redaction Pipeline",
"Deploy a pipeline to detect and redact personally identifiable information from training data, with audit logging and configurable redaction strategies.",
),
(
"Curate Instruction-Tuning Dataset from Internal Docs",
"Extract, clean, and convert 10,000+ pages of internal documentation into high-quality instruction-response pairs suitable for fine-tuning.",
),
(
"Build Data Quality Monitoring for Feature Store",
"Implement data validation checks on streaming feature pipelines, alerting on schema drift, null-rate spikes, and distribution shifts before they affect models.",
),
(
"Design ETL Pipeline for Multi-Modal Training Data",
"Build a DAG pipeline that ingests images, PDFs, and structured data, applies OCR and layout detection, and produces unified records for vision-language training.",
),
(
"Implement Deduplication for Large Text Corpora",
"Deploy MinHash LSH-based near-deduplication at scale for 100M+ documents with configurable similarity thresholds and a review UI for borderline cases.",
),
(
"Build Synthetic Data Pipeline for Rare Edge Cases",
"Create a system that uses frontier LLMs to generate realistic synthetic examples for underrepresented categories in a classification dataset.",
),
(
"Create Data Versioning and Lineage Tracking System",
"Set up data versioning integrated with the ML training pipeline so every model checkpoint can be traced back to the exact dataset snapshot used.",
),
(
"Build Customer Feedback Loop into Training Pipeline",
"Implement a system where end-user thumbs-up/down signals are routed, reviewed, and selectively incorporated into fine-tuning datasets with human approval.",
),
(
"Migrate Legacy Warehouse to ML-Ready Lakehouse",
"Transform and migrate 5 years of product analytics data from a legacy SQL warehouse into a Parquet-based lakehouse optimized for feature engineering.",
),
],
Domain.FRONTEND: [
(
"Build Interactive LLM Playground with Streaming",
"Create a web app where users test multiple LLM providers side-by-side with streaming output, adjustable parameters, and conversation history persistence.",
),
(
"Design Admin Dashboard for AI Agent Monitoring",
"Build a dashboard showing real-time agent execution traces, tool call sequences, token usage graphs, and cost breakdowns with drill-down filtering.",
),
(
"Create Document Chat Interface for RAG Product",
"Implement a drag-and-drop document upload UI with a conversational interface showing source citations, confidence indicators, and reference highlighting.",
),
(
"Build Annotation Review and Approval Interface",
"Design a UI for data team leads to review annotator work, resolve disagreements, view agreement stats, and approve batches for training inclusion.",
),
(
"Implement Prompt Management Studio",
"Build a collaborative app where teams version, test, and A/B deploy prompt templates with visual diffs, rollback, and per-version performance analytics.",
),
(
"Create Customer-Facing AI Usage Analytics Dashboard",
"Build an embeddable dashboard showing API call volumes, latency percentiles, token consumption, and cost trends for enterprise customers.",
),
(
"Build Visual Pipeline Editor for No-Code AI Workflows",
"Create a node-based drag-and-drop editor where non-technical users chain data sources, LLM calls, and output actions into automated AI workflows.",
),
(
"Design Chat Widget for Website Embedding",
"Build a lightweight, brandable chat widget under 50 KB that customers embed on their sites, with streaming responses and escalation-to-human capability.",
),
(
"Build Model Comparison Results Viewer",
"Create a web interface displaying benchmark results across models in interactive tables and charts with filtering by task type and model size.",
),
(
"Implement Real-Time Collaboration for AI Writing Tool",
"Add multiplayer editing to an AI writing tool using CRDTs, with per-user cursors, AI suggestion tracking, and version history.",
),
(
"Create Enterprise RAG Onboarding Wizard",
"Build a step-by-step setup wizard guiding enterprise customers through connecting data sources, configuring chunking, testing retrieval, and deploying their endpoint.",
),
],
Domain.BACKEND: [
(
"Build Multi-Tenant LLM Gateway with Rate Limiting",
"Implement an API gateway that proxies requests to multiple LLM providers, enforces per-tenant rate limits, tracks usage, and handles automatic failover.",
),
(
"Implement OAuth2 + SAML SSO for Enterprise Platform",
"Add enterprise authentication supporting SAML 2.0, OIDC, and SCIM provisioning for customers integrating with their identity provider.",
),
(
"Design Webhook System for Async AI Job Completion",
"Build a reliable webhook delivery system with exponential backoff, signature verification, dead letter queue, and a webhook management API.",
),
(
"Create Unified Embedding API with Caching Layer",
"Build a microservice abstracting over multiple embedding providers with a Redis-backed cache, batch processing, and automatic model version migration.",
),
(
"Build Conversation Memory Service for Multi-Session Agents",
"Implement a service that stores, summarizes, and retrieves conversation history across sessions using structured storage and semantic vector search.",
),
(
"Implement Usage-Based Billing with Stripe Integration",
"Build a metering system that tracks token consumption per customer, aggregates monthly invoices, and syncs with Stripe for automated usage-based charging.",
),
(
"Create Plugin Marketplace Backend",
"Design the API and data model for a marketplace where third-party developers register, version, and distribute plugins for the AI platform.",
),
(
"Build RAG Ingestion Service with Chunking and Indexing",
"Implement an async document processing service that accepts PDFs, DOCX, and HTML, chunks them, generates embeddings, and upserts into a vector store.",
),
(
"Implement Audit Logging and Compliance API",
"Build a tamper-evident audit log system recording all AI interactions and admin actions, with an API for compliance queries and SOC 2 / HIPAA exports.",
),
(
"Design Multi-Model Routing and Fallback Service",
"Create a smart routing layer directing requests to the optimal model based on task complexity, latency requirements, and cost, with provider failover.",
),
(
"Build File Processing Service for Vision-Language Models",
"Implement an async service that accepts images and documents, runs them through vision-language models for extraction, and returns structured JSON output.",
),
(
"Implement Streaming API with Server-Sent Events",
"Build an SSE-based streaming endpoint for LLM responses with connection resumption, partial response caching, and graceful degradation.",
),
],
Domain.TRAINING: [
(
"Fine-Tune Llama-3 8B for Domain-Specific Support",
"Run supervised fine-tuning on 50K curated customer support conversations using QLoRA, targeting 15% accuracy improvement over the base model.",
),
(
"Implement RLHF Pipeline for Code Generation Model",
"Build an end-to-end RLHF pipeline with a reward model trained on human preference data and PPO training loop evaluated against HumanEval.",
),
(
"Distill GPT-4 Class Model into Efficient 3B Model",
"Use knowledge distillation with synthetic data to create a compact model retaining 90%+ teacher performance on targeted tasks at 10x lower inference cost.",
),
(
"Train Custom Embedding Model for Vertical Search",
"Fine-tune a sentence-transformers model on domain-specific query-document pairs with contrastive learning, hard negative mining, and retrieval benchmarks.",
),
(
"Build Hyperparameter Search for Fine-Tuning Jobs",
"Implement an Optuna-based HPO system searching over learning rate, LoRA rank, batch size, and data mixing ratios with early stopping.",
),
(
"Run Continued Pre-Training on Proprietary Corpus",
"Execute continued pre-training of a 7B base model on 10B tokens of domain-specific text with careful learning rate scheduling to avoid catastrophic forgetting.",
),
(
"Train Reward Model from Preference Annotations",
"Collect and process 20K pairwise preference annotations, train a Bradley-Terry reward model, and validate calibration against held-out human judgments.",
),
(
"Build Multi-GPU Training Infra with DeepSpeed",
"Set up distributed training using DeepSpeed ZeRO Stage 3 across an 8-node GPU cluster with checkpoint sharding and fault-tolerant resumption.",
),
(
"Implement DPO Fine-Tuning Pipeline",
"Build a Direct Preference Optimization pipeline as a simpler RLHF alternative, comparing quality and training stability on the same preference dataset.",
),
(
"Train Vision-Language Adapter for Document Understanding",
"Fine-tune a LoRA adapter on a VLM for extracting structured data from invoices, receipts, and forms with 95%+ field-level accuracy.",
),
(
"Build Eval-Driven Training Loop with Auto Checkpointing",
"Implement a training harness that runs benchmarks every N steps, auto-saves the best checkpoint, detects instability, and alerts on loss spikes.",
),
(
"Fine-Tune Whisper for Industry-Specific Transcription",
"Adapt Whisper-large for medical dictation using 500 hours of labeled audio, targeting 30% WER reduction on domain-specific terminology.",
),
],
Domain.HARDWARE: [
(
"Optimize LLM Inference Latency with TensorRT-LLM",
"Convert a 70B model to TensorRT-LLM with INT8/FP8 quantization, continuous batching, and paged attention, targeting sub-200ms time-to-first-token.",
),
(
"Deploy On-Device ML Model for Mobile Classification",
"Convert a PyTorch vision model to Core ML and TFLite, optimize with quantization-aware training, and benchmark on iPhone and Pixel hardware.",
),
(
"Build GPU Cluster Scheduling with Fair-Share Queuing",
"Implement a scheduler for a shared GPU cluster enforcing per-team quotas, priority queuing, preemption policies, and utilization-based chargeback.",
),
(
"Implement Quantization Pipeline (GPTQ/AWQ/GGUF)",
"Build an automated pipeline that takes any model, produces GPTQ, AWQ, and GGUF quantized variants, runs quality regression, and publishes passing models.",
),
(
"Deploy Edge Inference for Real-Time Video Analytics",
"Set up an NVIDIA Jetson-based inference node running YOLO and a lightweight LLM for on-premises real-time camera analysis with local data processing.",
),
(
"Optimize vLLM Serving for Production Workload",
"Profile and tune vLLM parameters—max batch size, KV cache, swap space, tensor parallelism—for target throughput at P99 latency SLA.",
),
(
"Build Multi-GPU Inference with Tensor Parallelism",
"Configure and benchmark a 70B+ model serving across 4-8 GPUs with tensor and pipeline parallelism, optimizing throughput versus latency tradeoffs.",
),
(
"Implement Dynamic Batching for Inference Requests",
"Build a request batching layer that groups incoming requests by sequence length and priority, maximizing GPU utilization within per-request latency SLAs.",
),
(
"Design Hybrid CPU/GPU Inference Architecture",
"Architect a system routing lightweight requests to CPU inference and complex requests to GPU instances, reducing overall compute cost by 40%.",
),
(
"Set Up Triton Inference Server for Multi-Model Serving",
"Deploy NVIDIA Triton to serve embedding, reranking, and generation models on shared GPU infrastructure with dynamic batching and concurrency control.",
),
(
"Build GPU Health Monitoring and Failover System",
"Implement a daemon detecting GPU memory errors, thermal throttling, and NVLink degradation, automatically draining affected nodes and redistributing workloads.",
),
(
"Benchmark Specialized AI Accelerators vs H100",
"Evaluate Groq, Cerebras, and custom ASICs against H100 GPUs, producing a cost-per-token and latency comparison with a migration recommendation.",
),
(
"Implement Speculative Decoding in Production Stack",
"Integrate speculative decoding with a small draft model into the existing serving infrastructure, measuring real-world throughput improvement.",
),
],
}
def pick_task_text(rng, domain: Domain) -> tuple[str, str]:
"""Deterministically pick a (title, description) for *domain* using *rng*."""
pool = TASK_POOL[domain]
idx = rng.randint(0, len(pool) - 1)
return pool[idx]