Updated backend to calculate employee tier with spiky skill distribution; simplified domain count to 4

This commit is contained in:
Muyu He 2026-03-05 18:12:48 -08:00
parent 6d6f0a855d
commit eb18c5a90c
22 changed files with 226 additions and 864 deletions

README.md
View file

@ -2,180 +2,7 @@
A long-horizon deterministic benchmark for LLM agents. The agent plays CEO of an AI startup over a simulated 13-year run, operating exclusively through a CLI tool against a SQLite-backed discrete-event simulation.
The benchmark tests whether agents can manage compounding decisions: prestige specialisation, employee allocation, cash flow, and deadline risk - sustained over hundreds of turns.
---
## Simulation Dynamics
![YC Bench Architecture](imgs/arch.png "Architecture YC-Bench")
<!-- ```
┌─────────────────────────────────────────────────────────────────────────┐
│ AGENT (LLM) │
│ │
│ Observes: company status · employee skills · market tasks · ledger │
│ Acts via: run_command("yc-bench <cmd>") · scratchpad (persistent) │
└───────────────────────┬─────────────────────────────────────────────────┘
│ CLI commands (JSON responses)
┌─────────────────────────────────────────────────────────────────────────┐
│ DISCRETE-EVENT SIMULATION │
│ │
│ ┌─────────────┐ accept ┌──────────┐ assign+dispatch │
│ │ MARKET │ ──────────► │ PLANNED │ ──────────────────► │
│ │ 100 tasks │ └──────────┘ │
│ └─────────────┘ │
│ ▲ replenish ┌──────────────────────┐ │
│ │ │ ACTIVE │ │
│ │ ┌────────────────────────── │ progress flushes │ │
│ │ │ │ every sim-advance │ │
│ │ │ └──────────┬───────────┘ │
│ │ │ ┌───────────────────────────────────┘ │
│ │ │ │ ETA solver fires TASK_COMPLETED event │
│ │ │ ▼ │
│ │ │ ┌────────────────────────────────────────────────────┐ │
│ │ │ │ TASK_COMPLETED handler │ │
│ │ │ │ │ │
│ │ │ │ on_time? YES → +reward_funds +prestige_delta │ │
│ │ │ │ +skill_boost +salary_bump │ │
│ │ │ │ NO → -1.4× prestige_delta (penalty) │ │
│ └───┘ └─────────────────────┬───────────────────────────── ┘ │
│ │ │
│ ┌──────────────────────────────────┘ │
│ │ Monthly payroll (1st biz day) Bankruptcy check (funds < 0)
│ │ Horizon end (13 years) Context truncation (last 20 rounds)│
└──┴──────────────────────────────────────────────────────────────────────┘
``` -->
### Core loop
1. Agent calls `yc-bench sim resume` to advance time to the next event.
2. The engine flushes task progress, fires due events, applies payroll.
3. Agent reads wake events and decides: accept tasks, assign employees, dispatch, cancel.
4. Repeat until bankruptcy or horizon end.
If the agent doesn't call `sim resume` for N consecutive turns (default 10), the loop forces one automatically.
---
## Economy
### Funds
- Start: **$250,000** (`initial_funds_cents = 25_000_000`)
- Payroll deducted on the **first business day of each month**
- Task reward formula: `base × (1 + reward_prestige_scale × (prestige_req - 1))`
- Base: triangular sample in [$5K, $100K], mode $30K
- `reward_prestige_scale = 0.55` (default): a prestige-8 task pays ~4.85× as much as a prestige-1 task
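The scaling above can be sanity-checked in a few lines (a sketch; `task_reward_cents` is a hypothetical helper, not part of the package's API):

```python
def task_reward_cents(base_cents: int, prestige_req: int,
                      reward_prestige_scale: float = 0.55) -> int:
    """Reward formula from this README: base * (1 + scale * (prestige_req - 1))."""
    return round(base_cents * (1 + reward_prestige_scale * (prestige_req - 1)))

# At the $30K mode, a prestige-8 task pays 4.85x the prestige-1 amount:
print(task_reward_cents(3_000_000, 1))  # 3000000 cents  ($30,000)
print(task_reward_cents(3_000_000, 8))  # 14550000 cents ($145,500)
```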
### Monthly payroll (5 employees, fast_test)
| Tier | Share | Salary/month | Skill rate |
|------|-------|-------------|------------|
| Junior | 50% | $2K-$4K | 1.0-6.5 units/hr |
| Mid | 35% | $6K-$8K | 3.5-8.5 units/hr |
| Senior | 15% | $10K-$15K | 5.5-10.0 units/hr |
Monthly payroll ≈ **$32K** (5 employees). Starting runway ≈ **7.8 months**.
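The runway figure follows directly from those numbers (a back-of-envelope sketch; the engine itself stores money as integer cents):

```python
initial_funds_cents = 25_000_000   # $250,000 starting cash
monthly_payroll_cents = 3_200_000  # ~$32,000/month for 5 employees

runway_months = initial_funds_cents / monthly_payroll_cents
print(f"{runway_months:.1f} months")  # 7.8 months
```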
### Task completion rewards
On success:
- Funds += `reward_funds_cents`
- Prestige += `reward_prestige_delta` (beta-distributed, typically 0.1-1.5) per required domain
- Skill rate += `skill_boost_pct × current_rate` per assigned employee per domain
- Salary += `1% × current_salary` per assigned employee (compounding payroll pressure)
On failure (past deadline):
- Prestige -= `1.4 × reward_prestige_delta` per domain
On cancel:
- Prestige -= `2.0 × reward_prestige_delta` per domain
---
## Prestige
7 domains: `system · research · data · frontend · backend · training · hardware`
- Range: **[1.0, 10.0]** per domain, starts at 1.0
- Tasks require a minimum prestige level. Agent can only accept tasks where `max(company_prestige) >= required_prestige`.
- Default distribution: mode=4, so most tasks need prestige 3-5.
- First 10 market tasks are stratified `[1,1,1,1,2,2,2,3,3,4]` to bootstrap progression.
Specialising in 2-3 domains unlocks progressively higher-reward tasks. Spreading thin keeps you locked at low prestige everywhere.
---
## Employee throughput
Each employee has a skill rate (units/hr) per domain.
When an employee is assigned to N active tasks simultaneously:
```
effective_rate_per_task = base_rate / N
```
Assigning one senior (rate 8.0) to 4 tasks gives 2.0 units/hr each — often worse than a junior focused on one.
Task completion time = `max(remaining[d] / effective_rate[d])` across all required domains.
Deadline = `max(7, total_required_qty / deadline_qty_per_day)` business days.
`deadline_qty_per_day = 200` in both `challenge` and `fast_test`. With 10 employees and 5 focused per domain, team throughput ≈ 230 units/domain/day — achievable for up to ~4 simultaneous tasks.
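The throughput and deadline math above can be sketched as follows (helper names are assumed for illustration, not the package's internals):

```python
def effective_rate(base_rate: float, active_tasks: int) -> float:
    """Skill rate is split evenly across an employee's active tasks."""
    return base_rate / active_tasks

def completion_hours(remaining: dict, rates: dict) -> float:
    """A task finishes when its slowest required domain finishes."""
    return max(remaining[d] / rates[d] for d in remaining)

def deadline_biz_days(total_required_qty: float, qty_per_day: float = 200.0,
                      min_days: int = 7) -> float:
    """Deadline = max(7, total_required_qty / deadline_qty_per_day) business days."""
    return max(min_days, total_required_qty / qty_per_day)

# A senior at rate 8.0 split across 4 tasks works each task at 2.0 units/hr:
r = effective_rate(8.0, 4)
print(completion_hours({"backend": 400.0, "data": 90.0},
                       {"backend": r, "data": r}))  # 200.0 hours (backend is the bottleneck)
print(deadline_biz_days(1900))  # 9.5 business days
```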
---
## Agent interface
All commands return JSON to stdout.
### Observe
```bash
yc-bench company status # funds, prestige, runway, payroll
yc-bench employee list # skills, salary, active tasks
yc-bench market browse # available tasks (--limit N --offset N)
yc-bench task list [--status X] # planned|active|completed_*|cancelled
yc-bench task inspect --task-id UUID # progress %, deadline, assignments
yc-bench finance ledger # full transaction history
yc-bench report monthly # P&L per month
yc-bench scratchpad read # persistent notes (survives context truncation)
```
### Act
```bash
yc-bench task accept --task-id UUID # pull from market, set deadline
yc-bench task assign --task-id UUID --employee-id UUID
yc-bench task dispatch --task-id UUID # start work (≥1 assignment required)
yc-bench task cancel --task-id UUID --reason "" # 2× prestige penalty
yc-bench sim resume # advance to next event
yc-bench scratchpad write/append/clear # persistent memory
```
---
## Context management
- **Proactive truncation**: keeps the last 20 conversation rounds before each API call. Older rounds are dropped.
- **Scratchpad**: per-company persistent text in DB. Survives truncation. Use it to store strategy, deadlines, and employee assignments.
---
## Repository layout
```
YC_Bench/
├── src/ # Python package (yc_bench)
├── scripts/ # plot_multi_model.py, run_benchmark.sh
├── logs/ # per-model stdout/stderr logs
├── db/ # SQLite databases (one per model run)
├── results/ # JSON rollout files
├── plots/ # generated PNG charts
├── pyproject.toml
└── README.md
```
The benchmark tests whether agents can manage compounding decisions: prestige specialisation, employee allocation, cash flow, and deadline risk — sustained over hundreds of turns.
---
@ -194,8 +21,6 @@ cd YC_Bench
uv sync
```
No database setup required — the runner auto-creates `db/<config>_<seed>_<model>.db` on first run.
### API key
```bash
@ -206,7 +31,7 @@ OPENROUTER_API_KEY="sk-or-v1-..." # for openrouter/*
OPENAI_API_KEY="sk-..." # for openai/*
```
### Run a single model
### Run
```bash
uv run yc-bench run \
@ -215,65 +40,61 @@ uv run yc-bench run \
--config medium
```
Outputs:
- `db/medium_1_gemini_gemini-3-flash-preview.db` — SQLite simulation state
- `results/yc_bench_result_medium_1_gemini_gemini-3-flash-preview.json` — full rollout + transcript
Outputs a SQLite DB in `db/` and a JSON rollout in `results/`.
### Live dashboard
When running in a terminal, YC-Bench displays an interactive dashboard that updates in-place after each turn:
```
╭──────────────────────────── YC-Bench ────────────────────────────╮
│ Model claude-haiku-4-5-20251001 seed=1 medium │
│ Turn 8 │
│ Sim Date 2025-03-06 -> 2026-01-01 │
│ Elapsed 0h 02m 34s │
│ Funds $186,271.66 -$63,728 ██▇▃▁ │
│ Runway 5.8mo │
│ Tasks 3 active / 3 queued 2 done 1 fail │
│ Team 5 people $31,864.17/mo │
│ Cost $0.0212 (3.7s/turn) │
│ Action yc-bench task dispatch 7 │
│ Status >> Turn 9: waiting for LLM... │
╰──────────────────────────────────────────────────────────────────╯
╭──────────────────────────── Tasks ───────────────────────────────╮
│ >> Build GPU Cluster $64,152 2025-02-03 Research ==== Training ====== │
│ >> Deploy Observability $27,908 2025-01-22 Data ===... │
│ .. Blue-Green Deploy $30,780 2025-03-18 Backend ...... Data ...... │
╰──────────────────────────────────────────────────────────────────╯
╭──────────────────────────── Team ────────────────────────────────╮
│ Alice Chen $2,564 Training===. Frontend==.. Research=... │
│ Bob Martinez $14,947 Backend===. Research==.. Data==.. │
╰──────────────────────────────────────────────────────────────────╯
```
The dashboard shows:
- **Funds sparkline** — visual trend of your cash position over time
- **Color-coded progress bars** per domain on each task (green = done, yellow = partial, red = low)
- **Employee skill bars** — top 3 skills per team member with strength indicators
- **Runway urgency** — green (safe), yellow (low), red blinking (critical)
- **Salary heat** — expensive employees highlighted in red
To disable the dashboard and see raw log output instead:
```bash
uv run yc-bench run --model ... --seed 1 --config medium --no-live
```
When `--no-live` is set (or stdout is not a terminal, e.g. piped to a file), the original logging output is used. Debug logs from LiteLLM/httpx are written to `logs/debug.log` when the dashboard is active.
### Run 5 models in parallel
### Run multiple models in parallel
```bash
bash scripts/run_benchmark.sh --seed 1 --config challenge
```
### Generate the comparison plot
---
## How it works
![YC Bench Architecture](imgs/arch.png "Architecture YC-Bench")
### Core loop
1. Agent calls `yc-bench sim resume` to advance time to the next event.
2. The engine flushes task progress, fires due events, applies payroll.
3. Agent reads wake events and decides: accept tasks, assign employees, dispatch, cancel.
4. Repeat until bankruptcy or horizon end.
The simulation ends on **bankruptcy** (funds < 0 after payroll), **horizon end** (13 years), or **max turns** (if configured). If the agent doesn't call `sim resume` for 10 consecutive turns, the loop forces one automatically.
### Key mechanics
- **Funds**: start at $250K. Monthly payroll is deducted automatically. Task rewards scale with prestige (`base × (1 + 0.55 × (prestige - 1))`).
- **4 domains**: `research · inference · data/environment · training`. Each domain tracks prestige independently in [1.0, 10.0].
- **Prestige gating**: tasks require a minimum prestige level. Most tasks need prestige 3-5, so the agent must climb from 1.0 by completing easier tasks first. First 10 market tasks are stratified `[1,1,1,1,2,2,2,3,3,4]` to bootstrap progression.
- **Employees**: 10 employees across 3 tiers (junior/mid/senior). The agent sees only each employee's tier and salary — not their per-domain skill rates. A junior can secretly be a superstar in one domain, so the agent must infer productivity from task progress observations.
- **Throughput splitting**: an employee assigned to N active tasks has `effective_rate = base_rate / N`. Focus beats breadth.
- **Task success**: on-time completion awards funds + prestige + skill boosts + 1% salary bump (compounding payroll pressure). Late completion penalises prestige (1.4×). Cancellation penalises harder (2.0×).
- **Progress checkpoints**: the agent is woken at 25%, 50%, 75%, and 100% completion — providing data points to estimate employee productivity.
- **Scratchpad**: persistent notes in the DB that survive context truncation (only last 20 conversation rounds are kept).
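Since per-domain skill rates are hidden, the 25/50/75% checkpoints are the agent's main signal for productivity. A minimal estimation sketch (`infer_rate` is a hypothetical helper, assuming one employee working a single domain at a constant rate):

```python
def infer_rate(qty_required: float, pct_a: float, pct_b: float,
               work_hours_between: float) -> float:
    """Units/hr implied by moving from pct_a% to pct_b% completion."""
    return qty_required * (pct_b - pct_a) / 100 / work_hours_between

# Woken at 25% and again at 50% of a 1,000-unit requirement, 45 work-hours apart:
print(round(infer_rate(1000, 25, 50, 45), 2))  # 5.56 units/hr
```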
### Agent CLI
All commands return JSON. The agent interacts via `run_command("yc-bench <cmd>")`.
```bash
uv run python scripts/plot_multi_model.py --seed 1 --config challenge --budget 30
# → plots/funds_curves.png
# Observe
yc-bench company status # funds, prestige, runway
yc-bench employee list # tier, salary, active tasks
yc-bench market browse [--domain X] [--limit N] # available tasks
yc-bench task list [--status X] # your tasks
yc-bench task inspect --task-id UUID # progress, deadline, assignments
yc-bench finance ledger # transaction history
yc-bench report monthly # P&L per month
# Act
yc-bench task accept --task-id UUID # pull from market
yc-bench task assign --task-id UUID --employee-id UUID
yc-bench task dispatch --task-id UUID # start work
yc-bench task cancel --task-id UUID --reason "" # cancel (2× prestige penalty)
yc-bench sim resume # advance time
yc-bench scratchpad write/append/clear # persistent memory
```
---
@ -282,90 +103,15 @@ uv run python scripts/plot_multi_model.py --seed 1 --config challenge --budget 3
Experiment presets live in `src/yc_bench/config/presets/` as TOML files. Pass the preset name via `--config`.
```
src/yc_bench/config/presets/
├── default.toml # 3yr, 10 employees, 500 tasks — base config
├── tutorial.toml # 1yr, 3 employees, 50 tasks — learn the loop
├── easy.toml # 1yr, 5 employees, 100 tasks — throughput awareness
├── medium.toml # 1yr, 5 employees, 150 tasks — prestige strategy
├── hard.toml # 1yr, 7 employees, 200 tasks — precise ETA reasoning
├── nightmare.toml # 1yr, 8 employees, 300 tasks — sustained perfection
├── challenge.toml # 3yr, 5 employees, 200 tasks — long-horizon endurance
└── fast_test.toml # 1yr, 5 employees, 100 tasks — quick iteration
```
| Config | Employees | Tasks | Tests |
|--------|-----------|-------|-------|
| **tutorial** | 3 | 50 | Basic accept→assign→dispatch loop |
| **easy** | 5 | 100 | Throughput awareness |
| **medium** | 5 | 150 | Prestige climbing + domain specialization |
| **hard** | 7 | 200 | Precise ETA reasoning |
| **nightmare** | 8 | 300 | Sustained perfection under compounding payroll |
Each difficulty level tests one additional concept:
| Config | Tests | Key constraint |
|--------|-------|---------------|
| **tutorial** | Basic accept→assign→dispatch loop | All prestige-1, single domain |
| **easy** | Throughput awareness | Don't over-parallelize |
| **medium** | Prestige climbing + domain specialization | 2-domain tasks, prestige mode=3 |
| **hard** | Precise ETA computation | One bad accept degrades in-flight tasks |
| **nightmare** | Sustained perfection under compounding payroll | One failure ≈ fatal, salary bumps 2%/task |
### Key WorldConfig parameters
| Parameter | Default | Controls |
|-----------|---------|---------|
| `initial_funds_cents` | 25_000_000 | Starting cash ($250K) |
| `num_employees` | 5 | Workforce size |
| `num_market_tasks` | 100 | Market pool size |
| `required_prestige_mode` | 4 | Peak of prestige-req distribution |
| `domain_count_mode` | 2 | Most tasks require 2 domains |
| `required_qty_low/mode` | 500 / 1400 | Task work volume (units) |
| `deadline_qty_per_day` | 200 | Units completable per biz day (lower = easier) |
| `deadline_min_biz_days` | 7 | Minimum deadline |
| `penalty_fail_multiplier` | 1.4 | Prestige × this on deadline miss |
| `penalty_cancel_multiplier` | 2.0 | Prestige × this on cancel |
| `reward_prestige_scale` | 0.55 | Extra reward fraction per prestige level above 1 |
| `salary_bump_pct` | 0.01 | Salary raise per employee per completed task |
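The `salary_bump_pct` raise compounds per completed task, which is what drives long-run payroll creep. A sketch of the effect (assumed helper name; the engine applies the bump to each assigned employee on every success):

```python
def salary_after(salary_cents: int, completed_tasks: int,
                 salary_bump_pct: float = 0.01) -> int:
    """Each completed task raises an assigned employee's salary by 1%, compounding."""
    return round(salary_cents * (1 + salary_bump_pct) ** completed_tasks)

# A $10K/month senior who completes 40 assigned tasks ends up near $14.9K/month:
print(salary_after(1_000_000, 40))  # 1488864 cents ≈ $14,889/month
```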
### AgentConfig
| Parameter | Default | Controls |
|-----------|---------|---------|
| `model` | openrouter/openai/gpt-4o-mini | LLM model string |
| `temperature` | 0.0 | Sampling temperature |
| `history_keep_rounds` | 20 | Conversation rounds kept in context |
### LoopConfig
| Parameter | Default | Controls |
|-----------|---------|---------|
| `auto_advance_after_turns` | 5 | Force sim resume after N turns without one |
| `max_turns` | 50 | Hard cap on agent turns (null = unlimited) |
### Environment overrides
```bash
YC_BENCH_EXPERIMENT=fast_test # select preset
DATABASE_URL=sqlite:///custom.db # SQLite path
```
---
## Terminal conditions
| Condition | Trigger |
|-----------|---------|
| Horizon end | `sim_time >= start_date + horizon_years` |
| Bankruptcy | `funds_cents < 0` after any payroll |
| Error | Agent runtime exception (API failure, exhausted retries) |
| Max turns | `turn_count >= max_turns` (if set) |
---
## What makes it hard
The hardened default is designed so that the obvious strategies fail:
- **Prestige-1 farming** is unprofitable. Most replacement tasks need prestige 3-5 and pay much more. Farming the bottom locks you out.
- **Single-specialist dominance** is gone. Most tasks need 2 domains. You must allocate across skill combinations.
- **Speculative accepting** is punished. Cancel penalty (2×) exceeds fail penalty (1.4×) so you can't accept everything and drop the losers.
- **Ignoring payroll** causes bankruptcy. ~$32K/month burns your $250K in 7.8 months — but task complexity means you must also pace your accepts.
- **Parallel dispatch** dilutes throughput. Splitting employees across too many tasks extends every deadline — focus beats breadth.
- **Salary bumps compound**. Every task completion raises assigned employee salaries 1%. Payroll creep accelerates over time.
See `default.toml` for the full list of tunable parameters.
---
@ -375,15 +121,15 @@ The hardened default is designed so that the obvious strategies fail:
![3-model comparison](plots/sonnet_vs_gemini.png)
#### Survival rates (at end of year 1)
#### Survival rates
| Config | Sonnet 4.6 | Gemini 3 Flash | GPT-5.2 |
|--------|-----------|----------------|---------|
| **medium** | 3/3 survived | 3/3 survived | 3/3 survived |
| **hard** | 1/3 survived | 2/3 survived | 2/3 survived |
| **nightmare** | 1/3 survived | 3/3 survived | 2/3 survived |
| **medium** | 3/3 | 3/3 | 3/3 |
| **hard** | 1/3 | 2/3 | 2/3 |
| **nightmare** | 1/3 | 3/3 | 2/3 |
#### Final funds at 1-year mark (bankrupt = funds < 0)
#### Final funds (bankrupt = funds < 0)
| Config | Seed | Sonnet 4.6 | Gemini 3 Flash | GPT-5.2 |
|--------|------|-----------|----------------|---------|
@ -399,82 +145,21 @@ The hardened default is designed so that the obvious strategies fail:
**Overall: Gemini 8/9 · GPT-5.2 7/9 · Sonnet 5/9**
### Key findings
#### Key findings
**Gemini leads on consistency (8/9).** Near-perfect win rates on medium (93-98%), and the only model to sweep all 3 nightmare seeds. Achieves this without using the scratchpad — purely reactive, high-frequency decision-making.
- **Gemini leads on consistency** (8/9 survival). The only model to sweep all 3 nightmare seeds.
- **GPT-5.2 has the highest ceiling.** Hard seed 3: $43.5M vs Gemini's $21.9M. When it survives, it tends to outperform by a wide margin.
- **Sonnet is high-variance.** Nightmare seed 2: $10.1M (best nightmare result), but 4/9 bankruptcies overall.
- **Win rate predicts survival.** Every run with >58% task win rate survived. Every run below 40% went bankrupt.
**GPT-5.2 excels at hard (2/3, matching Gemini) with the highest absolute returns.** Hard seed 3: $43.5M vs Gemini's $21.9M. Nightmare seed 3: $23.6M vs Gemini's $805K. When GPT-5.2 survives, it tends to outperform by a significant margin.
**Sonnet has the highest ceiling when it works but the lowest floor.** Nightmare seed 2: $10.1M (best nightmare result). But 4/9 bankruptcies — Sonnet fails harder than the others on adverse seeds.
**Hard is the differentiator config.** On easy configs all three survive. On hard/nightmare the strategies diverge sharply. Gemini plays safe and consistent; GPT-5.2 swings big; Sonnet is high-variance.
**Win rate predicts survival.** Every run with >58% task win rate survived. Every run with <40% went bankrupt. Below that threshold, prestige losses from failures outpace gains and lock the agent out of profitable tasks.
### Prestige specialization
#### Prestige specialization
![Prestige radar](plots/prestige_radar.png)
Each radar shows final prestige across 7 domains (1 = center, 10 = edge). Large polygons = the model climbed prestige broadly. Tiny dots near center = bankrupt before gaining any prestige. Pointy shapes = domain specialization.
**Human Devised Rule** (navy dashed) consistently fills the full radar — it methodically maxes prestige everywhere. Among LLMs, **Gemini** builds the most balanced prestige profiles. **GPT-5.2** shows clear specialization on medium (backend/data/frontend high, training untouched). **Sonnet** is bimodal: either maxes everything (medium seed 1) or collapses entirely (nightmare seeds 1 & 3).
### Why models fail
The scratchpad evolution of Sonnet on hard seed 2 tells the full story:
![Sonnet hard seed 2 scratchpad evolution](plots/notepad_hard_2_claude-sonnet-4-6.gif)
Common failure patterns across all bankrupt runs:
1. **Over-parallelization.** Accepting 3-5 tasks at once, splitting employees across them. Effective rate per task drops below deadline requirements. Sonnet nightmare seed 3 ran 5 tasks simultaneously with 8 employees on turn 13.
2. **No prestige gating.** Accepting prestige-2 tasks when company prestige is 1.0. The task completes late, triggers a 1.4× prestige penalty, and the agent ends up worse than before.
3. **Late adaptation.** Sonnet correctly identifies problems in its scratchpad ("PRESTIGE CRISIS — MARKET LOCK") but only after payroll has consumed the runway. By turn 137 of hard seed 2, all tasks require prestige ≥ 2 but the company is stuck at 1.0 in 6 of 7 domains.
4. **Inconsistent ETA reasoning.** Sonnet's medium seed 2 has a 49% win rate — essentially a coin flip. It understands throughput math in its scratchpad but doesn't consistently apply it when selecting tasks.
---
## Simulation rules
Please cite our work if you find it useful!
- **Business time**: weekdays only, 09:00-18:00. No leap years.
- **Money**: stored as integer cents (`BIGINT`). No floating point.
- **Payroll**: fired on the first business day of each month.
- **Event ordering**: deterministic — `(scheduled_at, priority, id)`.
- **Determinism**: all task generation and employee seeding is reproducible given `--seed`.
- **Prestige**: `NUMERIC(6,3)`, hard clamped to `[1.0, 10.0]`.
- **DB reuse**: if a simulation is terminal (bankrupt or horizon reached), re-running with the same DB wipes and reseeds cleanly.
---
## Output format
`results/yc_bench_result_<config>_<seed>_<model>.json`:
```json
{
"session_id": "run-1-openrouter/openai/gpt-4o-mini",
"model": "openrouter/openai/gpt-4o-mini",
"seed": 1,
"horizon_years": 1,
"turns_completed": 46,
"terminal": true,
"terminal_reason": "bankruptcy",
"total_cost_usd": 0.100008,
"started_at": "...",
"ended_at": "...",
"transcript": [
{
"turn": 1,
"timestamp": "...",
"user_input": "## Simulation Start ...",
"agent_output": "Executed 3 tool call(s): ...",
"commands_executed": ["yc-bench company status -> {...}", ...]
}
]
}
```
Please cite our work if you find it useful and interesting!
```bibtex
@misc{collinear-ai2025ycbench,
author = {{Collinear AI}},

View file

@ -344,13 +344,12 @@ def run_bot(config_name: str, seed: int, bot_slug: str, strategy_fn: StrategyFn)
company_id=None,
status=TaskStatus.MARKET,
title=replacement.title,
description=replacement.description,
required_prestige=replacement.required_prestige,
reward_funds_cents=replacement.reward_funds_cents,
reward_prestige_delta=replacement.reward_prestige_delta,
skill_boost_pct=replacement.skill_boost_pct,
accepted_at=None, deadline=None, completed_at=None,
success=None, halfway_event_emitted=False,
success=None, progress_milestone_pct=0,
)
db.add(replacement_row)
for domain, qty in replacement.requirements.items():
@ -375,7 +374,7 @@ def run_bot(config_name: str, seed: int, bot_slug: str, strategy_fn: StrategyFn)
recalculate_etas(db, company_id, sim_state.sim_time,
impacted_task_ids={best_task.id},
half_threshold=world_cfg.task_half_threshold)
milestones=world_cfg.task_progress_milestones)
task_cycles_used += 1

View file

@ -50,8 +50,8 @@ CONFIGS = ["medium", "hard", "nightmare"]
SEEDS = [1, 2, 3]
DIFF_COLORS = {"medium": BLUE, "hard": ORANGE, "nightmare": "#DC2626"}
DOMAINS = ["system", "research", "data", "frontend", "backend", "training", "hardware"]
DOMAIN_LABELS = ["SYS", "RES", "DATA", "FE", "BE", "TRAIN", "HW"]
DOMAINS = ["research", "inference", "data_environment", "training"]
DOMAIN_LABELS = ["RES", "INF", "DATA/ENV", "TRAIN"]
def load_logo_image(height_px=80):

View file

@ -23,13 +23,10 @@ engine = build_engine()
factory = build_session_factory(engine)
DOMAIN_COLORS = {
"training": "#e67e22",
"research": "#3498db",
"backend": "#2ecc71",
"hardware": "#9b59b6",
"data": "#1abc9c",
"frontend": "#e74c3c",
"system": "#95a5a6",
"research": "#3498db",
"inference": "#9b59b6",
"data_environment": "#1abc9c",
"training": "#e67e22",
}
with session_scope(factory) as db:

View file

@ -18,7 +18,7 @@ Your goal is to maximize company prestige and funds over the simulation horizon
### Observe
- `yc-bench company status` funds, prestige, employee count, payroll, bankruptcy risk
- `yc-bench employee list` list all employees with IDs, salaries, skill rates, and current assignments
- `yc-bench employee list` list all employees with IDs, tier (junior/mid/senior), salaries, and current assignments
- `yc-bench market browse [--domain X] [--required-prestige-lte N] [--reward-min-cents N] [--limit N] [--offset N]` browse available tasks (default limit 50; the response includes a `total` field if total > 50, paginate with --offset to see more)
- `yc-bench task list [--status X]` list your tasks (planned, active, completed, cancelled)
- `yc-bench task inspect --task-id <UUID>` detailed task info (requirements, assignments, progress)
@ -106,7 +106,8 @@ def build_turn_context(
tid = ev.get("task_id", "?")
parts.append(f"- Task {tid}: {'SUCCESS' if success else 'FAILED'}")
elif ev_type == "task_half":
parts.append(f"- Task {ev.get('task_id', '?')}: 50% progress reached")
pct = ev.get("milestone_pct", "?")
parts.append(f"- Task {ev.get('task_id', '?')}: {pct}% progress reached")
elif ev_type == "horizon_end":
parts.append("- **Horizon end reached. Simulation complete.**")
elif ev_type == "bankruptcy":

View file

@ -9,7 +9,7 @@ from uuid import UUID
import typer
from ..db.session import build_engine, build_session_factory, session_scope
from ..db.session import build_engine, build_session_factory, init_db, session_scope
app = typer.Typer(name="yc-bench", add_completion=False)
@ -22,6 +22,7 @@ app = typer.Typer(name="yc-bench", add_completion=False)
def get_db():
"""Yield a transactional SQLAlchemy session, commit on success."""
engine = build_engine()
init_db(engine)
factory = build_session_factory(engine)
with session_scope(factory) as session:
yield session

View file

@ -3,7 +3,7 @@ from __future__ import annotations
import typer
from sqlalchemy import func
from ..db.models.employee import Employee, EmployeeSkillRate
from ..db.models.employee import Employee
from ..db.models.task import Task, TaskAssignment, TaskStatus
from ..db.models.sim_state import SimState
from . import get_db, json_output, error_output
@ -25,15 +25,6 @@ def employee_list():
results = []
for emp in employees:
# Skills
skills = db.query(EmployeeSkillRate).filter(
EmployeeSkillRate.employee_id == emp.id
).all()
skill_map = {
s.domain.value: float(s.rate_domain_per_hour)
for s in skills
}
# Current active assignments
active_assignments = (
db.query(TaskAssignment.task_id)
@ -49,9 +40,9 @@ def employee_list():
results.append({
"employee_id": str(emp.id),
"name": emp.name,
"tier": emp.tier,
"salary_cents": emp.salary_cents,
"work_hours_per_day": float(emp.work_hours_per_day),
"skills": skill_map,
"active_task_count": len(active_task_ids),
"active_task_ids": active_task_ids,
})

View file

@ -58,7 +58,6 @@ def market_browse(
results.append({
"task_id": str(task.id),
"title": task.title,
"description": task.description,
"required_prestige": task.required_prestige,
"reward_funds_cents": task.reward_funds_cents,
"reward_prestige_delta": float(task.reward_prestige_delta),

View file

@ -95,7 +95,6 @@ def task_accept(
company_id=None,
status=TaskStatus.MARKET,
title=replacement.title,
description=replacement.description,
required_prestige=replacement.required_prestige,
reward_funds_cents=replacement.reward_funds_cents,
reward_prestige_delta=replacement.reward_prestige_delta,
@ -104,7 +103,7 @@ def task_accept(
deadline=None,
completed_at=None,
success=None,
halfway_event_emitted=False,
progress_milestone_pct=0,
)
db.add(replacement_row)
@ -185,7 +184,7 @@ def task_assign(
if t and t.status == TaskStatus.ACTIVE:
impacted.add(t.id)
if impacted:
recalculate_etas(db, sim_state.company_id, sim_state.sim_time, impacted, half_threshold=_get_world_cfg().task_half_threshold)
recalculate_etas(db, sim_state.company_id, sim_state.sim_time, impacted, milestones=_get_world_cfg().task_progress_milestones)
# Return current assignment list
assignments = db.query(TaskAssignment).filter(TaskAssignment.task_id == tid).all()
@ -251,7 +250,7 @@ def task_dispatch(
peer_task = db.query(Task).filter(Task.id == pa.task_id).one_or_none()
if peer_task and peer_task.status == TaskStatus.ACTIVE:
impacted.add(peer_task.id)
recalculate_etas(db, sim_state.company_id, sim_state.sim_time, impacted, half_threshold=_get_world_cfg().task_half_threshold)
recalculate_etas(db, sim_state.company_id, sim_state.sim_time, impacted, milestones=_get_world_cfg().task_progress_milestones)
json_output({
"task_id": str(task.id),
@ -353,7 +352,6 @@ def task_inspect(
json_output({
"task_id": str(task.id),
"title": task.title,
"description": task.description,
"status": task.status.value,
"required_prestige": task.required_prestige,
"reward_funds_cents": task.reward_funds_cents,
@ -442,7 +440,7 @@ def task_cancel(
if t and t.status == TaskStatus.ACTIVE:
impacted.add(t.id)
if impacted:
recalculate_etas(db, sim_state.company_id, sim_state.sim_time, impacted, half_threshold=_get_world_cfg().task_half_threshold)
recalculate_etas(db, sim_state.company_id, sim_state.sim_time, impacted, milestones=_get_world_cfg().task_progress_milestones)
# Bankruptcy check
company = db.query(Company).filter(Company.id == sim_state.company_id).one()

View file

@ -82,8 +82,8 @@ reward_prestige_scale = 0.55 # hardened: was 0.3
deadline_qty_per_day = 320.0 # hardened: was 200.0
deadline_min_biz_days = 7
# --- Progress milestone ---
task_half_threshold = 0.5
# --- Progress milestones (checkpoint events at these completion fractions) ---
task_progress_milestones = [0.25, 0.5, 0.75]
# --- Business hours ---
workday_start_hour = 9
@ -161,20 +161,20 @@ share = 0.50
min_cents = 200_000 # $2,000/month
max_cents = 400_000 # $4,000/month
rate_min = 1.0 # units/hour
rate_max = 6.5
rate_max = 4.0
[world.salary_mid]
name = "mid"
share = 0.35
min_cents = 600_000 # $6,000/month
max_cents = 800_000 # $8,000/month
rate_min = 3.5
rate_max = 8.5
rate_min = 4.0
rate_max = 7.0
[world.salary_senior]
name = "senior"
share = 0.15
min_cents = 1_000_000 # $10,000/month
max_cents = 1_500_000 # $15,000/month
rate_min = 5.5
rate_min = 7.0
rate_max = 10.0

View file

@ -128,8 +128,8 @@ class WorldConfig(BaseModel):
deadline_qty_per_day: float = 200.0 # work units assumed completable per business day
deadline_min_biz_days: int = 7
# --- Progress milestone ---
task_half_threshold: float = 0.5
# --- Progress milestones (fraction thresholds that trigger checkpoint events) ---
task_progress_milestones: list[float] = Field(default_factory=lambda: [0.25, 0.5, 0.75])
# --- Business hours ---
workday_start_hour: int = 9
@ -143,21 +143,21 @@ class WorldConfig(BaseModel):
default_factory=lambda: SalaryTierConfig(
name="junior", share=0.50,
min_cents=200_000, max_cents=400_000,
rate_min=1.0, rate_max=6.5,
rate_min=1.0, rate_max=4.0,
)
)
salary_mid: SalaryTierConfig = Field(
default_factory=lambda: SalaryTierConfig(
name="mid", share=0.35,
min_cents=600_000, max_cents=800_000,
rate_min=3.5, rate_max=8.5,
rate_min=4.0, rate_max=7.0,
)
)
salary_senior: SalaryTierConfig = Field(
default_factory=lambda: SalaryTierConfig(
name="senior", share=0.15,
min_cents=1_000_000, max_cents=1_500_000,
rate_min=5.5, rate_max=10.0,
rate_min=7.0, rate_max=10.0,
)
)

View file

@ -74,13 +74,16 @@ def dispatch_event(db: Session, event: SimEvent, sim_time: datetime, company_id:
"""Route event to appropriate handler. Returns result dict."""
if event.event_type == EventType.TASK_HALF_PROGRESS:
result = handle_task_half(db, event)
return {"type": "task_half", "task_id": str(result.task_id), "handled": result.handled}
# Recalculate ETAs so the next milestone is scheduled
from ..config import get_world_config
recalculate_etas(db, company_id, sim_time, milestones=get_world_config().task_progress_milestones)
return {"type": "task_half", "task_id": str(result.task_id), "milestone_pct": result.milestone_pct, "handled": result.handled}
elif event.event_type == EventType.TASK_COMPLETED:
result = handle_task_complete(db, event, sim_time)
# Recalculate ETAs — freed employees change topology
from ..config import get_world_config
recalculate_etas(db, company_id, sim_time, half_threshold=get_world_config().task_half_threshold)
recalculate_etas(db, company_id, sim_time, milestones=get_world_config().task_progress_milestones)
return {
"type": "task_completed",
"task_id": str(result.task_id),


@@ -185,15 +185,20 @@ def recalculate_etas(
company_id: UUID,
now: datetime,
impacted_task_ids: Optional[Set[UUID]] = None,
milestones: Optional[List[float]] = None,
# Legacy single-threshold parameter — ignored if milestones is provided.
half_threshold: float = 0.5,
) -> None:
"""Recalculate projection events for active tasks.
1. Delete stale projection events for impacted tasks (or all if None).
2. Compute effective rates.
3. For each active task, solve completion and halfway times.
3. For each active task, solve completion and milestone times.
4. Insert new projection events.
"""
if milestones is None:
milestones = [half_threshold]
# Determine which tasks to recalculate
if impacted_task_ids is None:
active_tasks = db.query(Task).filter(
@@ -240,18 +245,26 @@ def recalculate_etas(
dedupe_key=f"task:{tid}:completed",
)
# Halfway ETA (only if not already emitted)
if not task.halfway_event_emitted:
halfway_time = solve_task_halfway_time(db, tid, now, rates, half_threshold=half_threshold)
if halfway_time is not None:
# Progress milestone ETAs — skip milestones already emitted
emitted_pct = task.progress_milestone_pct or 0
for milestone in sorted(milestones):
milestone_pct = int(milestone * 100)
if milestone_pct <= emitted_pct:
continue
milestone_time = solve_task_halfway_time(db, tid, now, rates, half_threshold=milestone)
if milestone_time is not None:
insert_event(
db,
company_id=company_id,
event_type=EventType.TASK_HALF_PROGRESS,
scheduled_at=halfway_time,
payload={"task_id": str(tid)},
dedupe_key=f"task:{tid}:half",
scheduled_at=milestone_time,
payload={"task_id": str(tid), "milestone_pct": milestone_pct},
dedupe_key=f"task:{tid}:milestone:{milestone_pct}",
)
# Only insert the next upcoming milestone — it will be the
# earliest event; once consumed, recalculate_etas runs again
# and inserts the following one.
break
db.flush()


@@ -1,4 +1,4 @@
"""Handler for task_half_progress events."""
"""Handler for task progress milestone events."""
from __future__ import annotations
from dataclasses import dataclass
@@ -14,17 +14,19 @@ from ...db.models.task import Task
class TaskHalfResult:
task_id: UUID
handled: bool
milestone_pct: int
def handle_task_half(db: Session, event: SimEvent) -> TaskHalfResult:
"""Mark the task's halfway_event_emitted flag as True."""
"""Record the progress milestone on the task."""
task_id = UUID(event.payload["task_id"])
milestone_pct = event.payload.get("milestone_pct", 50)
task = db.query(Task).filter(Task.id == task_id).one_or_none()
if task is None:
return TaskHalfResult(task_id=task_id, handled=False)
return TaskHalfResult(task_id=task_id, handled=False, milestone_pct=milestone_pct)
task.halfway_event_emitted = True
task.progress_milestone_pct = max(task.progress_milestone_pct or 0, milestone_pct)
db.flush()
return TaskHalfResult(task_id=task_id, handled=True)
return TaskHalfResult(task_id=task_id, handled=True, milestone_pct=milestone_pct)


@@ -10,13 +10,10 @@ from sqlalchemy.orm import mapped_column
from ..base import Base
class Domain(str, Enum):
SYSTEM = "system"
RESEARCH = "research"
DATA = "data"
FRONTEND = "frontend"
BACKEND = "backend"
INFERENCE = "inference"
DATA_ENVIRONMENT = "data_environment"
TRAINING = "training"
HARDWARE = "hardware"
class Company(Base):
__tablename__ = "companies"


@@ -30,6 +30,11 @@ class Employee(Base):
String(255),
nullable=False,
)
tier = mapped_column(
String(20),
nullable=False,
default="junior",
)
work_hours_per_day = mapped_column(
Numeric(5, 2),
nullable=False,


@@ -45,10 +45,6 @@ class Task(Base):
String(255),
nullable=False,
)
description = mapped_column(
String,
nullable=False,
)
required_prestige = mapped_column(
Integer,
nullable=False,
@@ -81,11 +77,11 @@
Boolean,
nullable=True,
)
halfway_event_emitted = mapped_column(
Boolean,
progress_milestone_pct = mapped_column(
Integer,
nullable=False,
default=False,
server_default=text("false"),
default=0,
server_default=text("0"),
)
class TaskRequirement(Base):


@@ -18,13 +18,10 @@ SPARK_CHARS = "▁▂▃▄▅▆▇█"
# Domain → (display name, color) for styled inline display
DOMAIN_STYLE = {
"system": ("System", "bright_cyan"),
"research": ("Research", "bright_magenta"),
"data": ("Data", "bright_blue"),
"frontend": ("Frontend", "bright_yellow"),
"backend": ("Backend", "bright_green"),
"training": ("Training", "red"),
"hardware": ("Hardware", "white"),
"research": ("Research", "bright_magenta"),
"inference": ("Inference", "bright_cyan"),
"data_environment": ("Data/Env", "bright_blue"),
"training": ("Training", "red"),
}
@@ -132,7 +129,7 @@ def _query_detailed_snapshot(db_factory, company_id) -> dict[str, Any]:
]
deadline_str = t.deadline.strftime("%Y-%m-%d") if t.deadline else "-"
tasks_detail.append(TaskInfo(
title=t.title,
title=t.title[:20],
status=status.value,
prestige=t.required_prestige,
reward_dollars=t.reward_funds_cents / 100.0,


@@ -1,5 +1,6 @@
from __future__ import annotations
import math
from dataclasses import dataclass
from ..config.schema import WorldConfig
@@ -7,6 +8,18 @@ from ..db.models.company import Domain
from .rng import RngStreams, sample_right_skew_triangular_int
_ALL_DOMAINS = list(Domain)
_NUM_DOMAINS = len(_ALL_DOMAINS)
# Fixed tier composition for a 10-person startup.
# Repeated to cover any employee count via modular indexing.
_TIER_SEQUENCE = [
"junior", "junior", "junior", "junior", "junior",
"mid", "mid", "mid",
"senior", "senior",
]
_MIN_RATE = 1.0
_MAX_RATE = 10.0
@dataclass(frozen=True)
@@ -22,16 +35,6 @@ def _salary_tiers(cfg):
return (cfg.salary_junior, cfg.salary_mid, cfg.salary_senior)
def _pick_tier_name(rng, cfg):
x = rng.random()
acc = 0.0
for tier in _salary_tiers(cfg):
acc += tier.share
if acc >= x:
return tier.name
return _salary_tiers(cfg)[-1].name
def _tier_by_name(cfg, tier_name):
for tier in _salary_tiers(cfg):
if tier.name == tier_name:
@@ -44,10 +47,49 @@ def _sample_salary_cents(rng, cfg, tier_name):
return sample_right_skew_triangular_int(rng, tier.min_cents, tier.max_cents)
def _sample_rates_by_domain(rng, cfg, tier_name):
tier = _tier_by_name(cfg, tier_name)
lo, hi = tier.rate_min, tier.rate_max
return {domain: round(rng.uniform(lo, hi), 4) for domain in _ALL_DOMAINS}
def _dirichlet_sample(rng, alpha, k):
"""Sample from Dirichlet(alpha, ..., alpha) with k components."""
raw = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
total = sum(raw)
if total == 0:
return [1.0 / k] * k
return [x / total for x in raw]
def _distribute_rates(rng, avg_rate, dirichlet_alpha=0.3):
"""Distribute a rate budget across domains with spiky concentration.
Each domain gets at least _MIN_RATE. The extra budget is split via
Dirichlet(alpha) so that one or two domains can be dramatically higher
than the rest: a junior can secretly be a superstar in one domain.
Individual rates are capped at _MAX_RATE.
"""
total_budget = avg_rate * _NUM_DOMAINS
extra = total_budget - _NUM_DOMAINS * _MIN_RATE
if extra <= 0:
return [_MIN_RATE] * _NUM_DOMAINS
proportions = _dirichlet_sample(rng, dirichlet_alpha, _NUM_DOMAINS)
rates = [_MIN_RATE + extra * p for p in proportions]
# Cap at _MAX_RATE and redistribute excess iteratively.
for _ in range(5):
overflow = 0.0
uncapped = []
for i in range(_NUM_DOMAINS):
if rates[i] > _MAX_RATE:
overflow += rates[i] - _MAX_RATE
rates[i] = _MAX_RATE
else:
uncapped.append(i)
if overflow <= 0 or not uncapped:
break
share = overflow / len(uncapped)
for i in uncapped:
rates[i] += share
return [round(r, 4) for r in rates]
def generate_employees(*, run_seed, count, cfg=None):
@@ -56,12 +98,25 @@ def generate_employees(*, run_seed, count, cfg=None):
if count <= 0:
return []
employees = []
streams = RngStreams(run_seed)
# Build and shuffle tier assignments.
tier_rng = streams.stream("tier_assignment")
seq_len = len(_TIER_SEQUENCE)
tiers = [_TIER_SEQUENCE[i % seq_len] for i in range(count)]
tier_rng.shuffle(tiers)
employees = []
for idx in range(1, count + 1):
rng = streams.stream(f"employee_{idx}")
tier_name = _pick_tier_name(rng, cfg)
tier_name = tiers[idx - 1]
tier_cfg = _tier_by_name(cfg, tier_name)
# Sample average rate uniformly within the tier's range.
avg_rate = rng.uniform(tier_cfg.rate_min, tier_cfg.rate_max)
domain_rates = _distribute_rates(rng, avg_rate)
rates = dict(zip(_ALL_DOMAINS, domain_rates))
employees.append(
GeneratedEmployee(
@@ -69,7 +124,7 @@
work_hours_per_day=cfg.work_hours_per_day,
salary_cents=_sample_salary_cents(rng, cfg, tier_name),
tier=tier_name,
rates_by_domain=_sample_rates_by_domain(rng, cfg, tier_name),
rates_by_domain=rates,
)
)
return employees
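The two generation changes above (fixed tier composition via modular indexing plus shuffle, and a Dirichlet split of the rate budget) can be re-sketched standalone. The constants and helper names below are abbreviations for illustration, not the module's real identifiers, and Domain keys are elided:

```python
import random

MIN_RATE, MAX_RATE, NUM_DOMAINS = 1.0, 10.0, 4
TIER_SEQUENCE = ["junior"] * 5 + ["mid"] * 3 + ["senior"] * 2  # fixed 10-person mix

def assign_tiers(rng: random.Random, count: int) -> list[str]:
    """Repeat the fixed composition via modular indexing, then shuffle in place."""
    tiers = [TIER_SEQUENCE[i % len(TIER_SEQUENCE)] for i in range(count)]
    rng.shuffle(tiers)  # order is random, but the tier counts are fixed
    return tiers

def distribute_rates(rng: random.Random, avg_rate: float, alpha: float = 0.3) -> list[float]:
    """Split avg_rate * NUM_DOMAINS across domains: every domain keeps the
    MIN_RATE floor, and the remaining budget is divided by a Dirichlet(alpha)
    draw; a low alpha concentrates most of it in one or two domains."""
    extra = avg_rate * NUM_DOMAINS - NUM_DOMAINS * MIN_RATE
    if extra <= 0:
        return [MIN_RATE] * NUM_DOMAINS
    raw = [rng.gammavariate(alpha, 1.0) for _ in range(NUM_DOMAINS)]
    total = sum(raw) or 1.0  # guard against an all-zero draw
    rates = [MIN_RATE + extra * (x / total) for x in raw]
    for _ in range(5):  # cap at MAX_RATE and redistribute the overflow
        overflow = sum(r - MAX_RATE for r in rates if r > MAX_RATE)
        uncapped = [i for i, r in enumerate(rates) if r <= MAX_RATE]
        rates = [min(r, MAX_RATE) for r in rates]
        if overflow <= 0 or not uncapped:
            break
        for i in uncapped:
            rates[i] += overflow / len(uncapped)
    return [round(r, 4) for r in rates]
```

For `count=20` the shuffle preserves a 10/6/4 junior/mid/senior split, and for an `avg_rate` of 3.0 the four rates still sum to the 12-unit budget while one domain usually captures most of the extra.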


@@ -8,13 +8,11 @@ from ..config.sampling import sample_from_spec
from ..config.schema import WorldConfig
from ..db.models.company import Domain
from .rng import RngStreams, sample_without_replacement
from .task_catalog import pick_task_text
@dataclass(frozen=True)
class GeneratedTask:
title: str
description: str
required_prestige: int
reward_funds_cents: int
reward_prestige_delta: float
@@ -25,7 +23,7 @@ class GeneratedTask:
deadline: datetime | None
completed_at: datetime | None
success: bool | None
halfway_event_emitted: bool
progress_milestone_pct: int
requirements: dict[str, int]
@@ -71,18 +69,9 @@ def _sample_requirements(rng, cfg):
return {domain: _sample_required_qty(rng, cfg) for domain in picked_domains}
def _pick_title_desc(rng, primary_domain, serial):
title, description = pick_task_text(rng, primary_domain)
domain_str = primary_domain.value if hasattr(primary_domain, "value") else str(primary_domain)
title = f"{title} [{domain_str.upper()}-{serial}]"
return title, description
def _make_task(rng, cfg, prestige, serial, requirements):
title, description = _pick_title_desc(rng, next(iter(requirements)), serial)
return GeneratedTask(
title=title,
description=description,
title=f"Task-{serial}",
required_prestige=prestige,
reward_funds_cents=_sample_reward_funds_cents(rng, cfg, prestige=prestige),
reward_prestige_delta=_sample_reward_prestige_delta(rng, cfg),
@@ -93,7 +82,7 @@ def _make_task(rng, cfg, prestige, serial, requirements):
deadline=None,
completed_at=None,
success=None,
halfway_event_emitted=False,
progress_milestone_pct=0,
requirements=requirements,
)
@@ -122,7 +111,6 @@ def build_task_rows(*, run_seed, count, cfg=None):
for task in generated:
task_rows.append({
"title": task.title,
"description": task.description,
"required_prestige": task.required_prestige,
"reward_funds_cents": task.reward_funds_cents,
"reward_prestige_delta": task.reward_prestige_delta,
@@ -133,7 +121,7 @@
"deadline": task.deadline,
"completed_at": task.completed_at,
"success": task.success,
"halfway_event_emitted": task.halfway_event_emitted,
"progress_milestone_pct": task.progress_milestone_pct,
})
for domain, qty in task.requirements.items():
requirement_rows.append({


@@ -63,6 +63,7 @@ def _seed_employees(db, company, req):
id=uuid4(),
company_id=company.id,
name=emp.name,
tier=emp.tier,
work_hours_per_day=emp.work_hours_per_day,
salary_cents=emp.salary_cents,
)
@@ -86,7 +87,6 @@ def _seed_market_tasks(db, company, req):
company_id=None,
status=TaskStatus.MARKET,
title=task.title,
description=task.description,
required_prestige=task.required_prestige,
reward_funds_cents=task.reward_funds_cents,
reward_prestige_delta=task.reward_prestige_delta,
@@ -95,7 +95,7 @@
deadline=None,
completed_at=None,
success=None,
halfway_event_emitted=False,
progress_milestone_pct=0,
)
db.add(task_row)


@@ -1,365 +0,0 @@
"""Realistic AI-startup task titles and descriptions, keyed by domain.
Each domain has a pool of (title, description) tuples. The generator picks
from these deterministically using the seeded RNG, cycling if the pool is
exhausted.
"""
from __future__ import annotations
from ..db.models.company import Domain
TASK_POOL: dict[Domain, list[tuple[str, str]]] = {
Domain.SYSTEM: [
(
"Set Up GPU-Aware K8s Cluster with Auto-Scaling",
"Deploy a Kubernetes cluster with NVIDIA GPU operator, node auto-scaling based on inference queue depth, and spot instance fallback for training workloads.",
),
(
"Build CI/CD Pipeline for ML Model Registry",
"Create a CI pipeline that runs training validation, pushes versioned model artifacts to a registry, and auto-deploys to a staging inference endpoint.",
),
(
"Implement Blue-Green Deployment for LLM Serving",
"Set up zero-downtime model swaps for a vLLM serving cluster with automated rollback triggered by latency and error-rate thresholds.",
),
(
"Deploy Observability Stack for AI Workloads",
"Stand up Grafana, Prometheus, and OpenTelemetry with custom dashboards tracking GPU utilization, token throughput, time-to-first-token, and per-request cost.",
),
(
"Terraform Multi-Region Inference Infrastructure",
"Write IaC modules to provision inference endpoints across 3+ regions with global load balancing, failover routing, and centralized logging.",
),
(
"Container Image Optimization for ML Serving",
"Reduce Docker image sizes for PyTorch/CUDA serving containers from 15 GB to under 4 GB using multi-stage builds and distroless bases to cut cold-start times.",
),
(
"Implement Secret Rotation and API Key Management",
"Build an automated secret rotation system for API keys, database credentials, and model provider tokens across staging and production environments.",
),
(
"Set Up Cost Monitoring and GPU Budget Alerts",
"Integrate cloud billing APIs with a dashboard showing per-team GPU spend, cost-per-inference breakdowns, and automated alerts when daily spend exceeds thresholds.",
),
(
"Build Canary Release Pipeline for Embedding Models",
"Implement a canary deployment system that gradually shifts traffic to new embedding model versions, comparing retrieval quality metrics in real time.",
),
(
"Migrate Inference Workloads to Serverless GPU",
"Evaluate and migrate bursty inference workloads to serverless GPU providers, benchmarking cold-start latency against always-on instances.",
),
(
"Implement Disaster Recovery for Training Checkpoints",
"Design a cross-region checkpoint backup system with automated integrity verification, ensuring training runs can resume within 15 minutes of any single-region failure.",
),
(
"Build Internal Developer Platform for ML Engineers",
"Create a self-service portal where ML engineers can request GPU instances, spin up Jupyter environments, and launch training jobs without touching infrastructure.",
),
],
Domain.RESEARCH: [
(
"Design Benchmark for Legal Document QA",
"Create a benchmark suite of 2,000+ annotated legal questions across contract law and compliance, with human-expert baselines and an automated evaluation harness.",
),
(
"Investigate MoE Routing for Multilingual Models",
"Research and prototype alternative Mixture-of-Experts routing strategies that improve expert utilization for low-resource languages without degrading high-resource performance.",
),
(
"Reproduce and Extend Speculative Decoding Results",
"Replicate speculative decoding paper results on Llama-3 class models, then test novel draft model architectures that improve acceptance rates on code generation.",
),
(
"Develop RAG Hallucination Detection Framework",
"Build a systematic evaluation pipeline measuring faithfulness, relevance, and attribution accuracy for retrieval-augmented generation systems.",
),
(
"Prototype LoRA Merging for Multi-Tenant Serving",
"Research methods for dynamically composing multiple LoRA adapters at inference time, measuring quality degradation versus serving separate fine-tuned models.",
),
(
"Benchmark Long-Context Retrieval Across 128K Models",
"Systematically evaluate needle-in-a-haystack and multi-hop reasoning performance across frontier models at various context lengths with reproducible results.",
),
(
"Investigate Synthetic Data Quality for Code Generation",
"Develop automated quality scoring methods for synthetically generated code training data, correlating filter thresholds with downstream model performance.",
),
(
"Research KV-Cache Compression Techniques",
"Prototype and benchmark KV-cache eviction and quantization strategies for long-running conversational agents under fixed memory budgets.",
),
(
"Build Ablation Study Framework for Prompt Engineering",
"Create an experimentation harness for testing prompt variations across multiple models and tasks with statistical significance testing and cost tracking.",
),
(
"Explore Constitutional AI for Domain-Specific Safety",
"Adapt constitutional AI methods to create a self-improving safety filter for a healthcare chatbot, defining domain-specific principles and measuring accuracy.",
),
(
"Develop Novel Chunking Strategies for Technical RAG",
"Research and benchmark alternative document chunking methods—semantic, AST-aware, sliding window—specifically for API documentation and code repositories.",
),
(
"Prototype Test-Time Compute Scaling for Math Reasoning",
"Implement best-of-N sampling, tree search, and self-verification approaches for math reasoning, measuring the compute-accuracy Pareto frontier.",
),
],
Domain.DATA: [
(
"Build Web Scraping Pipeline for Industry News Corpus",
"Design a pipeline that crawls 50+ AI/tech news sources daily, deduplicates articles, extracts structured metadata, and loads clean text into a vector store.",
),
(
"Create Annotation Platform for Dialogue Quality",
"Build an annotation workflow where human raters score LLM conversation logs on helpfulness, accuracy, and safety, with inter-rater agreement tracking.",
),
(
"Implement PII Detection and Redaction Pipeline",
"Deploy a pipeline to detect and redact personally identifiable information from training data, with audit logging and configurable redaction strategies.",
),
(
"Curate Instruction-Tuning Dataset from Internal Docs",
"Extract, clean, and convert 10,000+ pages of internal documentation into high-quality instruction-response pairs suitable for fine-tuning.",
),
(
"Build Data Quality Monitoring for Feature Store",
"Implement data validation checks on streaming feature pipelines, alerting on schema drift, null-rate spikes, and distribution shifts before they affect models.",
),
(
"Design ETL Pipeline for Multi-Modal Training Data",
"Build a DAG pipeline that ingests images, PDFs, and structured data, applies OCR and layout detection, and produces unified records for vision-language training.",
),
(
"Implement Deduplication for Large Text Corpora",
"Deploy MinHash LSH-based near-deduplication at scale for 100M+ documents with configurable similarity thresholds and a review UI for borderline cases.",
),
(
"Build Synthetic Data Pipeline for Rare Edge Cases",
"Create a system that uses frontier LLMs to generate realistic synthetic examples for underrepresented categories in a classification dataset.",
),
(
"Create Data Versioning and Lineage Tracking System",
"Set up data versioning integrated with the ML training pipeline so every model checkpoint can be traced back to the exact dataset snapshot used.",
),
(
"Build Customer Feedback Loop into Training Pipeline",
"Implement a system where end-user thumbs-up/down signals are routed, reviewed, and selectively incorporated into fine-tuning datasets with human approval.",
),
(
"Migrate Legacy Warehouse to ML-Ready Lakehouse",
"Transform and migrate 5 years of product analytics data from a legacy SQL warehouse into a Parquet-based lakehouse optimized for feature engineering.",
),
],
Domain.FRONTEND: [
(
"Build Interactive LLM Playground with Streaming",
"Create a web app where users test multiple LLM providers side-by-side with streaming output, adjustable parameters, and conversation history persistence.",
),
(
"Design Admin Dashboard for AI Agent Monitoring",
"Build a dashboard showing real-time agent execution traces, tool call sequences, token usage graphs, and cost breakdowns with drill-down filtering.",
),
(
"Create Document Chat Interface for RAG Product",
"Implement a drag-and-drop document upload UI with a conversational interface showing source citations, confidence indicators, and reference highlighting.",
),
(
"Build Annotation Review and Approval Interface",
"Design a UI for data team leads to review annotator work, resolve disagreements, view agreement stats, and approve batches for training inclusion.",
),
(
"Implement Prompt Management Studio",
"Build a collaborative app where teams version, test, and A/B deploy prompt templates with visual diffs, rollback, and per-version performance analytics.",
),
(
"Create Customer-Facing AI Usage Analytics Dashboard",
"Build an embeddable dashboard showing API call volumes, latency percentiles, token consumption, and cost trends for enterprise customers.",
),
(
"Build Visual Pipeline Editor for No-Code AI Workflows",
"Create a node-based drag-and-drop editor where non-technical users chain data sources, LLM calls, and output actions into automated AI workflows.",
),
(
"Design Chat Widget for Website Embedding",
"Build a lightweight, brandable chat widget under 50 KB that customers embed on their sites, with streaming responses and escalation-to-human capability.",
),
(
"Build Model Comparison Results Viewer",
"Create a web interface displaying benchmark results across models in interactive tables and charts with filtering by task type and model size.",
),
(
"Implement Real-Time Collaboration for AI Writing Tool",
"Add multiplayer editing to an AI writing tool using CRDTs, with per-user cursors, AI suggestion tracking, and version history.",
),
(
"Create Enterprise RAG Onboarding Wizard",
"Build a step-by-step setup wizard guiding enterprise customers through connecting data sources, configuring chunking, testing retrieval, and deploying their endpoint.",
),
],
Domain.BACKEND: [
(
"Build Multi-Tenant LLM Gateway with Rate Limiting",
"Implement an API gateway that proxies requests to multiple LLM providers, enforces per-tenant rate limits, tracks usage, and handles automatic failover.",
),
(
"Implement OAuth2 + SAML SSO for Enterprise Platform",
"Add enterprise authentication supporting SAML 2.0, OIDC, and SCIM provisioning for customers integrating with their identity provider.",
),
(
"Design Webhook System for Async AI Job Completion",
"Build a reliable webhook delivery system with exponential backoff, signature verification, dead letter queue, and a webhook management API.",
),
(
"Create Unified Embedding API with Caching Layer",
"Build a microservice abstracting over multiple embedding providers with a Redis-backed cache, batch processing, and automatic model version migration.",
),
(
"Build Conversation Memory Service for Multi-Session Agents",
"Implement a service that stores, summarizes, and retrieves conversation history across sessions using structured storage and semantic vector search.",
),
(
"Implement Usage-Based Billing with Stripe Integration",
"Build a metering system that tracks token consumption per customer, aggregates monthly invoices, and syncs with Stripe for automated usage-based charging.",
),
(
"Create Plugin Marketplace Backend",
"Design the API and data model for a marketplace where third-party developers register, version, and distribute plugins for the AI platform.",
),
(
"Build RAG Ingestion Service with Chunking and Indexing",
"Implement an async document processing service that accepts PDFs, DOCX, and HTML, chunks them, generates embeddings, and upserts into a vector store.",
),
(
"Implement Audit Logging and Compliance API",
"Build a tamper-evident audit log system recording all AI interactions and admin actions, with an API for compliance queries and SOC 2 / HIPAA exports.",
),
(
"Design Multi-Model Routing and Fallback Service",
"Create a smart routing layer directing requests to the optimal model based on task complexity, latency requirements, and cost, with provider failover.",
),
(
"Build File Processing Service for Vision-Language Models",
"Implement an async service that accepts images and documents, runs them through vision-language models for extraction, and returns structured JSON output.",
),
(
"Implement Streaming API with Server-Sent Events",
"Build an SSE-based streaming endpoint for LLM responses with connection resumption, partial response caching, and graceful degradation.",
),
],
Domain.TRAINING: [
(
"Fine-Tune Llama-3 8B for Domain-Specific Support",
"Run supervised fine-tuning on 50K curated customer support conversations using QLoRA, targeting 15% accuracy improvement over the base model.",
),
(
"Implement RLHF Pipeline for Code Generation Model",
"Build an end-to-end RLHF pipeline with a reward model trained on human preference data and PPO training loop evaluated against HumanEval.",
),
(
"Distill GPT-4 Class Model into Efficient 3B Model",
"Use knowledge distillation with synthetic data to create a compact model retaining 90%+ teacher performance on targeted tasks at 10x lower inference cost.",
),
(
"Train Custom Embedding Model for Vertical Search",
"Fine-tune a sentence-transformers model on domain-specific query-document pairs with contrastive learning, hard negative mining, and retrieval benchmarks.",
),
(
"Build Hyperparameter Search for Fine-Tuning Jobs",
"Implement an Optuna-based HPO system searching over learning rate, LoRA rank, batch size, and data mixing ratios with early stopping.",
),
(
"Run Continued Pre-Training on Proprietary Corpus",
"Execute continued pre-training of a 7B base model on 10B tokens of domain-specific text with careful learning rate scheduling to avoid catastrophic forgetting.",
),
(
"Train Reward Model from Preference Annotations",
"Collect and process 20K pairwise preference annotations, train a Bradley-Terry reward model, and validate calibration against held-out human judgments.",
),
(
"Build Multi-GPU Training Infra with DeepSpeed",
"Set up distributed training using DeepSpeed ZeRO Stage 3 across an 8-node GPU cluster with checkpoint sharding and fault-tolerant resumption.",
),
(
"Implement DPO Fine-Tuning Pipeline",
"Build a Direct Preference Optimization pipeline as a simpler RLHF alternative, comparing quality and training stability on the same preference dataset.",
),
(
"Train Vision-Language Adapter for Document Understanding",
"Fine-tune a LoRA adapter on a VLM for extracting structured data from invoices, receipts, and forms with 95%+ field-level accuracy.",
),
(
"Build Eval-Driven Training Loop with Auto Checkpointing",
"Implement a training harness that runs benchmarks every N steps, auto-saves the best checkpoint, detects instability, and alerts on loss spikes.",
),
(
"Fine-Tune Whisper for Industry-Specific Transcription",
"Adapt Whisper-large for medical dictation using 500 hours of labeled audio, targeting 30% WER reduction on domain-specific terminology.",
),
],
Domain.HARDWARE: [
(
"Optimize LLM Inference Latency with TensorRT-LLM",
"Convert a 70B model to TensorRT-LLM with INT8/FP8 quantization, continuous batching, and paged attention, targeting sub-200ms time-to-first-token.",
),
(
"Deploy On-Device ML Model for Mobile Classification",
"Convert a PyTorch vision model to Core ML and TFLite, optimize with quantization-aware training, and benchmark on iPhone and Pixel hardware.",
),
(
"Build GPU Cluster Scheduling with Fair-Share Queuing",
"Implement a scheduler for a shared GPU cluster enforcing per-team quotas, priority queuing, preemption policies, and utilization-based chargeback.",
),
(
"Implement Quantization Pipeline (GPTQ/AWQ/GGUF)",
"Build an automated pipeline that takes any model, produces GPTQ, AWQ, and GGUF quantized variants, runs quality regression, and publishes passing models.",
),
(
"Deploy Edge Inference for Real-Time Video Analytics",
"Set up an NVIDIA Jetson-based inference node running YOLO and a lightweight LLM for on-premises real-time camera analysis with local data processing.",
),
(
"Optimize vLLM Serving for Production Workload",
"Profile and tune vLLM parameters—max batch size, KV cache, swap space, tensor parallelism—for target throughput at P99 latency SLA.",
),
(
"Build Multi-GPU Inference with Tensor Parallelism",
"Configure and benchmark a 70B+ model serving across 4-8 GPUs with tensor and pipeline parallelism, optimizing throughput versus latency tradeoffs.",
),
(
"Implement Dynamic Batching for Inference Requests",
"Build a request batching layer that groups incoming requests by sequence length and priority, maximizing GPU utilization within per-request latency SLAs.",
),
(
"Design Hybrid CPU/GPU Inference Architecture",
"Architect a system routing lightweight requests to CPU inference and complex requests to GPU instances, reducing overall compute cost by 40%.",
),
(
"Set Up Triton Inference Server for Multi-Model Serving",
"Deploy NVIDIA Triton to serve embedding, reranking, and generation models on shared GPU infrastructure with dynamic batching and concurrency control.",
),
(
"Build GPU Health Monitoring and Failover System",
"Implement a daemon detecting GPU memory errors, thermal throttling, and NVLink degradation, automatically draining affected nodes and redistributing workloads.",
),
(
"Benchmark Specialized AI Accelerators vs H100",
"Evaluate Groq, Cerebras, and custom ASICs against H100 GPUs, producing a cost-per-token and latency comparison with a migration recommendation.",
),
(
"Implement Speculative Decoding in Production Stack",
"Integrate speculative decoding with a small draft model into the existing serving infrastructure, measuring real-world throughput improvement.",
),
],
}
def pick_task_text(rng, domain: Domain) -> tuple[str, str]:
"""Deterministically pick a (title, description) for *domain* using *rng*."""
pool = TASK_POOL[domain]
idx = rng.randint(0, len(pool) - 1)
return pool[idx]