mirror of https://github.com/collinear-ai/yc-bench.git
synced 2026-04-19 12:58:03 +00:00

Updated backend to calculate employee tier with spiky skill distribution; simplified domain count to 4

parent 6d6f0a855d · commit eb18c5a90c
22 changed files with 226 additions and 864 deletions

README.md (453)

@@ -2,180 +2,7 @@
A long-horizon deterministic benchmark for LLM agents. The agent plays CEO of an AI startup over a simulated 1–3 year run, operating exclusively through a CLI tool against a SQLite-backed discrete-event simulation.

The benchmark tests whether agents can manage compounding decisions: prestige specialisation, employee allocation, cash flow, and deadline risk — sustained over hundreds of turns.

---

## Simulation Dynamics

![Simulation dynamics](assets/sim_dynamics.png)

<!-- ```
┌─────────────────────────────────────────────────────────────────────────┐
│                              AGENT (LLM)                                │
│                                                                         │
│  Observes: company status · employee skills · market tasks · ledger    │
│  Acts via: run_command("yc-bench <cmd>") · scratchpad (persistent)     │
└───────────────────────┬─────────────────────────────────────────────────┘
                        │ CLI commands (JSON responses)
                        ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                      DISCRETE-EVENT SIMULATION                          │
│                                                                         │
│  ┌─────────────┐   accept    ┌──────────┐   assign+dispatch            │
│  │   MARKET    │ ──────────► │ PLANNED  │ ──────────────────►          │
│  │  100 tasks  │             └──────────┘                              │
│  └─────────────┘                                                       │
│      ▲ replenish             ┌──────────────────────┐                  │
│      │                       │        ACTIVE        │                  │
│      │   ┌────────────────── │  progress flushes    │                  │
│      │   │                   │  every sim-advance   │                  │
│      │   │                   └──────────┬───────────┘                  │
│      │   │          ┌───────────────────────────────────┘              │
│      │   │          │ ETA solver fires TASK_COMPLETED event            │
│      │   │          ▼                                                  │
│      │   │  ┌────────────────────────────────────────────────────┐    │
│      │   │  │              TASK_COMPLETED handler                │    │
│      │   │  │                                                    │    │
│      │   │  │  on_time? YES → +reward_funds +prestige_delta      │    │
│      │   │  │                 +skill_boost +salary_bump          │    │
│      │   │  │           NO  → -1.4× prestige_delta (penalty)     │    │
│      └───┘  └─────────────────────┬──────────────────────────────┘    │
│                                   │                                    │
│      ┌────────────────────────────┘                                    │
│  │ Monthly payroll (1st biz day)    Bankruptcy check (funds < 0)       │
│  │ Horizon end (1–3 years)          Context truncation (last 20 rounds)│
└──┴──────────────────────────────────────────────────────────────────────┘
``` -->

### Core loop

1. Agent calls `yc-bench sim resume` to advance time to the next event.
2. The engine flushes task progress, fires due events, applies payroll.
3. Agent reads wake events and decides: accept tasks, assign employees, dispatch, cancel.
4. Repeat until bankruptcy or horizon end.

If the agent doesn't call `sim resume` for N consecutive turns (default 10), the loop forces one automatically.
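
The loop above can be sketched as a thin driver. `run_command` and `decide` are illustrative names, not part of the benchmark's API; they are injected so the sketch runs without the real CLI:

```python
import json

def agent_turn(run_command, decide):
    """One observe -> decide -> act cycle against the yc-bench CLI.

    run_command executes a shell command and returns its stdout (JSON text);
    decide maps the observed state to a list of follow-up commands.
    """
    events = json.loads(run_command("yc-bench sim resume"))      # advance time
    status = json.loads(run_command("yc-bench company status"))  # observe
    for cmd in decide(status, events):                           # act
        run_command(cmd)
    return status

# Stubbed usage: a do-nothing agent that only advances the clock.
log = []
def fake_run(cmd):
    log.append(cmd)
    return '{"funds_cents": 25000000}' if "status" in cmd else '{"events": []}'

agent_turn(fake_run, lambda status, events: [])
assert log[0] == "yc-bench sim resume"
```

In a real run the agent issues these commands through its tool-call interface; the stub only shows the control flow.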

---

## Economy

### Funds

- Start: **$250,000** (`initial_funds_cents = 25_000_000`)
- Payroll deducted on the **first business day of each month**
- Task reward formula: `base × (1 + reward_prestige_scale × (prestige_req − 1))`
- Base: triangular sample in [$5K, $100K], mode $30K
- `reward_prestige_scale = 0.55` (default): a prestige-8 task pays ~4.85× as much as a prestige-1 task
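
A worked example of the formula with the defaults (a sketch; amounts in integer cents, matching the simulation's money representation):

```python
def reward_cents(base_cents: int, prestige_req: int,
                 reward_prestige_scale: float = 0.55) -> int:
    """Scale a base reward by the task's required prestige level."""
    return round(base_cents * (1 + reward_prestige_scale * (prestige_req - 1)))

# A $30,000 (mode) base task at prestige 1 pays the base; at prestige 8
# the multiplier is 1 + 0.55 * 7 = 4.85.
assert reward_cents(3_000_000, 1) == 3_000_000
assert reward_cents(3_000_000, 8) == 14_550_000  # $145,500
```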

### Monthly payroll (5 employees, fast_test)

| Tier   | Share | Salary/month | Skill rate        |
|--------|-------|--------------|-------------------|
| Junior | 50%   | $2K–$4K      | 1.0–6.5 units/hr  |
| Mid    | 35%   | $6K–$8K      | 3.5–8.5 units/hr  |
| Senior | 15%   | $10K–$15K    | 5.5–10.0 units/hr |

Monthly payroll ≈ **$32K** (5 employees). Starting runway ≈ **7.8 months**.
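
The runway figure is just funds over burn, as a quick check:

```python
def runway_months(funds_cents: int, monthly_payroll_cents: int) -> float:
    """Months of cash left at the current burn rate (ignoring task income)."""
    return funds_cents / monthly_payroll_cents

# $250K starting funds / ~$32K monthly payroll
assert round(runway_months(25_000_000, 3_200_000), 1) == 7.8
```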

### Task completion rewards

On success:
- Funds += `reward_funds_cents`
- Prestige += `reward_prestige_delta` (beta-distributed, typically 0.1–1.5) per required domain
- Skill rate += `skill_boost_pct × current_rate` per assigned employee per domain
- Salary += `1% × current_salary` per assigned employee (compounding payroll pressure)

On failure (past deadline):
- Prestige −= `1.4 × reward_prestige_delta` per domain

On cancel:
- Prestige −= `2.0 × reward_prestige_delta` per domain
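
A minimal sketch of the per-domain prestige bookkeeping, assuming the default multipliers and the [1.0, 10.0] clamp described in the Prestige section (`apply_outcome` is an illustrative name, not the benchmark's API):

```python
def apply_outcome(prestige: float, delta: float, outcome: str) -> float:
    """New prestige for one required domain after a task outcome."""
    if outcome == "success":
        prestige += delta
    elif outcome == "late":
        prestige -= 1.4 * delta   # penalty_fail_multiplier
    elif outcome == "cancel":
        prestige -= 2.0 * delta   # penalty_cancel_multiplier
    return min(10.0, max(1.0, prestige))  # hard clamp to [1.0, 10.0]

assert apply_outcome(3.0, 0.5, "success") == 3.5
assert apply_outcome(1.2, 0.5, "cancel") == 1.0   # clamped at the floor
```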

---

## Prestige

7 domains: `system · research · data · frontend · backend · training · hardware`

- Range: **[1.0, 10.0]** per domain, starts at 1.0
- Tasks require a minimum prestige level. The agent can only accept tasks where `max(company_prestige) >= required_prestige`.
- Default distribution: mode=4, so most tasks need prestige 3–5.
- First 10 market tasks are stratified `[1,1,1,1,2,2,2,3,3,4]` to bootstrap progression.

Specialising in 2–3 domains unlocks progressively higher-reward tasks. Spreading thin keeps you locked at low prestige everywhere.

---

## Employee throughput

Each employee has a skill rate (units/hr) per domain.

When an employee is assigned to N active tasks simultaneously:

```
effective_rate_per_task = base_rate / N
```

Assigning one senior (rate 8.0) to 4 tasks gives 2.0 units/hr each — often worse than a junior focused on one.

Task completion time = `max(remaining[d] / effective_rate[d])` across all required domains.

Deadline = `max(7, total_required_qty / deadline_qty_per_day)` business days.

`deadline_qty_per_day = 200` in both `challenge` and `fast_test`. With 10 employees and 5 focused per domain, team throughput ≈ 230 units/domain/day — achievable for up to ~4 simultaneous tasks.
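
These formulas can be checked numerically. The sketch below assumes 9 work-hours per business day (09:00–18:00) and rounds deadlines up to whole days; both details are assumptions, not confirmed implementation behaviour:

```python
import math

def completion_days(remaining: dict, team_rates: dict,
                    hours_per_day: float = 9.0) -> float:
    """A task finishes when its slowest required domain does."""
    return max(qty / (team_rates[d] * hours_per_day)
               for d, qty in remaining.items())

def deadline_biz_days(total_required_qty: float,
                      deadline_qty_per_day: float = 200,
                      deadline_min_biz_days: int = 7) -> int:
    """Deadline = max(minimum, total work / allowed daily pace)."""
    return max(deadline_min_biz_days,
               math.ceil(total_required_qty / deadline_qty_per_day))

# One senior (rate 8.0) split across 4 tasks contributes 2.0 units/hr to each.
assert 8.0 / 4 == 2.0
assert deadline_biz_days(1400) == 7
```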

---

## Agent interface

All commands return JSON to stdout.

### Observe

```bash
yc-bench company status                 # funds, prestige, runway, payroll
yc-bench employee list                  # skills, salary, active tasks
yc-bench market browse                  # available tasks (--limit N --offset N)
yc-bench task list [--status X]         # planned|active|completed_*|cancelled
yc-bench task inspect --task-id UUID    # progress %, deadline, assignments
yc-bench finance ledger                 # full transaction history
yc-bench report monthly                 # P&L per month
yc-bench scratchpad read                # persistent notes (survives context truncation)
```

### Act

```bash
yc-bench task accept --task-id UUID     # pull from market, set deadline
yc-bench task assign --task-id UUID --employee-id UUID
yc-bench task dispatch --task-id UUID   # start work (≥1 assignment required)
yc-bench task cancel --task-id UUID --reason ""   # 2× prestige penalty
yc-bench sim resume                     # advance to next event
yc-bench scratchpad write/append/clear  # persistent memory
```

---

## Context management

- **Proactive truncation**: keeps the last 20 conversation rounds before each API call. Older rounds are dropped.
- **Scratchpad**: per-company persistent text in DB. Survives truncation. Use it to store strategy, deadlines, and employee assignments.

---

## Repository layout

```
YC_Bench/
├── src/              # Python package (yc_bench)
├── scripts/          # plot_multi_model.py, run_benchmark.sh
├── logs/             # per-model stdout/stderr logs
├── db/               # SQLite databases (one per model run)
├── results/          # JSON rollout files
├── plots/            # generated PNG charts
├── pyproject.toml
└── README.md
```

---

@@ -194,8 +21,6 @@

```bash
cd YC_Bench
uv sync
```

No database setup required — the runner auto-creates `db/<config>_<seed>_<model>.db` on first run.

### API key

@@ -206,7 +31,7 @@

```bash
OPENROUTER_API_KEY="sk-or-v1-..."   # for openrouter/*
OPENAI_API_KEY="sk-..."             # for openai/*
```

### Run

@@ -215,65 +40,61 @@

```bash
uv run yc-bench run \
  --config medium
```

Outputs:
- `db/medium_1_gemini_gemini-3-flash-preview.db` — SQLite simulation state
- `results/yc_bench_result_medium_1_gemini_gemini-3-flash-preview.json` — full rollout + transcript

### Live dashboard

When running in a terminal, YC-Bench displays an interactive dashboard that updates in-place after each turn:

```
╭──────────────────────────── YC-Bench ────────────────────────────╮
│ Model    claude-haiku-4-5-20251001   seed=1   medium             │
│ Turn     8                                                       │
│ Sim Date 2025-03-06 -> 2026-01-01                                │
│ Elapsed  0h 02m 34s                                              │
│ Funds    $186,271.66   -$63,728   ██▇▃▁                          │
│ Runway   5.8mo                                                   │
│ Tasks    3 active / 3 queued   2 done   1 fail                   │
│ Team     5 people   $31,864.17/mo                                │
│ Cost     $0.0212 (3.7s/turn)                                     │
│ Action   yc-bench task dispatch 7                                │
│ Status   >> Turn 9: waiting for LLM...                           │
╰──────────────────────────────────────────────────────────────────╯
╭──────────────────────────── Tasks ───────────────────────────────╮
│ >> Build GPU Cluster     $64,152  2025-02-03  Research ==== Training ====== │
│ >> Deploy Observability  $27,908  2025-01-22  Data ===...                   │
│ .. Blue-Green Deploy     $30,780  2025-03-18  Backend ...... Data ......    │
╰──────────────────────────────────────────────────────────────────╯
╭──────────────────────────── Team ────────────────────────────────╮
│ Alice Chen    $2,564   Training===. Frontend==.. Research=...    │
│ Bob Martinez  $14,947  Backend===.  Research==..  Data==..       │
╰──────────────────────────────────────────────────────────────────╯
```

The dashboard shows:
- **Funds sparkline** — visual trend of your cash position over time
- **Color-coded progress bars** per domain on each task (green = done, yellow = partial, red = low)
- **Employee skill bars** — top 3 skills per team member with strength indicators
- **Runway urgency** — green (safe), yellow (low), red blinking (critical)
- **Salary heat** — expensive employees highlighted in red

To disable the dashboard and see raw log output instead:

```bash
uv run yc-bench run --model ... --seed 1 --config medium --no-live
```

When `--no-live` is set (or stdout is not a terminal, e.g. piped to a file), the original logging output is used. Debug logs from LiteLLM/httpx are written to `logs/debug.log` when the dashboard is active.

### Run multiple models in parallel

```bash
bash scripts/run_benchmark.sh --seed 1 --config challenge
```

### Generate the comparison plot

```bash
uv run python scripts/plot_multi_model.py --seed 1 --config challenge --budget 30
# → plots/funds_curves.png
```

---

## How it works

![Funds curves](plots/funds_curves.png)

### Core loop

1. Agent calls `yc-bench sim resume` to advance time to the next event.
2. The engine flushes task progress, fires due events, applies payroll.
3. Agent reads wake events and decides: accept tasks, assign employees, dispatch, cancel.
4. Repeat until bankruptcy or horizon end.

The simulation ends on **bankruptcy** (funds < 0 after payroll), **horizon end** (1–3 years), or **max turns** (if configured). If the agent doesn't call `sim resume` for 10 consecutive turns, the loop forces one automatically.

### Key mechanics

- **Funds**: start at $250K. Monthly payroll is deducted automatically. Task rewards scale with prestige (`base × (1 + 0.55 × (prestige − 1))`).
- **4 domains**: `research · inference · data/environment · training`. Each domain tracks prestige independently in [1.0, 10.0].
- **Prestige gating**: tasks require a minimum prestige level. Most tasks need prestige 3–5, so the agent must climb from 1.0 by completing easier tasks first. The first 10 market tasks are stratified `[1,1,1,1,2,2,2,3,3,4]` to bootstrap progression.
- **Employees**: 10 employees across 3 tiers (junior/mid/senior). The agent sees only each employee's tier and salary — not their per-domain skill rates. A junior can secretly be a superstar in one domain, so the agent must infer productivity from task progress observations.
- **Throughput splitting**: an employee assigned to N active tasks has `effective_rate = base_rate / N`. Focus beats breadth.
- **Task success**: on-time completion awards funds + prestige + skill boosts + a 1% salary bump (compounding payroll pressure). Late completion penalises prestige (1.4×). Cancellation penalises harder (2.0×).
- **Progress checkpoints**: the agent is woken at 25%, 50%, 75%, and 100% completion — providing data points to estimate employee productivity.
- **Scratchpad**: persistent notes in the DB that survive context truncation (only the last 20 conversation rounds are kept).
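
Since skill rates are hidden, the checkpoint wake-ups are the main productivity signal: two checkpoints bracket a time window, which is enough to back out a throughput estimate. A hypothetical estimator (names here are illustrative, not the benchmark's API):

```python
def estimate_rate(qty_total: float, pct_a: float, pct_b: float,
                  elapsed_hours: float) -> float:
    """Units/hour implied by moving from pct_a% to pct_b% of a task's
    total quantity over elapsed_hours of work time."""
    return qty_total * (pct_b - pct_a) / 100 / elapsed_hours

# A 1400-unit task going from the 25% to the 50% checkpoint over
# 70 work-hours implies a combined team rate of 5.0 units/hr.
assert estimate_rate(1400, 25, 50, 70) == 5.0
```

Dividing that combined rate by the number of assigned employees (and correcting for their other active tasks) gives a rough per-employee rate.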

### Agent CLI

All commands return JSON. The agent interacts via `run_command("yc-bench <cmd>")`.

```bash
# Observe
yc-bench company status                          # funds, prestige, runway
yc-bench employee list                           # tier, salary, active tasks
yc-bench market browse [--domain X] [--limit N]  # available tasks
yc-bench task list [--status X]                  # your tasks
yc-bench task inspect --task-id UUID             # progress, deadline, assignments
yc-bench finance ledger                          # transaction history
yc-bench report monthly                          # P&L per month

# Act
yc-bench task accept --task-id UUID              # pull from market
yc-bench task assign --task-id UUID --employee-id UUID
yc-bench task dispatch --task-id UUID            # start work
yc-bench task cancel --task-id UUID --reason ""  # cancel (2× prestige penalty)
yc-bench sim resume                              # advance time
yc-bench scratchpad write/append/clear           # persistent memory
```

---

@@ -282,90 +103,15 @@ uv run python scripts/plot_multi_model.py --seed 1 --config challenge --budget 3

Experiment presets live in `src/yc_bench/config/presets/` as TOML files. Pass the preset name via `--config`.

```
src/yc_bench/config/presets/
├── default.toml     # 3yr, 10 employees, 500 tasks — base config
├── tutorial.toml    # 1yr, 3 employees, 50 tasks   — learn the loop
├── easy.toml        # 1yr, 5 employees, 100 tasks  — throughput awareness
├── medium.toml      # 1yr, 5 employees, 150 tasks  — prestige strategy
├── hard.toml        # 1yr, 7 employees, 200 tasks  — precise ETA reasoning
├── nightmare.toml   # 1yr, 8 employees, 300 tasks  — sustained perfection
├── challenge.toml   # 3yr, 5 employees, 200 tasks  — long-horizon endurance
└── fast_test.toml   # 1yr, 5 employees, 100 tasks  — quick iteration
```

Each difficulty level tests one additional concept:

| Config | Employees | Tasks | Tests | Key constraint |
|--------|-----------|-------|-------|----------------|
| **tutorial** | 3 | 50 | Basic accept→assign→dispatch loop | All prestige-1, single domain |
| **easy** | 5 | 100 | Throughput awareness | Don't over-parallelize |
| **medium** | 5 | 150 | Prestige climbing + domain specialization | 2-domain tasks, prestige mode=3 |
| **hard** | 7 | 200 | Precise ETA reasoning | One bad accept degrades in-flight tasks |
| **nightmare** | 8 | 300 | Sustained perfection under compounding payroll | One failure ≈ fatal, salary bumps 2%/task |

### Key WorldConfig parameters

| Parameter | Default | Controls |
|-----------|---------|----------|
| `initial_funds_cents` | 25_000_000 | Starting cash ($250K) |
| `num_employees` | 5 | Workforce size |
| `num_market_tasks` | 100 | Market pool size |
| `required_prestige_mode` | 4 | Peak of the prestige-requirement distribution |
| `domain_count_mode` | 2 | Most tasks require 2 domains |
| `required_qty_low/mode` | 500 / 1400 | Task work volume (units) |
| `deadline_qty_per_day` | 200 | Units completable per biz day (lower = easier) |
| `deadline_min_biz_days` | 7 | Minimum deadline |
| `penalty_fail_multiplier` | 1.4 | Prestige × this on deadline miss |
| `penalty_cancel_multiplier` | 2.0 | Prestige × this on cancel |
| `reward_prestige_scale` | 0.55 | Extra reward fraction per prestige level above 1 |
| `salary_bump_pct` | 0.01 | Salary raise per employee per completed task |

### AgentConfig

| Parameter | Default | Controls |
|-----------|---------|----------|
| `model` | openrouter/openai/gpt-4o-mini | LLM model string |
| `temperature` | 0.0 | Sampling temperature |
| `history_keep_rounds` | 20 | Conversation rounds kept in context |

### LoopConfig

| Parameter | Default | Controls |
|-----------|---------|----------|
| `auto_advance_after_turns` | 5 | Force sim resume after N turns without one |
| `max_turns` | 50 | Hard cap on agent turns (null = unlimited) |

### Environment overrides

```bash
YC_BENCH_EXPERIMENT=fast_test      # select preset
DATABASE_URL=sqlite:///custom.db   # SQLite path
```

---

## Terminal conditions

| Condition | Trigger |
|-----------|---------|
| Horizon end | `sim_time >= start_date + horizon_years` |
| Bankruptcy | `funds_cents < 0` after any payroll |
| Error | Agent runtime exception (API failure, exhausted retries) |
| Max turns | `turn_count >= max_turns` (if set) |

---

## What makes it hard

The hardened default is designed so that the obvious strategies fail:

- **Prestige-1 farming** is unprofitable. Most replacement tasks need prestige 3–5 and pay much more. Farming the bottom locks you out.
- **Single-specialist dominance** is gone. Most tasks need 2 domains. You must allocate across skill combinations.
- **Speculative accepting** is punished. The cancel penalty (2×) exceeds the fail penalty (1.4×), so you can't accept everything and drop the losers.
- **Ignoring payroll** causes bankruptcy. ~$32K/month burns your $250K in 7.8 months — but task complexity means you must also pace your accepts.
- **Parallel dispatch** dilutes throughput. Splitting employees across too many tasks extends every deadline — focus beats breadth.
- **Salary bumps compound**. Every task completion raises assigned employee salaries by 1%. Payroll creep accelerates over time.
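
The compounding is easy to underestimate. A quick sketch with the default 1% bump:

```python
def salary_after(salary_cents: int, completed_tasks: int,
                 bump_pct: float = 0.01) -> int:
    """Salary after repeated per-task bumps (1% each, compounding)."""
    return round(salary_cents * (1 + bump_pct) ** completed_tasks)

# A $10K/month senior who ships 30 tasks costs ~$13.5K/month afterwards:
# 1.01^30 ≈ 1.35.
assert 1_340_000 < salary_after(1_000_000, 30) < 1_355_000
```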

See `default.toml` for the full list of tunable parameters.

---

@@ -375,15 +121,15 @@ The hardened default is designed so that the obvious strategies fail:

![Survival and final funds](plots/survival_funds.png)

#### Survival rates

| Config | Sonnet 4.6 | Gemini 3 Flash | GPT-5.2 |
|--------|------------|----------------|---------|
| **medium** | 3/3 | 3/3 | 3/3 |
| **hard** | 1/3 | 2/3 | 2/3 |
| **nightmare** | 1/3 | 3/3 | 2/3 |

#### Final funds (bankrupt = funds < 0)

| Config | Seed | Sonnet 4.6 | Gemini 3 Flash | GPT-5.2 |
|--------|------|------------|----------------|---------|

@@ -399,82 +145,21 @@ The hardened default is designed so that the obvious strategies fail:

**Overall: Gemini 8/9 · GPT-5.2 7/9 · Sonnet 5/9**

#### Key findings

- **Gemini leads on consistency** (8/9 survival). Near-perfect medium win rates (93–98%) and the only model to sweep all 3 nightmare seeds — achieved without using the scratchpad, through purely reactive, high-frequency decision-making.
- **GPT-5.2 has the highest ceiling.** Hard seed 3: $43.5M vs Gemini's $21.9M; nightmare seed 3: $23.6M vs Gemini's $805K. When it survives, it tends to outperform by a wide margin.
- **Sonnet is high-variance.** Nightmare seed 2: $10.1M (the best nightmare result), but 4/9 bankruptcies overall — it fails harder than the others on adverse seeds.
- **Hard and nightmare are the differentiator configs.** On medium all three models survive; above that, the strategies diverge sharply.
- **Win rate predicts survival.** Every run with >58% task win rate survived; every run below 40% went bankrupt, as prestige losses from failures outpace gains and lock the agent out of profitable tasks.

#### Prestige specialization

![Prestige radar](plots/prestige_radar.png)

Each radar shows final prestige across 7 domains (1 = center, 10 = edge). Large polygons = the model climbed prestige broadly. Tiny dots near the center = bankrupt before gaining any prestige. Pointy shapes = domain specialization.

**Human Devised Rule** (navy dashed) consistently fills the full radar — it methodically maxes prestige everywhere. Among LLMs, **Gemini** builds the most balanced prestige profiles. **GPT-5.2** shows clear specialization on medium (backend/data/frontend high, training untouched). **Sonnet** is bimodal: either maxes everything (medium seed 1) or collapses entirely (nightmare seeds 1 & 3).

### Why models fail

The scratchpad evolution of Sonnet on hard seed 2 tells the full story:

![Scratchpad evolution](plots/scratchpad_evolution.png)

Common failure patterns across all bankrupt runs:

1. **Over-parallelization.** Accepting 3–5 tasks at once and splitting employees across them. The effective rate per task drops below deadline requirements. Sonnet's nightmare seed 3 ran 5 tasks simultaneously with 8 employees on turn 13.
2. **No prestige gating.** Accepting prestige-2 tasks when company prestige is 1.0. The task completes late, triggers a 1.4× prestige penalty, and the agent ends up worse off than before.
3. **Late adaptation.** Sonnet correctly identifies problems in its scratchpad ("PRESTIGE CRISIS — MARKET LOCK") but only after payroll has consumed the runway. By turn 137 of hard seed 2, all tasks require prestige ≥ 2 but the company is stuck at 1.0 in 6 of 7 domains.
4. **Inconsistent ETA reasoning.** Sonnet's medium seed 2 has a 49% win rate — essentially a coin flip. It understands throughput math in its scratchpad but doesn't consistently apply it when selecting tasks.

---

## Simulation rules

- **Business time**: weekdays only, 09:00–18:00. No leap years.
- **Money**: stored as integer cents (`BIGINT`). No floating point.
- **Payroll**: fired on the first business day of each month.
- **Event ordering**: deterministic — `(scheduled_at, priority, id)`.
- **Determinism**: all task generation and employee seeding is reproducible given `--seed`.
- **Prestige**: `NUMERIC(6,3)`, hard clamped to `[1.0, 10.0]`.
- **DB reuse**: if a simulation is terminal (bankrupt or horizon reached), re-running with the same DB wipes and reseeds cleanly.

---

## Output format

`results/yc_bench_result_<config>_<seed>_<model>.json`:

```json
{
  "session_id": "run-1-openrouter/openai/gpt-4o-mini",
  "model": "openrouter/openai/gpt-4o-mini",
  "seed": 1,
  "horizon_years": 1,
  "turns_completed": 46,
  "terminal": true,
  "terminal_reason": "bankruptcy",
  "total_cost_usd": 0.100008,
  "started_at": "...",
  "ended_at": "...",
  "transcript": [
    {
      "turn": 1,
      "timestamp": "...",
      "user_input": "## Simulation Start ...",
      "agent_output": "Executed 3 tool call(s): ...",
      "commands_executed": ["yc-bench company status -> {...}", ...]
    }
  ]
}
```
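
A minimal reader for rollout files in this shape (field names taken from the example above):

```python
import json

def summarize_rollout(path: str) -> str:
    """One-line summary of a results/yc_bench_result_*.json file."""
    with open(path) as f:
        r = json.load(f)
    return (f"{r['model']} seed={r['seed']}: "
            f"{r['turns_completed']} turns, {r['terminal_reason']}")
```

Point it at any file a run has produced to get, e.g., the model, seed, turn count, and terminal reason on one line.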

Please cite our work if you find it useful!

```bibtex
@misc{collinear-ai2025ycbench,
  author = {{Collinear AI}},
@@ -344,13 +344,12 @@ def run_bot(config_name: str, seed: int, bot_slug: str, strategy_fn: StrategyFn)
         company_id=None,
         status=TaskStatus.MARKET,
         title=replacement.title,
         description=replacement.description,
         required_prestige=replacement.required_prestige,
         reward_funds_cents=replacement.reward_funds_cents,
         reward_prestige_delta=replacement.reward_prestige_delta,
         skill_boost_pct=replacement.skill_boost_pct,
         accepted_at=None, deadline=None, completed_at=None,
-        success=None, halfway_event_emitted=False,
+        success=None, progress_milestone_pct=0,
     )
     db.add(replacement_row)
     for domain, qty in replacement.requirements.items():

@@ -375,7 +374,7 @@ def run_bot(config_name: str, seed: int, bot_slug: str, strategy_fn: StrategyFn)

     recalculate_etas(db, company_id, sim_state.sim_time,
                      impacted_task_ids={best_task.id},
-                     half_threshold=world_cfg.task_half_threshold)
+                     milestones=world_cfg.task_progress_milestones)

     task_cycles_used += 1

@@ -50,8 +50,8 @@ CONFIGS = ["medium", "hard", "nightmare"]
 SEEDS = [1, 2, 3]
 DIFF_COLORS = {"medium": BLUE, "hard": ORANGE, "nightmare": "#DC2626"}

-DOMAINS = ["system", "research", "data", "frontend", "backend", "training", "hardware"]
-DOMAIN_LABELS = ["SYS", "RES", "DATA", "FE", "BE", "TRAIN", "HW"]
+DOMAINS = ["research", "inference", "data_environment", "training"]
+DOMAIN_LABELS = ["RES", "INF", "DATA/ENV", "TRAIN"]


 def load_logo_image(height_px=80):

@@ -23,13 +23,10 @@ engine = build_engine()
 factory = build_session_factory(engine)

 DOMAIN_COLORS = {
-    "training": "#e67e22",
-    "research": "#3498db",
-    "backend": "#2ecc71",
-    "hardware": "#9b59b6",
-    "data": "#1abc9c",
-    "frontend": "#e74c3c",
-    "system": "#95a5a6",
+    "research": "#3498db",
+    "inference": "#9b59b6",
+    "data_environment": "#1abc9c",
+    "training": "#e67e22",
 }

 with session_scope(factory) as db:

@@ -18,7 +18,7 @@ Your goal is to maximize company prestige and funds over the simulation horizon

 ### Observe
 - `yc-bench company status` — funds, prestige, employee count, payroll, bankruptcy risk
-- `yc-bench employee list` — list all employees with IDs, salaries, skill rates, and current assignments
+- `yc-bench employee list` — list all employees with IDs, tier (junior/mid/senior), salaries, and current assignments
 - `yc-bench market browse [--domain X] [--required-prestige-lte N] [--reward-min-cents N] [--limit N] [--offset N]` — browse available tasks (default limit 50; the response includes a `total` field — if total > 50, paginate with --offset to see more)
 - `yc-bench task list [--status X]` — list your tasks (planned, active, completed, cancelled)
 - `yc-bench task inspect --task-id <UUID>` — detailed task info (requirements, assignments, progress)

@@ -106,7 +106,8 @@ def build_turn_context(
             tid = ev.get("task_id", "?")
             parts.append(f"- Task {tid}: {'SUCCESS' if success else 'FAILED'}")
         elif ev_type == "task_half":
-            parts.append(f"- Task {ev.get('task_id', '?')}: 50% progress reached")
+            pct = ev.get("milestone_pct", "?")
+            parts.append(f"- Task {ev.get('task_id', '?')}: {pct}% progress reached")
         elif ev_type == "horizon_end":
             parts.append("- **Horizon end reached. Simulation complete.**")
         elif ev_type == "bankruptcy":

@@ -9,7 +9,7 @@ from uuid import UUID

 import typer

-from ..db.session import build_engine, build_session_factory, session_scope
+from ..db.session import build_engine, build_session_factory, init_db, session_scope

 app = typer.Typer(name="yc-bench", add_completion=False)

@@ -22,6 +22,7 @@ app = typer.Typer(name="yc-bench", add_completion=False)
 def get_db():
     """Yield a transactional SQLAlchemy session, commit on success."""
     engine = build_engine()
+    init_db(engine)
     factory = build_session_factory(engine)
     with session_scope(factory) as session:
         yield session

@@ -3,7 +3,7 @@ from __future__ import annotations
 import typer
 from sqlalchemy import func

-from ..db.models.employee import Employee, EmployeeSkillRate
+from ..db.models.employee import Employee
 from ..db.models.task import Task, TaskAssignment, TaskStatus
 from ..db.models.sim_state import SimState
 from . import get_db, json_output, error_output

@@ -25,15 +25,6 @@ def employee_list():

     results = []
     for emp in employees:
-        # Skills
-        skills = db.query(EmployeeSkillRate).filter(
-            EmployeeSkillRate.employee_id == emp.id
-        ).all()
-        skill_map = {
-            s.domain.value: float(s.rate_domain_per_hour)
-            for s in skills
-        }
-
         # Current active assignments
         active_assignments = (
             db.query(TaskAssignment.task_id)

@@ -49,9 +40,9 @@ def employee_list():
         results.append({
             "employee_id": str(emp.id),
             "name": emp.name,
+            "tier": emp.tier,
             "salary_cents": emp.salary_cents,
             "work_hours_per_day": float(emp.work_hours_per_day),
-            "skills": skill_map,
             "active_task_count": len(active_task_ids),
             "active_task_ids": active_task_ids,
         })

@@ -58,7 +58,6 @@ def market_browse(
     results.append({
         "task_id": str(task.id),
         "title": task.title,
         "description": task.description,
         "required_prestige": task.required_prestige,
         "reward_funds_cents": task.reward_funds_cents,
         "reward_prestige_delta": float(task.reward_prestige_delta),
|
||||
|
|
|
|||
|
|
@@ -95,7 +95,6 @@ def task_accept(
         company_id=None,
         status=TaskStatus.MARKET,
         title=replacement.title,
-        description=replacement.description,
         required_prestige=replacement.required_prestige,
         reward_funds_cents=replacement.reward_funds_cents,
         reward_prestige_delta=replacement.reward_prestige_delta,
@@ -104,7 +103,7 @@ def task_accept(
         deadline=None,
         completed_at=None,
         success=None,
-        halfway_event_emitted=False,
+        progress_milestone_pct=0,
     )
     db.add(replacement_row)
@@ -185,7 +184,7 @@ def task_assign(
         if t and t.status == TaskStatus.ACTIVE:
             impacted.add(t.id)
     if impacted:
-        recalculate_etas(db, sim_state.company_id, sim_state.sim_time, impacted, half_threshold=_get_world_cfg().task_half_threshold)
+        recalculate_etas(db, sim_state.company_id, sim_state.sim_time, impacted, milestones=_get_world_cfg().task_progress_milestones)

     # Return current assignment list
     assignments = db.query(TaskAssignment).filter(TaskAssignment.task_id == tid).all()
@@ -251,7 +250,7 @@ def task_dispatch(
         peer_task = db.query(Task).filter(Task.id == pa.task_id).one_or_none()
         if peer_task and peer_task.status == TaskStatus.ACTIVE:
             impacted.add(peer_task.id)
-    recalculate_etas(db, sim_state.company_id, sim_state.sim_time, impacted, half_threshold=_get_world_cfg().task_half_threshold)
+    recalculate_etas(db, sim_state.company_id, sim_state.sim_time, impacted, milestones=_get_world_cfg().task_progress_milestones)

     json_output({
         "task_id": str(task.id),
@@ -353,7 +352,6 @@ def task_inspect(
     json_output({
         "task_id": str(task.id),
         "title": task.title,
-        "description": task.description,
         "status": task.status.value,
         "required_prestige": task.required_prestige,
         "reward_funds_cents": task.reward_funds_cents,
@@ -442,7 +440,7 @@ def task_cancel(
         if t and t.status == TaskStatus.ACTIVE:
             impacted.add(t.id)
     if impacted:
-        recalculate_etas(db, sim_state.company_id, sim_state.sim_time, impacted, half_threshold=_get_world_cfg().task_half_threshold)
+        recalculate_etas(db, sim_state.company_id, sim_state.sim_time, impacted, milestones=_get_world_cfg().task_progress_milestones)

     # Bankruptcy check
     company = db.query(Company).filter(Company.id == sim_state.company_id).one()
@@ -82,8 +82,8 @@ reward_prestige_scale = 0.55 # hardened: was 0.3
 deadline_qty_per_day = 320.0 # hardened: was 200.0
 deadline_min_biz_days = 7

-# --- Progress milestone ---
-task_half_threshold = 0.5
+# --- Progress milestones (checkpoint events at these completion fractions) ---
+task_progress_milestones = [0.25, 0.5, 0.75]

 # --- Business hours ---
 workday_start_hour = 9
@@ -161,20 +161,20 @@ share = 0.50
 min_cents = 200_000 # $2,000/month
 max_cents = 400_000 # $4,000/month
 rate_min = 1.0 # units/hour
-rate_max = 6.5
+rate_max = 4.0

 [world.salary_mid]
 name = "mid"
 share = 0.35
 min_cents = 600_000 # $6,000/month
 max_cents = 800_000 # $8,000/month
-rate_min = 3.5
-rate_max = 8.5
+rate_min = 4.0
+rate_max = 7.0

 [world.salary_senior]
 name = "senior"
 share = 0.15
 min_cents = 1_000_000 # $10,000/month
 max_cents = 1_500_000 # $15,000/month
-rate_min = 5.5
+rate_min = 7.0
 rate_max = 10.0
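After this retune the three tier rate bands tile [1.0, 10.0] exactly (the old bands overlapped: junior up to 6.5 while mid started at 3.5), and the hiring shares still sum to 1. A quick consistency check over the values above (the dicts simply mirror the TOML; this is an illustration, not repo code):

```python
# Tier values copied from the [world.salary_*] tables above.
tiers = [
    {"name": "junior", "share": 0.50, "rate_min": 1.0, "rate_max": 4.0},
    {"name": "mid",    "share": 0.35, "rate_min": 4.0, "rate_max": 7.0},
    {"name": "senior", "share": 0.15, "rate_min": 7.0, "rate_max": 10.0},
]

# Shares describe the nominal tier mix and should sum to 1.
shares_total = sum(t["share"] for t in tiers)

# Each tier's band ends exactly where the next begins: no overlap, no gap.
contiguous = all(
    tiers[i]["rate_max"] == tiers[i + 1]["rate_min"]
    for i in range(len(tiers) - 1)
)
```

Contiguous bands mean an employee's tier is now recoverable from its average rate alone, which matters once per-domain rates become spiky.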
@@ -128,8 +128,8 @@ class WorldConfig(BaseModel):
     deadline_qty_per_day: float = 200.0  # work units assumed completable per business day
     deadline_min_biz_days: int = 7

-    # --- Progress milestone ---
-    task_half_threshold: float = 0.5
+    # --- Progress milestones (fraction thresholds that trigger checkpoint events) ---
+    task_progress_milestones: list[float] = Field(default_factory=lambda: [0.25, 0.5, 0.75])

     # --- Business hours ---
     workday_start_hour: int = 9
@@ -143,21 +143,21 @@ class WorldConfig(BaseModel):
         default_factory=lambda: SalaryTierConfig(
             name="junior", share=0.50,
             min_cents=200_000, max_cents=400_000,
-            rate_min=1.0, rate_max=6.5,
+            rate_min=1.0, rate_max=4.0,
         )
     )
     salary_mid: SalaryTierConfig = Field(
         default_factory=lambda: SalaryTierConfig(
             name="mid", share=0.35,
             min_cents=600_000, max_cents=800_000,
-            rate_min=3.5, rate_max=8.5,
+            rate_min=4.0, rate_max=7.0,
         )
     )
     salary_senior: SalaryTierConfig = Field(
         default_factory=lambda: SalaryTierConfig(
             name="senior", share=0.15,
             min_cents=1_000_000, max_cents=1_500_000,
-            rate_min=5.5, rate_max=10.0,
+            rate_min=7.0, rate_max=10.0,
         )
     )
@@ -74,13 +74,16 @@ def dispatch_event(db: Session, event: SimEvent, sim_time: datetime, company_id:
     """Route event to appropriate handler. Returns result dict."""
     if event.event_type == EventType.TASK_HALF_PROGRESS:
         result = handle_task_half(db, event)
-        return {"type": "task_half", "task_id": str(result.task_id), "handled": result.handled}
+        # Recalculate ETAs so the next milestone is scheduled
+        from ..config import get_world_config
+        recalculate_etas(db, company_id, sim_time, milestones=get_world_config().task_progress_milestones)
+        return {"type": "task_half", "task_id": str(result.task_id), "milestone_pct": result.milestone_pct, "handled": result.handled}

     elif event.event_type == EventType.TASK_COMPLETED:
         result = handle_task_complete(db, event, sim_time)
         # Recalculate ETAs — freed employees change topology
         from ..config import get_world_config
-        recalculate_etas(db, company_id, sim_time, half_threshold=get_world_config().task_half_threshold)
+        recalculate_etas(db, company_id, sim_time, milestones=get_world_config().task_progress_milestones)
         return {
             "type": "task_completed",
             "task_id": str(result.task_id),
@@ -185,15 +185,20 @@ def recalculate_etas(
     company_id: UUID,
     now: datetime,
     impacted_task_ids: Optional[Set[UUID]] = None,
+    milestones: Optional[List[float]] = None,
+    # Legacy single-threshold parameter — ignored if milestones is provided.
     half_threshold: float = 0.5,
 ) -> None:
     """Recalculate projection events for active tasks.

     1. Delete stale projection events for impacted tasks (or all if None).
     2. Compute effective rates.
-    3. For each active task, solve completion and halfway times.
+    3. For each active task, solve completion and milestone times.
     4. Insert new projection events.
     """
+    if milestones is None:
+        milestones = [half_threshold]
+
     # Determine which tasks to recalculate
     if impacted_task_ids is None:
         active_tasks = db.query(Task).filter(
@@ -240,18 +245,26 @@ def recalculate_etas(
                 dedupe_key=f"task:{tid}:completed",
             )

-        # Halfway ETA (only if not already emitted)
-        if not task.halfway_event_emitted:
-            halfway_time = solve_task_halfway_time(db, tid, now, rates, half_threshold=half_threshold)
-            if halfway_time is not None:
+        # Progress milestone ETAs — skip milestones already emitted
+        emitted_pct = task.progress_milestone_pct or 0
+        for milestone in sorted(milestones):
+            milestone_pct = int(milestone * 100)
+            if milestone_pct <= emitted_pct:
+                continue
+            milestone_time = solve_task_halfway_time(db, tid, now, rates, half_threshold=milestone)
+            if milestone_time is not None:
                 insert_event(
                     db,
                     company_id=company_id,
                     event_type=EventType.TASK_HALF_PROGRESS,
-                    scheduled_at=halfway_time,
-                    payload={"task_id": str(tid)},
-                    dedupe_key=f"task:{tid}:half",
+                    scheduled_at=milestone_time,
+                    payload={"task_id": str(tid), "milestone_pct": milestone_pct},
+                    dedupe_key=f"task:{tid}:milestone:{milestone_pct}",
                 )
+                # Only insert the next upcoming milestone — it will be the
+                # earliest event; once consumed, recalculate_etas runs again
+                # and inserts the following one.
+                break

     db.flush()
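The scheduler only ever inserts the earliest not-yet-emitted milestone and then breaks; when that event is consumed, `recalculate_etas` runs again and inserts the next one. The selection rule can be exercised in isolation (the helper name `next_milestone_pct` is hypothetical, chosen for the sketch):

```python
def next_milestone_pct(emitted_pct, milestones):
    """Return the next milestone (as an integer percent) above what has
    already been emitted, or None when the task is past every milestone."""
    for milestone in sorted(milestones):
        pct = int(milestone * 100)
        if pct > emitted_pct:
            return pct
    return None

milestones = [0.25, 0.5, 0.75]          # task_progress_milestones default
first = next_milestone_pct(0, milestones)       # fresh task
after_half = next_milestone_pct(50, milestones)  # 25% and 50% already fired
done = next_milestone_pct(75, milestones)        # all milestones emitted
```

Scheduling one milestone at a time keeps the event queue small and lets a mid-run reassignment reschedule the next checkpoint against the new effective rates.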
@@ -1,4 +1,4 @@
-"""Handler for task_half_progress events."""
+"""Handler for task progress milestone events."""
 from __future__ import annotations

 from dataclasses import dataclass
@@ -14,17 +14,19 @@ from ...db.models.task import Task
 class TaskHalfResult:
     task_id: UUID
     handled: bool
+    milestone_pct: int


 def handle_task_half(db: Session, event: SimEvent) -> TaskHalfResult:
-    """Mark the task's halfway_event_emitted flag as True."""
+    """Record the progress milestone on the task."""
     task_id = UUID(event.payload["task_id"])
+    milestone_pct = event.payload.get("milestone_pct", 50)
     task = db.query(Task).filter(Task.id == task_id).one_or_none()

     if task is None:
-        return TaskHalfResult(task_id=task_id, handled=False)
+        return TaskHalfResult(task_id=task_id, handled=False, milestone_pct=milestone_pct)

-    task.halfway_event_emitted = True
+    task.progress_milestone_pct = max(task.progress_milestone_pct or 0, milestone_pct)
     db.flush()

-    return TaskHalfResult(task_id=task_id, handled=True)
+    return TaskHalfResult(task_id=task_id, handled=True, milestone_pct=milestone_pct)
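The `max(...)` update makes milestone recording monotonic: a late or re-delivered 25% event can never roll a task back below an already-recorded 50%, and a `None` column value is treated as 0. A standalone sketch of that invariant (the `record_milestone` helper is hypothetical, mirroring the single line in the handler):

```python
def record_milestone(current_pct, event_pct):
    """Monotonic milestone update, mirroring
    task.progress_milestone_pct = max(task.progress_milestone_pct or 0, milestone_pct)."""
    return max(current_pct or 0, event_pct)

fresh = record_milestone(None, 25)    # first event on a freshly seeded task
advance = record_milestone(25, 50)    # normal forward progress
stale = record_milestone(50, 25)      # out-of-order event is a no-op
```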
@@ -10,13 +10,10 @@ from sqlalchemy.orm import mapped_column
 from ..base import Base

 class Domain(str, Enum):
-    SYSTEM = "system"
     RESEARCH = "research"
-    DATA = "data"
-    FRONTEND = "frontend"
-    BACKEND = "backend"
+    INFERENCE = "inference"
+    DATA_ENVIRONMENT = "data_environment"
     TRAINING = "training"
-    HARDWARE = "hardware"

 class Company(Base):
     __tablename__ = "companies"
@@ -30,6 +30,11 @@ class Employee(Base):
         String(255),
         nullable=False,
     )
+    tier = mapped_column(
+        String(20),
+        nullable=False,
+        default="junior",
+    )
     work_hours_per_day = mapped_column(
         Numeric(5, 2),
         nullable=False,
@@ -45,10 +45,6 @@ class Task(Base):
         String(255),
         nullable=False,
     )
-    description = mapped_column(
-        String,
-        nullable=False,
-    )
     required_prestige = mapped_column(
         Integer,
         nullable=False,
@@ -81,11 +77,11 @@ class Task(Base):
         Boolean,
         nullable=True,
     )
-    halfway_event_emitted = mapped_column(
-        Boolean,
+    progress_milestone_pct = mapped_column(
+        Integer,
         nullable=False,
-        default=False,
-        server_default=text("false"),
+        default=0,
+        server_default=text("0"),
     )

 class TaskRequirement(Base):
@@ -18,13 +18,10 @@ SPARK_CHARS = "▁▂▃▄▅▆▇█"

 # Domain → (display name, color) for styled inline display
 DOMAIN_STYLE = {
-    "system": ("System", "bright_cyan"),
-    "research": ("Research", "bright_magenta"),
-    "data": ("Data", "bright_blue"),
-    "frontend": ("Frontend", "bright_yellow"),
-    "backend": ("Backend", "bright_green"),
-    "training": ("Training", "red"),
-    "hardware": ("Hardware", "white"),
+    "research": ("Research", "bright_magenta"),
+    "inference": ("Inference", "bright_cyan"),
+    "data_environment": ("Data/Env", "bright_blue"),
+    "training": ("Training", "red"),
 }
@@ -132,7 +129,7 @@ def _query_detailed_snapshot(db_factory, company_id) -> dict[str, Any]:
         ]
         deadline_str = t.deadline.strftime("%Y-%m-%d") if t.deadline else "-"
         tasks_detail.append(TaskInfo(
-            title=t.title,
+            title=t.title[:20],
             status=status.value,
             prestige=t.required_prestige,
             reward_dollars=t.reward_funds_cents / 100.0,
@@ -1,5 +1,6 @@
 from __future__ import annotations

+import math
 from dataclasses import dataclass

 from ..config.schema import WorldConfig
@@ -7,6 +8,18 @@ from ..db.models.company import Domain
 from .rng import RngStreams, sample_right_skew_triangular_int

 _ALL_DOMAINS = list(Domain)
 _NUM_DOMAINS = len(_ALL_DOMAINS)

+# Fixed tier composition for a 10-person startup.
+# Repeated to cover any employee count via modular indexing.
+_TIER_SEQUENCE = [
+    "junior", "junior", "junior", "junior", "junior",
+    "mid", "mid", "mid",
+    "senior", "senior",
+]
+
+_MIN_RATE = 1.0
+_MAX_RATE = 10.0
+

 @dataclass(frozen=True)
@@ -22,16 +35,6 @@ def _salary_tiers(cfg):
     return (cfg.salary_junior, cfg.salary_mid, cfg.salary_senior)


-def _pick_tier_name(rng, cfg):
-    x = rng.random()
-    acc = 0.0
-    for tier in _salary_tiers(cfg):
-        acc += tier.share
-        if acc >= x:
-            return tier.name
-    return _salary_tiers(cfg)[-1].name
-
-
 def _tier_by_name(cfg, tier_name):
     for tier in _salary_tiers(cfg):
         if tier.name == tier_name:
@@ -44,10 +44,49 @@ def _sample_salary_cents(rng, cfg, tier_name):
     return sample_right_skew_triangular_int(rng, tier.min_cents, tier.max_cents)


-def _sample_rates_by_domain(rng, cfg, tier_name):
-    tier = _tier_by_name(cfg, tier_name)
-    lo, hi = tier.rate_min, tier.rate_max
-    return {domain: round(rng.uniform(lo, hi), 4) for domain in _ALL_DOMAINS}
+def _dirichlet_sample(rng, alpha, k):
+    """Sample from Dirichlet(alpha, ..., alpha) with k components."""
+    raw = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
+    total = sum(raw)
+    if total == 0:
+        return [1.0 / k] * k
+    return [x / total for x in raw]
+
+
+def _distribute_rates(rng, avg_rate, dirichlet_alpha=0.3):
+    """Distribute a rate budget across domains with spiky concentration.
+
+    Each domain gets at least _MIN_RATE. The extra budget is split via
+    Dirichlet(alpha) so that one or two domains can be dramatically higher
+    than the rest — a junior can secretly be a superstar in one domain.
+    Individual rates are capped at _MAX_RATE.
+    """
+    total_budget = avg_rate * _NUM_DOMAINS
+    extra = total_budget - _NUM_DOMAINS * _MIN_RATE
+
+    if extra <= 0:
+        return [_MIN_RATE] * _NUM_DOMAINS
+
+    proportions = _dirichlet_sample(rng, dirichlet_alpha, _NUM_DOMAINS)
+    rates = [_MIN_RATE + extra * p for p in proportions]
+
+    # Cap at _MAX_RATE and redistribute excess iteratively.
+    for _ in range(5):
+        overflow = 0.0
+        uncapped = []
+        for i in range(_NUM_DOMAINS):
+            if rates[i] > _MAX_RATE:
+                overflow += rates[i] - _MAX_RATE
+                rates[i] = _MAX_RATE
+            else:
+                uncapped.append(i)
+        if overflow <= 0 or not uncapped:
+            break
+        share = overflow / len(uncapped)
+        for i in uncapped:
+            rates[i] += share
+
+    return [round(r, 4) for r in rates]


 def generate_employees(*, run_seed, count, cfg=None):
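The spiky split above can be exercised standalone. This sketch mirrors `_dirichlet_sample` and `_distribute_rates` with the 4-domain count and the 1.0/10.0 rate bounds from this commit, using only the stdlib `random` module (the module-level names here are copies for illustration, not imports from the repo):

```python
import random

NUM_DOMAINS = 4      # research, inference, data_environment, training
MIN_RATE = 1.0
MAX_RATE = 10.0

def dirichlet_sample(rng, alpha, k):
    """Symmetric Dirichlet via normalized Gamma draws."""
    raw = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(raw)
    if total == 0:
        return [1.0 / k] * k
    return [x / total for x in raw]

def distribute_rates(rng, avg_rate, alpha=0.3):
    """Spread a budget of avg_rate * NUM_DOMAINS across domains, spikily."""
    extra = avg_rate * NUM_DOMAINS - NUM_DOMAINS * MIN_RATE
    if extra <= 0:
        return [MIN_RATE] * NUM_DOMAINS
    rates = [MIN_RATE + extra * p for p in dirichlet_sample(rng, alpha, NUM_DOMAINS)]
    for _ in range(5):  # cap at MAX_RATE and redistribute the overflow
        overflow, uncapped = 0.0, []
        for i in range(NUM_DOMAINS):
            if rates[i] > MAX_RATE:
                overflow += rates[i] - MAX_RATE
                rates[i] = MAX_RATE
            else:
                uncapped.append(i)
        if overflow <= 0 or not uncapped:
            break
        share = overflow / len(uncapped)
        for i in uncapped:
            rates[i] += share
    return [round(r, 4) for r in rates]

rng = random.Random(42)
rates = distribute_rates(rng, 5.0)  # a mid-tier average rate
```

With a small alpha (0.3) most of the extra budget lands on one or two domains, so two hires with the same average rate, and therefore the same salary band, can look completely different once assigned to tasks.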
@@ -56,12 +98,25 @@ def generate_employees(*, run_seed, count, cfg=None):
     if count <= 0:
         return []

-    employees = []
     streams = RngStreams(run_seed)

+    # Build and shuffle tier assignments.
+    tier_rng = streams.stream("tier_assignment")
+    seq_len = len(_TIER_SEQUENCE)
+    tiers = [_TIER_SEQUENCE[i % seq_len] for i in range(count)]
+    tier_rng.shuffle(tiers)
+
+    employees = []
     for idx in range(1, count + 1):
         rng = streams.stream(f"employee_{idx}")
-        tier_name = _pick_tier_name(rng, cfg)
+        tier_name = tiers[idx - 1]
         tier_cfg = _tier_by_name(cfg, tier_name)

+        # Sample average rate uniformly within the tier's range.
+        avg_rate = rng.uniform(tier_cfg.rate_min, tier_cfg.rate_max)
+
+        domain_rates = _distribute_rates(rng, avg_rate)
+        rates = dict(zip(_ALL_DOMAINS, domain_rates))
+
         employees.append(
             GeneratedEmployee(
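Replacing per-employee share sampling with a tiled, shuffled sequence makes the tier composition deterministic: a 10-person company always gets exactly 5 junior / 3 mid / 2 senior, whatever the seed, and only the assignment order varies. A stdlib-only sketch (a plain seeded `random.Random` stands in for the repo's `RngStreams`):

```python
import random
from collections import Counter

TIER_SEQUENCE = [
    "junior", "junior", "junior", "junior", "junior",
    "mid", "mid", "mid",
    "senior", "senior",
]

def assign_tiers(seed, count):
    """Tile the fixed 10-slot sequence over `count` employees, then shuffle."""
    tiers = [TIER_SEQUENCE[i % len(TIER_SEQUENCE)] for i in range(count)]
    random.Random(seed).shuffle(tiers)
    return tiers

ten = Counter(assign_tiers(7, 10))        # exactly the 5/3/2 mix
thirteen = Counter(assign_tiers(7, 13))   # wraps: the 3 extra slots are junior
```

Note the modular tiling biases overflow toward juniors (indices 0-4 of the sequence), which is the first-hires-are-cheap behaviour a seed-count above 10 would want.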
@@ -69,7 +124,7 @@ def generate_employees(*, run_seed, count, cfg=None):
                 work_hours_per_day=cfg.work_hours_per_day,
                 salary_cents=_sample_salary_cents(rng, cfg, tier_name),
                 tier=tier_name,
-                rates_by_domain=_sample_rates_by_domain(rng, cfg, tier_name),
+                rates_by_domain=rates,
             )
         )
     return employees
@@ -8,13 +8,11 @@ from ..config.sampling import sample_from_spec
 from ..config.schema import WorldConfig
 from ..db.models.company import Domain
 from .rng import RngStreams, sample_without_replacement
-from .task_catalog import pick_task_text


 @dataclass(frozen=True)
 class GeneratedTask:
     title: str
-    description: str
     required_prestige: int
     reward_funds_cents: int
     reward_prestige_delta: float
@@ -25,7 +23,7 @@ class GeneratedTask:
     deadline: datetime | None
     completed_at: datetime | None
     success: bool | None
-    halfway_event_emitted: bool
+    progress_milestone_pct: int
     requirements: dict[str, int]
@@ -71,18 +69,9 @@ def _sample_requirements(rng, cfg):
     return {domain: _sample_required_qty(rng, cfg) for domain in picked_domains}


-def _pick_title_desc(rng, primary_domain, serial):
-    title, description = pick_task_text(rng, primary_domain)
-    domain_str = primary_domain.value if hasattr(primary_domain, "value") else str(primary_domain)
-    title = f"{title} [{domain_str.upper()}-{serial}]"
-    return title, description
-
-
 def _make_task(rng, cfg, prestige, serial, requirements):
-    title, description = _pick_title_desc(rng, next(iter(requirements)), serial)
     return GeneratedTask(
-        title=title,
-        description=description,
+        title=f"Task-{serial}",
         required_prestige=prestige,
         reward_funds_cents=_sample_reward_funds_cents(rng, cfg, prestige=prestige),
         reward_prestige_delta=_sample_reward_prestige_delta(rng, cfg),
@@ -93,7 +82,7 @@ def _make_task(rng, cfg, prestige, serial, requirements):
         deadline=None,
         completed_at=None,
         success=None,
-        halfway_event_emitted=False,
+        progress_milestone_pct=0,
         requirements=requirements,
     )
@@ -122,7 +111,6 @@ def build_task_rows(*, run_seed, count, cfg=None):
     for task in generated:
         task_rows.append({
             "title": task.title,
-            "description": task.description,
             "required_prestige": task.required_prestige,
             "reward_funds_cents": task.reward_funds_cents,
             "reward_prestige_delta": task.reward_prestige_delta,
@@ -133,7 +121,7 @@ def build_task_rows(*, run_seed, count, cfg=None):
             "deadline": task.deadline,
             "completed_at": task.completed_at,
             "success": task.success,
-            "halfway_event_emitted": task.halfway_event_emitted,
+            "progress_milestone_pct": task.progress_milestone_pct,
         })
         for domain, qty in task.requirements.items():
             requirement_rows.append({
@@ -63,6 +63,7 @@ def _seed_employees(db, company, req):
             id=uuid4(),
             company_id=company.id,
             name=emp.name,
+            tier=emp.tier,
             work_hours_per_day=emp.work_hours_per_day,
             salary_cents=emp.salary_cents,
         )
@@ -86,7 +87,6 @@ def _seed_market_tasks(db, company, req):
             company_id=None,
             status=TaskStatus.MARKET,
             title=task.title,
-            description=task.description,
             required_prestige=task.required_prestige,
             reward_funds_cents=task.reward_funds_cents,
             reward_prestige_delta=task.reward_prestige_delta,
@@ -95,7 +95,7 @@ def _seed_market_tasks(db, company, req):
             deadline=None,
             completed_at=None,
             success=None,
-            halfway_event_emitted=False,
+            progress_milestone_pct=0,
         )
         db.add(task_row)
@ -1,365 +0,0 @@
|
|||
"""Realistic AI-startup task titles and descriptions, keyed by domain.
|
||||
|
||||
Each domain has a pool of (title, description) tuples. The generator picks
|
||||
from these deterministically using the seeded RNG, cycling if the pool is
|
||||
exhausted.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
from ..db.models.company import Domain
|
||||
|
||||
TASK_POOL: dict[Domain, list[tuple[str, str]]] = {
|
||||
Domain.SYSTEM: [
|
||||
(
|
||||
"Set Up GPU-Aware K8s Cluster with Auto-Scaling",
|
||||
"Deploy a Kubernetes cluster with NVIDIA GPU operator, node auto-scaling based on inference queue depth, and spot instance fallback for training workloads.",
|
||||
),
|
||||
(
|
||||
"Build CI/CD Pipeline for ML Model Registry",
|
||||
"Create a CI pipeline that runs training validation, pushes versioned model artifacts to a registry, and auto-deploys to a staging inference endpoint.",
|
||||
),
|
||||
(
|
||||
"Implement Blue-Green Deployment for LLM Serving",
|
||||
"Set up zero-downtime model swaps for a vLLM serving cluster with automated rollback triggered by latency and error-rate thresholds.",
|
||||
),
|
||||
(
|
||||
"Deploy Observability Stack for AI Workloads",
|
||||
"Stand up Grafana, Prometheus, and OpenTelemetry with custom dashboards tracking GPU utilization, token throughput, time-to-first-token, and per-request cost.",
|
||||
),
|
||||
(
|
||||
"Terraform Multi-Region Inference Infrastructure",
|
||||
"Write IaC modules to provision inference endpoints across 3+ regions with global load balancing, failover routing, and centralized logging.",
|
||||
),
|
||||
(
|
||||
"Container Image Optimization for ML Serving",
|
||||
"Reduce Docker image sizes for PyTorch/CUDA serving containers from 15 GB to under 4 GB using multi-stage builds and distroless bases to cut cold-start times.",
|
||||
),
|
||||
(
|
||||
"Implement Secret Rotation and API Key Management",
|
||||
"Build an automated secret rotation system for API keys, database credentials, and model provider tokens across staging and production environments.",
|
||||
),
|
||||
(
|
||||
"Set Up Cost Monitoring and GPU Budget Alerts",
|
||||
"Integrate cloud billing APIs with a dashboard showing per-team GPU spend, cost-per-inference breakdowns, and automated alerts when daily spend exceeds thresholds.",
|
||||
),
|
||||
(
|
||||
"Build Canary Release Pipeline for Embedding Models",
|
||||
"Implement a canary deployment system that gradually shifts traffic to new embedding model versions, comparing retrieval quality metrics in real time.",
|
||||
),
|
||||
(
|
||||
"Migrate Inference Workloads to Serverless GPU",
|
||||
"Evaluate and migrate bursty inference workloads to serverless GPU providers, benchmarking cold-start latency against always-on instances.",
|
||||
),
|
||||
(
|
||||
"Implement Disaster Recovery for Training Checkpoints",
|
||||
"Design a cross-region checkpoint backup system with automated integrity verification, ensuring training runs can resume within 15 minutes of any single-region failure.",
|
||||
),
|
||||
(
|
||||
"Build Internal Developer Platform for ML Engineers",
|
||||
"Create a self-service portal where ML engineers can request GPU instances, spin up Jupyter environments, and launch training jobs without touching infrastructure.",
|
||||
),
|
||||
],
|
||||
Domain.RESEARCH: [
|
||||
(
|
||||
"Design Benchmark for Legal Document QA",
|
||||
"Create a benchmark suite of 2,000+ annotated legal questions across contract law and compliance, with human-expert baselines and an automated evaluation harness.",
|
||||
),
|
||||
(
|
||||
"Investigate MoE Routing for Multilingual Models",
|
||||
"Research and prototype alternative Mixture-of-Experts routing strategies that improve expert utilization for low-resource languages without degrading high-resource performance.",
|
||||
),
|
||||
(
|
||||
"Reproduce and Extend Speculative Decoding Results",
|
||||
"Replicate speculative decoding paper results on Llama-3 class models, then test novel draft model architectures that improve acceptance rates on code generation.",
|
||||
),
|
||||
(
|
||||
"Develop RAG Hallucination Detection Framework",
|
||||
"Build a systematic evaluation pipeline measuring faithfulness, relevance, and attribution accuracy for retrieval-augmented generation systems.",
|
||||
),
|
||||
(
|
||||
"Prototype LoRA Merging for Multi-Tenant Serving",
|
||||
"Research methods for dynamically composing multiple LoRA adapters at inference time, measuring quality degradation versus serving separate fine-tuned models.",
|
||||
),
|
||||
(
|
||||
"Benchmark Long-Context Retrieval Across 128K Models",
|
||||
"Systematically evaluate needle-in-a-haystack and multi-hop reasoning performance across frontier models at various context lengths with reproducible results.",
|
||||
),
|
||||
(
|
||||
"Investigate Synthetic Data Quality for Code Generation",
|
||||
"Develop automated quality scoring methods for synthetically generated code training data, correlating filter thresholds with downstream model performance.",
|
||||
),
|
||||
(
|
||||
"Research KV-Cache Compression Techniques",
|
||||
"Prototype and benchmark KV-cache eviction and quantization strategies for long-running conversational agents under fixed memory budgets.",
|
||||
),
|
||||
(
|
||||
"Build Ablation Study Framework for Prompt Engineering",
|
||||
"Create an experimentation harness for testing prompt variations across multiple models and tasks with statistical significance testing and cost tracking.",
|
||||
),
|
||||
(
|
||||
"Explore Constitutional AI for Domain-Specific Safety",
|
||||
"Adapt constitutional AI methods to create a self-improving safety filter for a healthcare chatbot, defining domain-specific principles and measuring accuracy.",
|
||||
),
|
||||
(
|
||||
"Develop Novel Chunking Strategies for Technical RAG",
|
||||
"Research and benchmark alternative document chunking methods—semantic, AST-aware, sliding window—specifically for API documentation and code repositories.",
|
||||
),
|
||||
(
|
||||
"Prototype Test-Time Compute Scaling for Math Reasoning",
|
||||
"Implement best-of-N sampling, tree search, and self-verification approaches for math reasoning, measuring the compute-accuracy Pareto frontier.",
|
||||
),
|
||||
],
|
||||
Domain.DATA: [
|
||||
(
|
||||
"Build Web Scraping Pipeline for Industry News Corpus",
|
||||
"Design a pipeline that crawls 50+ AI/tech news sources daily, deduplicates articles, extracts structured metadata, and loads clean text into a vector store.",
|
||||
),
|
||||
(
|
||||
"Create Annotation Platform for Dialogue Quality",
|
||||
"Build an annotation workflow where human raters score LLM conversation logs on helpfulness, accuracy, and safety, with inter-rater agreement tracking.",
|
||||
),
|
||||
(
|
||||
"Implement PII Detection and Redaction Pipeline",
|
||||
"Deploy a pipeline to detect and redact personally identifiable information from training data, with audit logging and configurable redaction strategies.",
|
||||
),
|
||||
(
|
||||
"Curate Instruction-Tuning Dataset from Internal Docs",
|
||||
"Extract, clean, and convert 10,000+ pages of internal documentation into high-quality instruction-response pairs suitable for fine-tuning.",
|
||||
),
|
||||
(
|
||||
"Build Data Quality Monitoring for Feature Store",
|
||||
"Implement data validation checks on streaming feature pipelines, alerting on schema drift, null-rate spikes, and distribution shifts before they affect models.",
|
||||
),
|
||||
(
|
||||
"Design ETL Pipeline for Multi-Modal Training Data",
|
||||
"Build a DAG pipeline that ingests images, PDFs, and structured data, applies OCR and layout detection, and produces unified records for vision-language training.",
|
||||
),
|
||||
(
|
||||
"Implement Deduplication for Large Text Corpora",
|
||||
"Deploy MinHash LSH-based near-deduplication at scale for 100M+ documents with configurable similarity thresholds and a review UI for borderline cases.",
|
||||
),
|
||||
(
|
||||
"Build Synthetic Data Pipeline for Rare Edge Cases",
|
||||
"Create a system that uses frontier LLMs to generate realistic synthetic examples for underrepresented categories in a classification dataset.",
|
||||
),
|
||||
(
|
||||
"Create Data Versioning and Lineage Tracking System",
|
||||
"Set up data versioning integrated with the ML training pipeline so every model checkpoint can be traced back to the exact dataset snapshot used.",
|
||||
),
|
||||
(
|
||||
"Build Customer Feedback Loop into Training Pipeline",
|
||||
"Implement a system where end-user thumbs-up/down signals are routed, reviewed, and selectively incorporated into fine-tuning datasets with human approval.",
|
||||
),
|
||||
(
|
||||
"Migrate Legacy Warehouse to ML-Ready Lakehouse",
|
||||
"Transform and migrate 5 years of product analytics data from a legacy SQL warehouse into a Parquet-based lakehouse optimized for feature engineering.",
|
||||
),
|
||||
],
|
||||
Domain.FRONTEND: [
|
||||
(
|
||||
"Build Interactive LLM Playground with Streaming",
|
||||
            "Create a web app where users test multiple LLM providers side-by-side with streaming output, adjustable parameters, and conversation history persistence.",
        ),
        (
            "Design Admin Dashboard for AI Agent Monitoring",
            "Build a dashboard showing real-time agent execution traces, tool call sequences, token usage graphs, and cost breakdowns with drill-down filtering.",
        ),
        (
            "Create Document Chat Interface for RAG Product",
            "Implement a drag-and-drop document upload UI with a conversational interface showing source citations, confidence indicators, and reference highlighting.",
        ),
        (
            "Build Annotation Review and Approval Interface",
            "Design a UI for data team leads to review annotator work, resolve disagreements, view agreement stats, and approve batches for training inclusion.",
        ),
        (
            "Implement Prompt Management Studio",
            "Build a collaborative app where teams version, test, and A/B deploy prompt templates with visual diffs, rollback, and per-version performance analytics.",
        ),
        (
            "Create Customer-Facing AI Usage Analytics Dashboard",
            "Build an embeddable dashboard showing API call volumes, latency percentiles, token consumption, and cost trends for enterprise customers.",
        ),
        (
            "Build Visual Pipeline Editor for No-Code AI Workflows",
            "Create a node-based drag-and-drop editor where non-technical users chain data sources, LLM calls, and output actions into automated AI workflows.",
        ),
        (
            "Design Chat Widget for Website Embedding",
            "Build a lightweight, brandable chat widget under 50 KB that customers embed on their sites, with streaming responses and escalation-to-human capability.",
        ),
        (
            "Build Model Comparison Results Viewer",
            "Create a web interface displaying benchmark results across models in interactive tables and charts with filtering by task type and model size.",
        ),
        (
            "Implement Real-Time Collaboration for AI Writing Tool",
            "Add multiplayer editing to an AI writing tool using CRDTs, with per-user cursors, AI suggestion tracking, and version history.",
        ),
        (
            "Create Enterprise RAG Onboarding Wizard",
            "Build a step-by-step setup wizard guiding enterprise customers through connecting data sources, configuring chunking, testing retrieval, and deploying their endpoint.",
        ),
    ],
    Domain.BACKEND: [
        (
            "Build Multi-Tenant LLM Gateway with Rate Limiting",
            "Implement an API gateway that proxies requests to multiple LLM providers, enforces per-tenant rate limits, tracks usage, and handles automatic failover.",
        ),
        (
            "Implement OAuth2 + SAML SSO for Enterprise Platform",
            "Add enterprise authentication supporting SAML 2.0, OIDC, and SCIM provisioning for customers integrating with their identity provider.",
        ),
        (
            "Design Webhook System for Async AI Job Completion",
            "Build a reliable webhook delivery system with exponential backoff, signature verification, dead letter queue, and a webhook management API.",
        ),
        (
            "Create Unified Embedding API with Caching Layer",
            "Build a microservice abstracting over multiple embedding providers with a Redis-backed cache, batch processing, and automatic model version migration.",
        ),
        (
            "Build Conversation Memory Service for Multi-Session Agents",
            "Implement a service that stores, summarizes, and retrieves conversation history across sessions using structured storage and semantic vector search.",
        ),
        (
            "Implement Usage-Based Billing with Stripe Integration",
            "Build a metering system that tracks token consumption per customer, aggregates monthly invoices, and syncs with Stripe for automated usage-based charging.",
        ),
        (
            "Create Plugin Marketplace Backend",
            "Design the API and data model for a marketplace where third-party developers register, version, and distribute plugins for the AI platform.",
        ),
        (
            "Build RAG Ingestion Service with Chunking and Indexing",
            "Implement an async document processing service that accepts PDFs, DOCX, and HTML, chunks them, generates embeddings, and upserts into a vector store.",
        ),
        (
            "Implement Audit Logging and Compliance API",
            "Build a tamper-evident audit log system recording all AI interactions and admin actions, with an API for compliance queries and SOC 2 / HIPAA exports.",
        ),
        (
            "Design Multi-Model Routing and Fallback Service",
            "Create a smart routing layer directing requests to the optimal model based on task complexity, latency requirements, and cost, with provider failover.",
        ),
        (
            "Build File Processing Service for Vision-Language Models",
            "Implement an async service that accepts images and documents, runs them through vision-language models for extraction, and returns structured JSON output.",
        ),
        (
            "Implement Streaming API with Server-Sent Events",
            "Build an SSE-based streaming endpoint for LLM responses with connection resumption, partial response caching, and graceful degradation.",
        ),
    ],
    Domain.TRAINING: [
        (
            "Fine-Tune Llama-3 8B for Domain-Specific Support",
            "Run supervised fine-tuning on 50K curated customer support conversations using QLoRA, targeting 15% accuracy improvement over the base model.",
        ),
        (
            "Implement RLHF Pipeline for Code Generation Model",
            "Build an end-to-end RLHF pipeline with a reward model trained on human preference data and PPO training loop evaluated against HumanEval.",
        ),
        (
            "Distill GPT-4 Class Model into Efficient 3B Model",
            "Use knowledge distillation with synthetic data to create a compact model retaining 90%+ teacher performance on targeted tasks at 10x lower inference cost.",
        ),
        (
            "Train Custom Embedding Model for Vertical Search",
            "Fine-tune a sentence-transformers model on domain-specific query-document pairs with contrastive learning, hard negative mining, and retrieval benchmarks.",
        ),
        (
            "Build Hyperparameter Search for Fine-Tuning Jobs",
            "Implement an Optuna-based HPO system searching over learning rate, LoRA rank, batch size, and data mixing ratios with early stopping.",
        ),
        (
            "Run Continued Pre-Training on Proprietary Corpus",
            "Execute continued pre-training of a 7B base model on 10B tokens of domain-specific text with careful learning rate scheduling to avoid catastrophic forgetting.",
        ),
        (
            "Train Reward Model from Preference Annotations",
            "Collect and process 20K pairwise preference annotations, train a Bradley-Terry reward model, and validate calibration against held-out human judgments.",
        ),
        (
            "Build Multi-GPU Training Infra with DeepSpeed",
            "Set up distributed training using DeepSpeed ZeRO Stage 3 across an 8-node GPU cluster with checkpoint sharding and fault-tolerant resumption.",
        ),
        (
            "Implement DPO Fine-Tuning Pipeline",
            "Build a Direct Preference Optimization pipeline as a simpler RLHF alternative, comparing quality and training stability on the same preference dataset.",
        ),
        (
            "Train Vision-Language Adapter for Document Understanding",
            "Fine-tune a LoRA adapter on a VLM for extracting structured data from invoices, receipts, and forms with 95%+ field-level accuracy.",
        ),
        (
            "Build Eval-Driven Training Loop with Auto Checkpointing",
            "Implement a training harness that runs benchmarks every N steps, auto-saves the best checkpoint, detects instability, and alerts on loss spikes.",
        ),
        (
            "Fine-Tune Whisper for Industry-Specific Transcription",
            "Adapt Whisper-large for medical dictation using 500 hours of labeled audio, targeting 30% WER reduction on domain-specific terminology.",
        ),
    ],
    Domain.HARDWARE: [
        (
            "Optimize LLM Inference Latency with TensorRT-LLM",
            "Convert a 70B model to TensorRT-LLM with INT8/FP8 quantization, continuous batching, and paged attention, targeting sub-200ms time-to-first-token.",
        ),
        (
            "Deploy On-Device ML Model for Mobile Classification",
            "Convert a PyTorch vision model to Core ML and TFLite, optimize with quantization-aware training, and benchmark on iPhone and Pixel hardware.",
        ),
        (
            "Build GPU Cluster Scheduling with Fair-Share Queuing",
            "Implement a scheduler for a shared GPU cluster enforcing per-team quotas, priority queuing, preemption policies, and utilization-based chargeback.",
        ),
        (
            "Implement Quantization Pipeline (GPTQ/AWQ/GGUF)",
            "Build an automated pipeline that takes any model, produces GPTQ, AWQ, and GGUF quantized variants, runs quality regression, and publishes passing models.",
        ),
        (
            "Deploy Edge Inference for Real-Time Video Analytics",
            "Set up an NVIDIA Jetson-based inference node running YOLO and a lightweight LLM for on-premises real-time camera analysis with local data processing.",
        ),
        (
            "Optimize vLLM Serving for Production Workload",
            "Profile and tune vLLM parameters—max batch size, KV cache, swap space, tensor parallelism—for target throughput at P99 latency SLA.",
        ),
        (
            "Build Multi-GPU Inference with Tensor Parallelism",
            "Configure and benchmark a 70B+ model serving across 4-8 GPUs with tensor and pipeline parallelism, optimizing throughput versus latency tradeoffs.",
        ),
        (
            "Implement Dynamic Batching for Inference Requests",
            "Build a request batching layer that groups incoming requests by sequence length and priority, maximizing GPU utilization within per-request latency SLAs.",
        ),
        (
            "Design Hybrid CPU/GPU Inference Architecture",
            "Architect a system routing lightweight requests to CPU inference and complex requests to GPU instances, reducing overall compute cost by 40%.",
        ),
        (
            "Set Up Triton Inference Server for Multi-Model Serving",
            "Deploy NVIDIA Triton to serve embedding, reranking, and generation models on shared GPU infrastructure with dynamic batching and concurrency control.",
        ),
        (
            "Build GPU Health Monitoring and Failover System",
            "Implement a daemon detecting GPU memory errors, thermal throttling, and NVLink degradation, automatically draining affected nodes and redistributing workloads.",
        ),
        (
            "Benchmark Specialized AI Accelerators vs H100",
            "Evaluate Groq, Cerebras, and custom ASICs against H100 GPUs, producing a cost-per-token and latency comparison with a migration recommendation.",
        ),
        (
            "Implement Speculative Decoding in Production Stack",
            "Integrate speculative decoding with a small draft model into the existing serving infrastructure, measuring real-world throughput improvement.",
        ),
    ],
}


def pick_task_text(rng, domain: Domain) -> tuple[str, str]:
    """Deterministically pick a (title, description) for *domain* using *rng*."""
    pool = TASK_POOL[domain]
    idx = rng.randint(0, len(pool) - 1)
    return pool[idx]
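Because the pick depends only on the caller-supplied `rng`, two identically seeded generators always produce the same task sequence, which is what keeps the benchmark deterministic. A minimal self-contained sketch (using a stand-in pool and picker, since `TASK_POOL` and `Domain` live in this module):

```python
import random

# Stand-in for TASK_POOL[domain]; the real pools are the tuples defined above.
pool = [("Task A", "Desc A"), ("Task B", "Desc B"), ("Task C", "Desc C")]

def pick(rng: random.Random) -> tuple[str, str]:
    # Mirrors pick_task_text: randint is inclusive on both endpoints,
    # so every pool entry is reachable.
    return pool[rng.randint(0, len(pool) - 1)]

# Same seed -> same deterministic sequence of picks.
picks_a = [pick(random.Random(7).__class__(7)) for _ in range(1)]  # one-off pick
rng1, rng2 = random.Random(42), random.Random(42)
assert [pick(rng1) for _ in range(5)] == [pick(rng2) for _ in range(5)]
```

Threading one `Random` instance through the whole run (rather than reseeding per call) is what lets replays of the same seed reproduce the entire market history.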