Merge pull request #7 from collinear-ai/feat/employee_tiers

Feat/employee tiers
Adit Jain 2026-03-07 22:04:45 -08:00 committed by GitHub
commit b1cd7ebfb2
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
19 changed files with 7244 additions and 5336 deletions

BIN
.DS_Store vendored

Binary file not shown.


@ -56,8 +56,8 @@ bash scripts/run_benchmark.sh --seed 1 --config hard
### Core loop
1. Agent calls `yc-bench sim resume` to advance time to the next event.
2. The engine flushes task progress, fires due events, applies payroll.
1. Agent calls `yc-bench sim resume` to advance time to the next event or monthly payroll.
2. The engine flushes task progress, applies prestige decay, fires due events, applies payroll.
3. Agent reads wake events and decides: accept tasks, assign employees, dispatch, cancel.
4. Repeat until bankruptcy or horizon end.
@ -65,12 +65,14 @@ The simulation ends on **bankruptcy** (funds < 0 after payroll), **horizon end**
### Key mechanics
- **Funds**: start at $250K. Monthly payroll is deducted automatically. Task rewards scale with prestige (`base × (1 + 0.55 × (prestige − 1))`).
- **Funds**: starting capital varies by preset ($80K–$250K). Monthly payroll is deducted automatically. Task rewards scale with prestige (`base × (1 + scale × (prestige − 1))`).
- **4 domains**: `research · inference · data/environment · training`. Each domain tracks prestige independently in [1.0, 10.0].
- **Prestige gating**: tasks require a minimum prestige level. Most tasks need prestige 3–5, so the agent must climb from 1.0 by completing easier tasks first. First 10 market tasks are stratified `[1,1,1,1,2,2,2,3,3,4]` to bootstrap progression.
- **Per-domain prestige gating**: a task's required prestige is checked against **each** of its required domains. The agent must climb prestige broadly, not just in one domain.
- **Prestige decay**: every domain loses prestige daily. Neglected domains decay back toward 1.0. The agent must stay active across domains to maintain market access.
- **Prestige-scaled work volume**: higher-prestige tasks require proportionally more work. Higher prestige pays more but demands more capacity.
- **Employees**: 10 employees across 3 tiers (junior/mid/senior). The agent sees only each employee's tier and salary — not their per-domain skill rates. A junior can secretly be a superstar in one domain, so the agent must infer productivity from task progress observations.
- **Throughput splitting**: an employee assigned to N active tasks has `effective_rate = base_rate / N`. Focus beats breadth.
- **Task success**: on-time completion awards funds + prestige + skill boosts + 1% salary bump (compounding payroll pressure). Late completion penalises prestige (1.4×). Cancellation penalises harder (2.0×).
- **Task success**: on-time completion awards funds + prestige + skill boosts + 1% salary bump (compounding payroll pressure). Late completion penalises prestige. Cancellation penalises harder.
- **Progress checkpoints**: the agent is woken at 25%, 50%, 75%, and 100% completion — providing data points to estimate employee productivity.
- **Scratchpad**: persistent notes in the DB that survive context truncation (only last 20 conversation rounds are kept).
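
The reward and throughput rules in the bullets above can be sketched as follows. This is a minimal illustration, not the engine's code: `scaled_reward_cents` and `effective_rate` are hypothetical names, and the `scale=0.55` default is taken from `reward_prestige_scale` in `default.toml`.

```python
def scaled_reward_cents(base_cents: int, prestige: float, scale: float = 0.55) -> int:
    """Hypothetical helper: reward = base * (1 + scale * (prestige - 1))."""
    return round(base_cents * (1 + scale * (prestige - 1)))


def effective_rate(base_rate: float, active_tasks: int) -> float:
    """An employee's rate is split evenly across their N active tasks."""
    if active_tasks < 1:
        raise ValueError("active_tasks must be >= 1")
    return base_rate / active_tasks


# A $14,000 base reward at prestige 1 and prestige 3.
print(scaled_reward_cents(1_400_000, prestige=1.0))
print(scaled_reward_cents(1_400_000, prestige=3.0))
# One employee spread over three tasks works at a third of their rate on each.
print(effective_rate(6.0, active_tasks=3))
```

Because the split is linear, assigning the same employee to more tasks never increases total output; it only delays every task they touch.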
@ -92,7 +94,7 @@ yc-bench report monthly # P&L per month
yc-bench task accept --task-id UUID # pull from market
yc-bench task assign --task-id UUID --employee-id UUID
yc-bench task dispatch --task-id UUID # start work
yc-bench task cancel --task-id UUID --reason "" # cancel (2× prestige penalty)
yc-bench task cancel --task-id UUID --reason "" # cancel (prestige penalty)
yc-bench sim resume # advance time
yc-bench scratchpad write/append/clear # persistent memory
```
@ -103,13 +105,15 @@ yc-bench scratchpad write/append/clear # persistent memory
Experiment presets live in `src/yc_bench/config/presets/` as TOML files. Pass the preset name via `--config`.
| Config | Employees | Tasks | Tests |
|--------|-----------|-------|-------|
| **tutorial** | 3 | 50 | Basic accept→assign→dispatch loop |
| **easy** | 5 | 100 | Throughput awareness |
| **medium** | 5 | 150 | Prestige climbing + domain specialization |
| **hard** | 7 | 200 | Precise ETA reasoning |
| **nightmare** | 8 | 300 | Sustained perfection under compounding payroll |
All presets use 10 employees and 200 market tasks. Difficulty comes from deadline pressure, penalty severity, prestige distribution, and task size.
| Config | Deadline pressure | Prestige mode | What it tests |
|--------|------------------|---------------|---------------|
| **tutorial** | Very relaxed | 1 | Basic accept→assign→dispatch loop |
| **easy** | Relaxed | 1 | Throughput awareness |
| **medium** | Moderate | 3 | Prestige climbing + domain specialization |
| **hard** | Tight | 4 | Precise ETA reasoning + capacity planning |
| **nightmare** | Razor-thin | 5 | Sustained perfection under compounding payroll |
See `default.toml` for the full list of tunable parameters.
@ -117,44 +121,7 @@ See `default.toml` for the full list of tunable parameters.
## Benchmark results
### Sonnet 4.6 vs Gemini 3 Flash vs GPT-5.2 — 1-year horizon, 3 seeds per config
![3-model comparison](plots/sonnet_vs_gemini.png)
#### Survival rates
| Config | Sonnet 4.6 | Gemini 3 Flash | GPT-5.2 |
|--------|-----------|----------------|---------|
| **medium** | 3/3 | 3/3 | 3/3 |
| **hard** | 1/3 | 2/3 | 2/3 |
| **nightmare** | 1/3 | 3/3 | 2/3 |
#### Final funds (bankrupt = funds < 0)
| Config | Seed | Sonnet 4.6 | Gemini 3 Flash | GPT-5.2 |
|--------|------|-----------|----------------|---------|
| medium | 1 | **$9.1M** | **$9.5M** | **$1.8M** |
| medium | 2 | **$6.1M** | **$11.0M** | **$321K** |
| medium | 3 | **$107K** | **$15.8M** | **$28K** |
| hard | 1 | bankrupt | bankrupt | bankrupt |
| hard | 2 | **$63K** | **$412K** | **$15.7M** |
| hard | 3 | bankrupt | **$21.9M** | **$43.5M** |
| nightmare | 1 | bankrupt | **$2.1M** | bankrupt |
| nightmare | 2 | **$10.1M** | **$214K** | **$2.2M** |
| nightmare | 3 | bankrupt | **$805K** | **$23.6M** |
**Overall: Gemini 8/9 · GPT-5.2 7/9 · Sonnet 5/9**
#### Key findings
- **Gemini leads on consistency** (8/9 survival). The only model to sweep all 3 nightmare seeds.
- **GPT-5.2 has the highest ceiling.** Hard seed 3: $43.5M vs Gemini's $21.9M. When it survives, it tends to outperform by a wide margin.
- **Sonnet is high-variance.** Nightmare seed 2: $10.1M (best nightmare result), but 4/9 bankruptcies overall.
- **Win rate predicts survival.** Every run with >58% task win rate survived. Every run below 40% went bankrupt.
#### Prestige specialization
![Prestige radar](plots/prestige_radar.png)
*Results pending — re-running benchmarks with updated economics.*
---

Binary file not shown (127 KiB).

Binary file not shown (191 KiB).

File diff suppressed because it is too large

File diff suppressed because it is too large


@ -346,6 +346,7 @@ def run_bot(config_name: str, seed: int, bot_slug: str, strategy_fn: StrategyFn)
replacement = generate_replacement_task(
run_seed=sim_state.run_seed,
replenish_counter=counter,
replaced_prestige=best_task.required_prestige,
cfg=world_cfg,
)
replacement_row = Task(


@ -26,7 +26,7 @@ DEFAULT_RUNS = [
{"label": "kimi-k2.5", "model_slug": "openrouter_moonshotai_kimi-k2.5", "color": "#2ecc71"},
]
INITIAL_FUNDS_CENTS = 25_000_000 # $250K
INITIAL_FUNDS_CENTS = 15_000_000 # $150K (default; presets may override)
def parse_args():
@ -129,7 +129,7 @@ def make_plot(run_data, seed, config_name, budget_usd, out_path: Path):
# ── Funds curves ─────────────────────────────────────────────────────────
ax_funds.axhline(0, color="#e74c3c", linewidth=0.9, linestyle="--", alpha=0.4, zorder=1)
ax_funds.axhline(250_000, color="#555577", linewidth=0.7, linestyle=":", alpha=0.6, zorder=1)
ax_funds.axhline(INITIAL_FUNDS_CENTS / 100, color="#555577", linewidth=0.7, linestyle=":", alpha=0.6, zorder=1)
for r in run_data:
if not r["times"]:


@ -88,6 +88,7 @@ def task_accept(
replacement = generate_replacement_task(
run_seed=sim_state.run_seed,
replenish_counter=counter,
replaced_prestige=task.required_prestige,
cfg=_get_world_cfg(),
)


@ -53,7 +53,7 @@ company_name = "BenchCo"
[world]
num_employees = 10
initial_funds_cents = 25_000_000 # $250,000
initial_funds_cents = 15_000_000 # $150,000
initial_prestige_level = 1.0
work_hours_per_day = 9.0
@ -78,9 +78,9 @@ penalty_cancel_multiplier = 2.0 # hardened: was 1.2
reward_prestige_scale = 0.55 # hardened: was 0.3
# Daily prestige decay per domain. Domains not exercised lose prestige
# over time: -0.01/day → -0.3/month. Untouched domain drops ~1 level
# every ~3 months. Prevents single-domain hyper-specialization.
prestige_decay_per_day = 0.01
# over time: -0.005/day → -0.15/month. Untouched domain drops ~1 level
# every ~6 months. Prevents single-domain hyper-specialization.
prestige_decay_per_day = 0.005
# Required qty scaling by prestige: qty *= 1 + scale * (prestige - 1).
# At 0.3: prestige-5 tasks need 2.2x the work of prestige-1 tasks.
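
The decay and work-volume comments above can be checked with a short sketch. It assumes decay is linear per day and floored at a minimum prestige of 1.0, as the comments suggest; `decayed_prestige` and `scaled_qty` are hypothetical names, not functions from the codebase.

```python
import math


def decayed_prestige(prestige: float, days: int,
                     decay_per_day: float = 0.005,
                     prestige_min: float = 1.0) -> float:
    """Linear daily decay, floored at prestige_min (assumed behaviour)."""
    return max(prestige_min, prestige - decay_per_day * days)


def scaled_qty(base_qty: float, prestige: float, qty_scale: float = 0.3) -> float:
    """qty *= 1 + scale * (prestige - 1), as stated in the config comment."""
    return base_qty * (1 + qty_scale * (prestige - 1))


# An untouched domain at prestige 3.0 loses one level after 200 days at 0.005/day.
print(decayed_prestige(3.0, days=200))
# A prestige-5 task needs 2.2x the work of the same task at prestige 1.
print(scaled_qty(1000, prestige=5))
assert math.isclose(scaled_qty(1000, 5) / scaled_qty(1000, 1), 2.2)
```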
@ -90,7 +90,7 @@ prestige_qty_scale = 0.3
# --- Deadline ---
# Deadline = max(deadline_min_biz_days, max_domain_qty / deadline_qty_per_day).
# Domains are worked in parallel, so deadline scales with heaviest domain, not sum.
deadline_qty_per_day = 150.0
deadline_qty_per_day = 200.0
deadline_min_biz_days = 7
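
As a rough illustration of the deadline rule in the comment above, the sketch below takes the heaviest domain's quantity and applies the minimum. `deadline_biz_days` is a hypothetical name, and whether the engine rounds the result to whole business days is not shown in this diff, so no rounding is applied here.

```python
def deadline_biz_days(domain_qty: dict[str, float],
                      deadline_qty_per_day: float = 200.0,
                      deadline_min_biz_days: int = 7) -> float:
    """Deadline = max(min_biz_days, max_domain_qty / qty_per_day)."""
    heaviest = max(domain_qty.values())  # domains are worked in parallel
    return max(deadline_min_biz_days, heaviest / deadline_qty_per_day)


# Two-domain task: only the heavier domain (2000) drives the deadline.
print(deadline_biz_days({"research": 2000, "training": 800}))
# Small task: the 7-day minimum applies.
print(deadline_biz_days({"inference": 800}))
```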
# --- Progress milestones (checkpoint events at these completion fractions) ---
@ -120,12 +120,12 @@ high = 10
mode = 4 # hardened: base default is mode=1
# Base reward paid on task completion, in cents (scaled further by prestige).
# Higher-prestige tasks automatically pay more via reward_prestige_scale.
# Mode $14K: prestige-1 tasks burn cash, prestige-3 breaks even, prestige-4+ profits.
[world.dist.reward_funds_cents]
type = "triangular"
low = 500_000 # $5,000
high = 10_000_000 # $100,000
mode = 3_000_000 # $30,000
low = 300_000 # $3,000
high = 4_000_000 # $40,000
mode = 1_400_000 # $14,000
# Number of domains each task requires work in (cast to int after sampling).
# mode=2: most tasks need 2 domains — single-specialist dominance gone.
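
For intuition about the new reward range, the sketch below draws from a triangular distribution with the updated parameters. It assumes `type = "triangular"` means the standard triangular distribution and uses the standard library's `random.triangular` in place of the project's own seeded streams; `sample_reward_cents` is a hypothetical name.

```python
import random


def sample_reward_cents(rng: random.Random,
                        low: int = 300_000,
                        high: int = 4_000_000,
                        mode: int = 1_400_000) -> int:
    """Draw one base reward in cents, cast to int after sampling."""
    return int(rng.triangular(low, high, mode))


rng = random.Random(1)  # fixed seed so repeated runs give the same draws
samples = [sample_reward_cents(rng) for _ in range(5)]
print([f"${cents / 100:,.0f}" for cents in samples])
```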
@ -139,9 +139,9 @@ mode = 2 # hardened: base default is mode=1
# No trivially-small tasks: every task requires sustained employee-hours.
[world.dist.required_qty]
type = "triangular"
low = 500 # hardened: base default is 200
high = 3000
mode = 1400 # hardened: base default is 800
low = 800 # hardened: base default is 200
high = 4000
mode = 2000 # hardened: base default is 800
# Prestige delta awarded per domain on task success.
# Mean ~0.1: climbing prestige 1→5 takes ~40 tasks.


@ -28,10 +28,11 @@ horizon_years = 1
auto_advance_after_turns = 8
[world]
initial_funds_cents = 20_000_000 # $200,000
# Inherits num_employees=10, num_market_tasks=200 from default.
# Moderate deadlines: 60 qty/day → ~12 day deadline. Comfortable with 3–4 tasks.
deadline_qty_per_day = 60.0
# Moderate deadlines: 100 qty/day → 10-day deadline for mode task.
deadline_qty_per_day = 100.0
# Original (un-hardened) penalties — costly but not catastrophic.
penalty_fail_multiplier = 0.8
@ -55,6 +56,6 @@ value = 1 # Single-domain — the test is about throughput, not assignmen
[world.dist.required_qty]
type = "triangular"
low = 300
high = 1500
mode = 700 # Moderate size — a few days of focused work each.
low = 500
high = 2000
mode = 1000 # Larger tasks — must stay focused, no excessive parallelism.


@ -40,12 +40,13 @@ horizon_years = 1
auto_advance_after_turns = 10
[world]
initial_funds_cents = 10_000_000 # $100,000 — must reach prestige 3 by month 5
# Inherits num_employees=10, num_market_tasks=200 from default.
# Tight deadlines: 1200/150 = 8 days.
# 1 task with 5 per domain → 5.8 days. OK.
# 2 concurrent tasks → 11.6 days. Miss.
deadline_qty_per_day = 150.0
# Tight deadlines: 2000/220 = 9.1 days.
# 1 task with 5 per domain → 8.7 days. Just fits.
# 2 concurrent tasks → 17.4 days. Guaranteed miss.
deadline_qty_per_day = 220.0
# Stiff penalties — mistakes cost real prestige.
penalty_fail_multiplier = 1.4
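
The capacity arithmetic in the comments above can be reproduced with a small sketch. The team throughput of roughly 230 qty/day for five employees per domain is an assumption back-derived from the comment (2000 qty in about 8.7 days), not a value read from the engine, and `deadline_days` and `days_to_finish` are hypothetical helper names.

```python
# Assumed throughput of five employees on one domain, back-derived from the
# preset comment "2000 qty -> 8.7 days"; not a value taken from the engine.
ASSUMED_TEAM_QTY_PER_DAY = 230.0


def deadline_days(mode_qty: float, deadline_qty_per_day: float) -> float:
    """Deadline for a mode-sized task (the 7-day minimum is ignored here)."""
    return mode_qty / deadline_qty_per_day


def days_to_finish(required_qty: float, concurrent_tasks: int = 1,
                   team_qty_per_day: float = ASSUMED_TEAM_QTY_PER_DAY) -> float:
    """Throughput is split evenly, so N concurrent tasks each take N times longer."""
    return required_qty * concurrent_tasks / team_qty_per_day


deadline = deadline_days(2000, 220.0)
for n in (1, 2):
    needed = days_to_finish(2000, concurrent_tasks=n)
    verdict = "fits" if needed <= deadline else "misses"
    print(f"{n} task(s): {needed:.1f} days vs {deadline:.1f}-day deadline -> {verdict}")
```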
@ -71,6 +72,6 @@ mode = 2 # Most tasks need 2 domains.
[world.dist.required_qty]
type = "triangular"
low = 500
high = 2500
mode = 1200 # Large tasks — require sustained focus.
low = 1000
high = 4000
mode = 2000 # Large tasks — each takes ~9 days with full team. No parallelism.


@ -38,10 +38,10 @@ auto_advance_after_turns = 8
[world]
# Inherits num_employees=10, num_market_tasks=200 from default.
# Deadline uses max per-domain qty. 900/100 = 9 days.
# 2 concurrent tasks: 5 per task → 4.3 days each. Manageable.
# 3 concurrent tasks: 3.3 per task → 6.6 days. Risky.
deadline_qty_per_day = 100.0
# Deadline uses max per-domain qty. 1500/150 = 10 days.
# 1 task with 5 per domain → 6.5 days. Comfortable.
# 2 concurrent tasks → 13 days. Miss.
deadline_qty_per_day = 150.0
# Real penalties — failing costs prestige, cancelling costs more.
penalty_fail_multiplier = 1.0
@ -67,6 +67,6 @@ mode = 2 # Most tasks need 2 domains.
[world.dist.required_qty]
type = "triangular"
low = 400
high = 2000
mode = 900 # Moderate work, completable in 7–12 days with focus.
low = 700
high = 3000
mode = 1500 # Larger tasks — ~6.5 days with full team, no parallelism.


@ -49,12 +49,13 @@ horizon_years = 1
auto_advance_after_turns = 10
[world]
initial_funds_cents = 8_000_000 # $80,000 — razor-thin runway
# Inherits num_employees=10, num_market_tasks=200 from default.
# Razor deadlines: 1600/200 = 8 days.
# 1 task with 5 per domain → 7.7 days. Barely makes it.
# 2 concurrent tasks → guaranteed miss.
deadline_qty_per_day = 200.0
# Razor deadlines: 2500/220 = 11.4 days.
# 1 task with 5 per domain → 10.9 days. Barely fits.
# 2 concurrent tasks → 21.8 days. Guaranteed miss.
deadline_qty_per_day = 220.0
# Catastrophic penalties — there is no good exit from a bad accept.
penalty_fail_multiplier = 2.0
@ -81,9 +82,9 @@ mode = 2 # Mostly 2-domain, some 3-domain.
[world.dist.required_qty]
type = "triangular"
low = 600
high = 3000
mode = 1600 # Large work volumes — no quick wins.
low = 1200
high = 5000
mode = 2500 # Massive work volumes — each task consumes the full team.
# Slightly larger prestige gains than default (~0.13 avg) to make
# climbing feasible despite the steep penalty. But one blown task


@ -28,10 +28,11 @@ horizon_years = 1
auto_advance_after_turns = 5
[world]
initial_funds_cents = 25_000_000 # $250,000 — very forgiving buffer
# Inherits num_employees=10, num_market_tasks=200 from default.
# Very generous deadlines: 30 qty/day → most tasks get 13+ day deadline.
deadline_qty_per_day = 30.0
# Generous deadlines: 50 qty/day → mode task gets 12-day deadline.
deadline_qty_per_day = 50.0
# Negligible penalties — mistakes barely hurt.
penalty_fail_multiplier = 0.3
@ -53,6 +54,6 @@ value = 1 # ALL tasks single-domain — trivial assignment.
[world.dist.required_qty]
type = "triangular"
low = 200
high = 800
mode = 400 # Small tasks, quick completions.
low = 300
high = 1200
mode = 600 # Moderate tasks, comfortable with focused execution.


@ -39,7 +39,7 @@ class WorldDists(BaseModel):
)
# Base reward paid on task completion, in cents (result cast to int).
reward_funds_cents: DistSpec = Field(
default_factory=lambda: TriangularDist(low=500_000, high=10_000_000, mode=3_000_000)
default_factory=lambda: TriangularDist(low=300_000, high=4_000_000, mode=1_400_000)
)
# Number of domains required per task (result cast to int).
domain_count: DistSpec = Field(
@ -105,7 +105,7 @@ class SimConfig(BaseModel):
class WorldConfig(BaseModel):
# --- Workforce ---
num_employees: int = 10
initial_funds_cents: int = 25_000_000 # $250,000
initial_funds_cents: int = 15_000_000 # $150,000
initial_prestige_level: float = 1.0
work_hours_per_day: float = 9.0
@ -128,7 +128,7 @@ class WorldConfig(BaseModel):
# Daily prestige decay per domain. Domains not exercised lose prestige
# over time: -0.01/day → -0.3/month → untouched domain drops ~1 level
# every ~3 months. Floored at prestige_min.
prestige_decay_per_day: float = 0.01
prestige_decay_per_day: float = 0.005
# Required qty scaling by prestige: qty *= 1 + prestige_qty_scale * (prestige - 1).
# At 0.3: prestige-5 tasks need 2.2× the work of prestige-1 tasks.


@ -178,6 +178,13 @@ def advance_time(
result.payrolls_applied += 1
payroll_idx += 1
# Report payroll as a wake event so the agent gets control back
company = db.query(Company).filter(Company.id == company_id).one()
result.wake_events.append({
"type": "monthly_payroll",
"funds_after": company.funds_cents,
})
if bankrupt:
# Insert bankruptcy event at this time
insert_event(
@ -188,7 +195,9 @@ def advance_time(
dedupe_key=f"bankruptcy:{current_time.isoformat()}",
)
result.bankrupt = True
break
# Always stop at payroll — gives the agent a chance to act
break
elif action_type == "event":
event_result = dispatch_event(db, next_event, current_time, company_id)
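
On the consuming side, the new wake event could be handled along these lines. This is a sketch under assumptions: only the `type` and `funds_after` keys come from the hunk above, while `summarize_wake_events` and the shape of other event dicts are hypothetical.

```python
def summarize_wake_events(wake_events: list[dict]) -> list[str]:
    """Turn raw wake-event dicts into short human-readable lines."""
    lines = []
    for event in wake_events:
        if event.get("type") == "monthly_payroll":
            # funds_after is in cents, matching company.funds_cents.
            dollars = event["funds_after"] / 100
            lines.append(f"payroll applied, funds now ${dollars:,.2f}")
        else:
            lines.append(f"event: {event.get('type', 'unknown')}")
    return lines


print(summarize_wake_events([
    {"type": "monthly_payroll", "funds_after": 12_345_600},
    {"type": "task_completed"},
]))
```

Since the loop now breaks after every payroll, a caller should expect `sim resume` to return once per month even when no task events are due.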


@ -87,7 +87,7 @@ class Task(Base):
class TaskRequirement(Base):
__tablename__ = "task_requirements"
__table_args__ = (
CheckConstraint("required_qty >= 200 AND required_qty <= 3000", name="ck_task_requirements_required_qty_range"),
CheckConstraint("required_qty >= 200 AND required_qty <= 25000", name="ck_task_requirements_required_qty_range"),
CheckConstraint("completed_qty >= 0", name="ck_task_requirements_completed_qty_gte_0"),
CheckConstraint("completed_qty <= required_qty", name="ck_task_requirements_completed_qty_lte_required_qty"),
)


@ -27,10 +27,9 @@ class GeneratedTask:
requirements: dict[str, int]
# First 10 market tasks are given explicit prestige values to guarantee a
# climbable ladder from the start (avoids runs where all early tasks need
# prestige 4+ before any are completable).
_STRATIFIED_PRESTIGE = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
# First 10 market tasks are forced to prestige 1 to guarantee a
# bootstrapping path regardless of the prestige distribution.
_STRATIFIED_PRESTIGE = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
_ALL_DOMAINS = list(Domain)
@ -134,14 +133,14 @@ def build_task_rows(*, run_seed, count, cfg=None):
return task_rows, requirement_rows
def generate_replacement_task(*, run_seed, replenish_counter, cfg=None):
def generate_replacement_task(*, run_seed, replenish_counter, replaced_prestige, cfg=None):
"""Generate a replacement task with the same prestige as the accepted task."""
if cfg is None:
cfg = WorldConfig()
streams = RngStreams(run_seed)
rng = streams.stream(f"replenish_{replenish_counter}")
prestige = _sample_required_prestige(rng, cfg)
requirements = _sample_requirements(rng, cfg, prestige=prestige)
return _make_task(rng, cfg, prestige, serial=replenish_counter, requirements=requirements)
requirements = _sample_requirements(rng, cfg, prestige=replaced_prestige)
return _make_task(rng, cfg, replaced_prestige, serial=replenish_counter, requirements=requirements)
__all__ = [