Add leaderboard visualization and documentation files

AlxAI committed 2025-11-17 23:01:25 -05:00
parent aeda029a59
commit 0e66c19b15
4 changed files with 640 additions and 0 deletions

@@ -0,0 +1,74 @@
# Diplomacy Benchmark Guide
## Single Model Benchmark
```bash
# Production: 20 games to 1925
python run_benchmark.py --model_id "openai:gpt-4o" --friendly_name "gpt_4o"
# Test: 3 games to 1901
python run_benchmark.py --model_id "openai:gpt-4o" --friendly_name "test" --test
# Baseline only
python run_benchmark.py --model_id "..." --friendly_name "..." --baseline-only
# Aggressive only
python run_benchmark.py --model_id "..." --friendly_name "..." --aggressive-only
```
## Queue Multiple Models
Edit `run_benchmark_queue.sh`:
```bash
MODELS=(
"model_id|friendly_name|baseline_only|aggressive_only"
"openai:gpt-4o|gpt_4o||" # Both modes
"gemini-2.5-flash|gemini|true|" # Baseline only
"claude-3.5-sonnet|claude||true" # Aggressive only
)
```
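The queue script itself is bash, but the pipe-delimited entry format is easy to picture in code. A minimal Python sketch of how one entry could be parsed (the function name and the `== "true"` convention are assumptions based on the header comment, not the script's actual code):

```python
def parse_queue_entry(entry: str) -> dict:
    # Hypothetical parser mirroring the documented field order:
    # model_id|friendly_name|baseline_only|aggressive_only
    model_id, friendly_name, baseline_only, aggressive_only = entry.split("|")
    return {
        "model_id": model_id,
        "friendly_name": friendly_name,
        "baseline_only": baseline_only == "true",
        "aggressive_only": aggressive_only == "true",
    }

# Entries from the example above: empty fields mean "run both modes"
assert parse_queue_entry("openai:gpt-4o|gpt_4o||")["baseline_only"] is False
assert parse_queue_entry("gemini-2.5-flash|gemini|true|")["baseline_only"] is True
assert parse_queue_entry("claude-3.5-sonnet|claude||true")["aggressive_only"] is True
```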
Run:
```bash
./run_benchmark_queue.sh
```
## Check Status
```bash
# Queue progress
tail -f /tmp/benchmark_queue/queue.log
# Individual model logs
tail -f /tmp/benchmark_queue/{friendly_name}_{baseline|aggressive}.log
# Running games (check FRANCE progress)
ls -la results/{friendly_name}_{baseline|aggressive}/games/*/
```
## Visualize Results
```bash
# Generate comparison plots
python leaderboard/full_comparison.py
# View plots
open leaderboard/*.png
```
## Key Locations
| Item | Path |
|------|------|
| **Results** | `results/{friendly_name}_{baseline\|aggressive}/` |
| **Leaderboard Links** | `leaderboard/{friendly_name}-{baseline\|aggressive}` |
| **Queue Logs** | `/tmp/benchmark_queue/` |
| **Prompts Baseline** | `prompts_benchmark/` |
| **Prompts Aggressive** | `prompts_hold_reduction_v3/` |
## Notes
- Test model plays FRANCE (position 2, 0-indexed)
- Opponents: devstral-small
- Results auto-symlinked to leaderboard/
- Benchmark handles retries, logging, error recovery

leaderboard/METHODOLOGY.md Normal file
@@ -0,0 +1,146 @@
# Diplomacy LLM Benchmark Methodology
## Overview
This benchmark evaluates Large Language Models (LLMs) on their ability to play the strategic board game Diplomacy, with a specific focus on measuring both **performance** (how well the model plays) and **steerability** (how responsive the model is to strategic prompt modifications). Each model plays as France against six identical opponent models (Mistral Devstral-Small) across multiple game iterations.
## Performance Score Calculation
### Game Score Formula
The benchmark uses a modified DiploBench scoring system that rewards both survival and victory:
- **Solo Winner**: `max_year + (max_year - win_year) + 18`
- Where `win_year` is when the power reached 18+ supply centers
- Rewards faster victories with bonus points
- **Survivor** (active at game end): `max_year + final_supply_centers`
- Rewards territorial control for nations that survive to the end
- **Eliminated**: `elimination_year`
- Year of elimination (relative to 1900)
**Key parameters:**
- `max_year`: Maximum game year (default: 1925) minus 1900 = 25
- All years are normalized by subtracting 1900
- Supply centers range from 0-34 (total centers on the board)
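Putting the three cases together, here is an illustrative Python sketch of the formula (the function name and signature are ours, not the benchmark's actual implementation; years are pre-normalized by subtracting 1900):

```python
def game_score(max_year: int, outcome: str, *, win_year: int = 0,
               final_scs: int = 0, elim_year: int = 0) -> int:
    """Sketch of the modified DiploBench score. Years are (calendar year - 1900)."""
    if outcome == "solo_win":
        # Base years + speed bonus + 18 for the solo victory
        return max_year + (max_year - win_year) + 18
    if outcome == "survived":
        # Survivors are credited with their final territorial holdings
        return max_year + final_scs
    # Eliminated: score is simply how long the power lasted
    return elim_year

# With max_year = 25 (game runs to 1925):
assert game_score(25, "solo_win", win_year=20) == 48   # 25 + 5 + 18
assert game_score(25, "survived", final_scs=11) == 36  # 25 + 11
assert game_score(25, "eliminated", elim_year=10) == 10
```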
### Aggregation
Performance is calculated as the **mean game score across all iterations** for the test model playing as France. In the production benchmark:
- 20 game iterations per experiment
- Games run to 1925 (25 years of gameplay)
- Parallel execution for efficiency
### Why These Metrics Matter
The game score formula captures:
- **Territorial expansion**: Higher supply center counts indicate successful negotiation and military strategy
- **Survival**: Avoiding elimination is critical in Diplomacy
- **Victory speed**: Faster solo victories earn bonus points
- **Competitive balance**: Scores are comparable across different game lengths
## Steerability Score Calculation
### Definition
Steerability quantifies how much a model's behavior changes in response to modified strategic prompts. We compare two prompt variants:
1. **Baseline prompts** (`prompts_benchmark`): Standard strategic guidance emphasizing balanced diplomacy
2. **Aggressive prompts** (`prompts_hold_reduction_v3`): Modified prompts encouraging more aggressive territorial expansion and reduced defensive holds
### Calculation Method
```
Steerability = Performance_aggressive - Performance_baseline
```
Where performance is measured by mean game score across all iterations.
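In code form, the calculation is a single difference of means. A sketch (function and variable names are illustrative):

```python
from statistics import mean

def steerability(aggressive_scores, baseline_scores):
    """Difference of mean game scores: positive means aggressive prompts helped."""
    return mean(aggressive_scores) - mean(baseline_scores)

# Example: aggressive prompting lifts the mean game score by 3 points
assert steerability([40, 36, 38], [35, 37, 33]) == 3
```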
### Interpretation
- **Positive steerability** (+): Model performs better with aggressive prompts, indicating successful behavioral adaptation
- **Negative steerability** (−): Model performs worse with aggressive prompts, suggesting either:
- Failure to adapt to prompt modifications
- Over-aggressive behavior that undermines diplomatic strategy
- **Near-zero steerability** (~0): Model behavior is largely invariant to prompt changes
### Why Steerability Matters
Steerability is a critical but under-explored dimension of LLM capability:
- **Controllability**: Models should adapt their strategy when users provide different instructions
- **Prompt sensitivity**: Measures whether models genuinely understand strategic guidance vs. memorized patterns
- **Real-world utility**: Production systems need models that can be steered toward desired behaviors
- **Safety implications**: Models that can't be steered may be difficult to align or control
## Experimental Setup
### Game Configuration
- **Test position**: France (3rd position in 7-nation setup)
- **Opponent model**: Mistral Devstral-Small (all 6 opposing nations)
- **Max year**: 1925 (production) / 1901 (test mode)
- **Iterations**: 20 games per experiment (production) / 3 games (test mode)
- **Parallel workers**: 20 concurrent games (production) / 3 (test mode)
### Prompt Variants
- **Baseline**: `ai_diplomacy/prompts/prompts_benchmark`
- Standard diplomatic and strategic guidance
- Balanced approach to offense and defense
- **Aggressive**: `ai_diplomacy/prompts/prompts_hold_reduction_v3`
- Emphasis on territorial expansion
- Reduction in defensive hold orders
- More assertive negotiation stance
### Test Model Configuration
- Models are inserted at position 2 (France) in the power order
- Order: Austria, England, **France**, Germany, Italy, Russia, Turkey
- All opponent positions use the same baseline opponent model
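One way to picture this assignment (the list and function names here are illustrative, not the repo's actual code):

```python
POWER_ORDER = ["AUSTRIA", "ENGLAND", "FRANCE", "GERMANY", "ITALY", "RUSSIA", "TURKEY"]
TEST_POSITION = 2  # 0-indexed slot for FRANCE

def assign_models(test_model: str, opponent_model: str) -> dict:
    # The test model occupies France; every other power gets the opponent model
    return {power: (test_model if i == TEST_POSITION else opponent_model)
            for i, power in enumerate(POWER_ORDER)}

assignments = assign_models("openai:gpt-4o", "devstral-small")
assert assignments["FRANCE"] == "openai:gpt-4o"
assert assignments["GERMANY"] == "devstral-small"
```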
## Data Collection
### Per-Game Metrics
For each game iteration, the system records:
- Supply center progression by phase
- Order types and success rates (moves, holds, supports, convoys)
- Diplomatic messaging patterns
- LLM response statistics and errors
- Game outcome and final rankings
### Aggregated Analysis
The `statistical_game_analysis` module produces:
- **Game-level CSV**: Per-game summaries with 88 metrics including game score, final supply centers, order statistics, and messaging patterns
- **Phase-level CSV**: Turn-by-turn state including supply centers, military units, relationships, and sentiment
- **Combined analysis**: Aggregated statistics across all game iterations
### Performance Tracking
- Console logs for each benchmark run
- Metadata files documenting model IDs and configuration
- Symlinks in `leaderboard/` directory for experiment discovery
- Automated comparison visualizations
## Limitations and Considerations
### Statistical Validity
- Sample size: 20 games per condition provides reasonable statistical power but may not capture rare outcomes
- Opponent variance: Using a single opponent model reduces confounding factors but may not generalize to diverse opponents
### Prompt Engineering
- Steerability measurement is sensitive to prompt design quality
- "Aggressive" may not be the optimal prompt modification for all models
- Some models may benefit from different strategic guidance
### Game Complexity
- Diplomacy involves significant stochasticity from opponent behavior
- Alliance formation and betrayal introduce non-deterministic outcomes
- Long game duration (1901-1925) increases variance in results
### Measurement Scope
- Benchmark focuses on win condition (supply centers) rather than other strategic dimensions
- Does not explicitly measure negotiation quality, deception capability, or long-term planning
- Performance as a single nation (France) may not generalize to other starting positions
### Technical Constraints
- Model errors and API failures can affect game completion rates
- Longer generation times may indicate different reasoning patterns but are not directly scored
- Prompt formatting differences across model providers may introduce artifacts

@@ -0,0 +1,217 @@
# Scoring Methodology Explanation
This document explains the two different scoring systems used in the AI Diplomacy benchmark and how they are applied across different metrics and visualizations.
## Two Scoring Systems Used
### 1. Raw Supply Center Count
**What it is:** The simple count of supply centers a power controls at the end of the game.
**How it's calculated:**
- Directly counts the number of supply centers (territories that can produce units) owned by a power when the game ends
- Range: 0-34 supply centers (theoretical maximum, though 18 is winning threshold)
- No bonuses or penalties applied
**Example:**
- France ends game with 11 supply centers → Raw SC score = 11
- France eliminated with 0 supply centers → Raw SC score = 0
- France wins solo with 18+ supply centers → Raw SC score = 18-34
**When it's used:**
- Supplementary metric in both CSV files for transparency
- Useful for understanding territorial control independent of game outcome timing
- Easier to interpret for quick comparisons
### 2. Custom Game Score (DiploBench-style)
**What it is:** A sophisticated scoring system that rewards survival, victory speed, and penalizes early elimination.
**How it's calculated:**
The formula depends on the game outcome:
#### Case 1: Solo Winner (18+ supply centers)
```
score = max_year + (max_year - win_year) + 18
```
- `max_year`: Maximum game year (typically 1925) minus 1900 = 25
- `win_year`: Year when 18 supply centers were reached minus 1900
- The `(max_year - win_year)` term rewards faster victories
- The `+ 18` bonus represents the solo win achievement
**Example:**
- Game ends in 1925 (max_year = 25)
- France wins solo in 1920 (win_year = 20)
- Score = 25 + (25 - 20) + 18 = 25 + 5 + 18 = **48 points**
- If France won in 1915 instead: 25 + 10 + 18 = **53 points** (faster win = higher score)
#### Case 2: Survivor (active at game end, no solo winner)
```
score = max_year + final_supply_centers
```
**Example:**
- Game ends in 1925 (max_year = 25)
- France survives with 11 supply centers
- Score = 25 + 11 = **36 points**
- If France had 8 supply centers: 25 + 8 = **33 points**
#### Case 3: Eliminated (0 supply centers before game end)
```
score = elimination_year
```
- `elimination_year`: Year of elimination minus 1900
**Example:**
- France eliminated in 1910
- Score = 1910 - 1900 = **10 points**
- If eliminated in 1905: score = **5 points** (earlier elimination = lower score)
#### Case 4: Lost to Solo Winner (still had units but someone else won)
```
score = win_year
```
- The losing power's score is the year of the solo victory (minus 1900); the value resembles an elimination-year score but the outcome is tracked separately
**Key Properties:**
- **Winning range:** Typically 42-70+ points (varies by win speed)
- **Survival range:** Typically 28-42 points (25 + low SCs to 25 + 17 SCs)
- **Elimination range:** Typically 5-24 points (early elimination to late elimination)
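The four cases can be collected into one function; the sketch below mirrors the formulas above and checks them against the worked examples (this is our illustration, not the actual code in `experiment_runner/analysis/summary.py`):

```python
def game_score(outcome: str, max_year: int = 25, *, win_year: int = 0,
               final_scs: int = 0, elim_year: int = 0) -> int:
    """Sketch of the DiploBench-style score. Years are (calendar year - 1900)."""
    if outcome == "solo_win":
        return max_year + (max_year - win_year) + 18
    if outcome == "survived":
        return max_year + final_scs
    if outcome == "eliminated":
        return elim_year
    if outcome == "lost_to_solo":
        return win_year  # year someone else won solo
    raise ValueError(f"unknown outcome: {outcome}")

# Worked examples from above (max_year = 25):
assert game_score("solo_win", win_year=20) == 48
assert game_score("solo_win", win_year=15) == 53   # faster win = higher score
assert game_score("survived", final_scs=11) == 36
assert game_score("survived", final_scs=8) == 33
assert game_score("eliminated", elim_year=10) == 10
assert game_score("eliminated", elim_year=5) == 5
```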
## Which Score is Used Where
### France Bar Chart Visualization (`france_scores_bar.png`)
**Score used:** Custom Game Score (DiploBench-style)
**Rationale:**
- The bar chart is titled "Average Game Score" and uses the `game_score` column
- This provides a more nuanced view of performance that accounts for survival time and victory quality
- Better differentiates between "survived weakly" vs "eliminated late" vs "won fast"
**Code reference:** `/Users/alxdfy/Documents/mldev/AI_Diplomacy/analyze_diplomacy_performance_v3_textured.py`, lines 595-596
### Overall Performance Leaderboard (`overall_performance.csv`)
**Primary metric:** Custom Game Score (columns: `france_mean_score`, `france_median_score`)
**Secondary metric:** Raw Supply Centers (columns: `raw_supply_centers_mean`, `raw_supply_centers_median`)
**Why both are included:**
- **Game Score** is the primary ranking metric because it better captures overall gameplay quality
- **Raw Supply Centers** provides context and transparency about territorial control
- Together they give a complete picture: a model might have high territorial control but poor survival time, or vice versa
**Example interpretation:**
- Model A: game_score = 45, raw_sc = 15 → Likely survived to 1925 with moderate territory
- Model B: game_score = 15, raw_sc = 8 → Likely eliminated early despite having reasonable territory at that point
- Model C: game_score = 55, raw_sc = 19 → Likely won solo or dominated late game
### Steerability Leaderboard (`steerability.csv`)
**Primary metric:** Custom Game Score difference (columns: `steerability_score`, `steerability_percentage`)
**Secondary metric:** Raw Supply Center difference (columns: `steerability_score_raw`, `steerability_percentage_raw`)
**Why both are included:**
- **Game Score steerability** measures the true impact of aggressive prompting on overall performance
- **Raw SC steerability** shows the pure territorial control difference
- Models can be steerable in different ways:
- High game score steerability + low raw SC steerability → Better survival/timing, not just territory
- High raw SC steerability + lower game score steerability → More territory but possibly worse timing
**Example interpretation:**
- Model X: steerability_score = +15, steerability_score_raw = +10
- Aggressive prompting adds 15 game score points and 10 supply centers
- The difference (15 vs 10) suggests better survival timing in addition to more territory
- Model Y: steerability_score = -5, steerability_score_raw = -2
- Aggressive prompting actually hurts performance (negative steerability)
- Loses 5 game score points and 2 supply centers
- Model performs better with neutral/baseline prompting
## Why Both Scores Matter
### Complementary Insights
1. **Game Score** captures:
- Victory quality (how fast)
- Survival duration (eliminated when)
- Overall strategic success
- Risk-reward tradeoffs
2. **Raw Supply Centers** captures:
- Territorial expansion ability
- Pure diplomatic/military success
- Easier to interpret and compare
- Independent of timing considerations
### Research Value
Having both metrics allows researchers to:
- **Identify interesting patterns:** A model might be excellent at gaining territory (high raw SC) but poor at survival (low game score)
- **Understand steerability mechanisms:** Does aggressive prompting lead to more territory, better timing, or both?
- **Compare fairly:** Different use cases might prioritize different aspects of performance
- **Validate results:** If both metrics agree, it strengthens confidence in the findings
### Example Use Cases
**Use Case 1: Deployment Decision**
- If you want a model for long games → prioritize `game_score` (measures endurance)
- If you want a model for territorial expansion → consider `raw_supply_centers` (measures conquest)
**Use Case 2: Steerability Analysis**
- High `steerability_score` but low `steerability_score_raw` → Model becomes more strategic with timing
- Similar values → Steerability primarily affects territorial control
- Negative values → Model responds poorly to aggressive prompting
## Column Definitions
### overall_performance.csv
| Column | Definition |
|--------|------------|
| `model_name` | Base name of the model being benchmarked |
| `best_variant` | Which variant (baseline or aggressive) performed better |
| `france_mean_score` | Average Custom Game Score across all games (primary ranking metric) |
| `france_median_score` | Median Custom Game Score across all games (robust to outliers) |
| `raw_supply_centers_mean` | Average raw supply center count at game end |
| `raw_supply_centers_median` | Median raw supply center count at game end |
| `france_win_rate` | Percentage of games won (18+ supply centers) |
| `total_games` | Number of games played |
| `avg_phase_time_minutes` | Average time per game phase in minutes |
| `error_rate` | Percentage of API/LLM errors during gameplay |
**Sorting:** Descending by `france_mean_score` (higher is better)
### steerability.csv
| Column | Definition |
|--------|------------|
| `model_name` | Base name of the model being analyzed |
| `baseline_mean_score` | Average Custom Game Score for baseline (neutral) prompting |
| `aggressive_mean_score` | Average Custom Game Score for aggressive prompting |
| `baseline_raw_supply_centers` | Average raw supply centers for baseline prompting |
| `aggressive_raw_supply_centers` | Average raw supply centers for aggressive prompting |
| `steerability_score` | Difference in Custom Game Score (aggressive - baseline) |
| `steerability_percentage` | Percentage change in Custom Game Score ((aggressive - baseline) / baseline * 100) |
| `steerability_score_raw` | Difference in raw supply centers (aggressive - baseline) |
| `steerability_percentage_raw` | Percentage change in raw supply centers ((aggressive - baseline) / baseline * 100) |
| `direction` | "positive" if aggressive improves performance, "negative" if it hurts |
| `baseline_games` | Number of baseline games played |
| `aggressive_games` | Number of aggressive games played |
**Sorting:** Descending by `steerability_score` (higher positive values = more steerable toward aggression)
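A sketch of how the core steerability columns could be derived from the per-variant means (column names match the table above; the helper function itself is hypothetical):

```python
def steerability_row(baseline_mean: float, aggressive_mean: float) -> dict:
    # Difference and percentage change, aggressive relative to baseline
    diff = aggressive_mean - baseline_mean
    return {
        "steerability_score": diff,
        "steerability_percentage": diff / baseline_mean * 100,
        "direction": "positive" if diff > 0 else "negative",
    }

row = steerability_row(baseline_mean=30.0, aggressive_mean=33.0)
assert row["steerability_score"] == 3.0
assert row["steerability_percentage"] == 10.0
assert row["direction"] == "positive"
```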
## Summary
- **Primary Metric:** Custom Game Score (DiploBench-style) - used for all rankings and bar chart
- **Secondary Metric:** Raw Supply Centers - provided for transparency and complementary analysis
- **France Bar Chart:** Uses Custom Game Score
- **Steerability:** Measures both metrics to understand different aspects of prompt influence
- **Both metrics together** provide a complete picture of model performance and behavior
For questions about the scoring implementation, see:
- Game score calculation: `/Users/alxdfy/Documents/mldev/AI_Diplomacy/experiment_runner/analysis/summary.py` (lines 77-118)
- Data collection: `/Users/alxdfy/Documents/mldev/AI_Diplomacy/analyze_diplomacy_performance_v3_textured.py` (lines 577-611)

leaderboard/full_comparison.py Executable file
@@ -0,0 +1,203 @@
#!/usr/bin/env python3
"""
Auto-discovery leaderboard comparison script.

Automatically discovers all experiments in the leaderboard/ directory,
groups them by model name (baseline vs aggressive variants), and generates
comprehensive comparison visualizations.

Naming convention: {model_name}-baseline and {model_name}-aggressive
Example: gpt_5_medium-baseline, gpt_5_medium-aggressive
"""
from pathlib import Path
import sys


def discover_experiments(leaderboard_dir="."):
    """
    Discover all experiments in the leaderboard directory.

    Returns:
        dict: Mapping of display labels to absolute paths
    """
    leaderboard_path = Path(leaderboard_dir)
    if not leaderboard_path.exists():
        print(f"Error: {leaderboard_dir} directory not found")
        sys.exit(1)

    experiments = {}
    # Scan for all directories/symlinks in leaderboard folder
    for item in sorted(leaderboard_path.iterdir()):
        # Skip the script itself, output directory, temp files, and log files
        if item.name in ['full_comparison.py', 'leaderboard_comparison', 'temp_leaderboard_paths.json'] or item.name.endswith('.log'):
            continue
        if item.is_dir() or item.is_symlink():
            # Get the name and resolve symlink if needed
            name = item.name
            resolved_path = item.resolve()
            # Create display label: strip the variant suffix and replace
            # underscores with hyphens; baseline variants keep the bare model
            # name, aggressive variants keep an explicit "-aggressive" suffix.
            # e.g., gpt_5_medium-baseline  -> gpt-5-medium
            # e.g., sonoma_sky-aggressive  -> sonoma-sky-aggressive
            if '-baseline' in name:
                model_name = name.replace('-baseline', '').replace('_', '-')
                display_label = f"{model_name}"
            elif '-aggressive' in name:
                model_name = name.replace('-aggressive', '').replace('_', '-')
                display_label = f"{model_name}-aggressive"
            else:
                # Fallback for any non-standard naming
                display_label = name.replace('_', '-')
            experiments[display_label] = str(resolved_path)
    return experiments


def main():
    print("=== Leaderboard Auto-Discovery Comparison ===\n")

    # Get absolute paths for script location and parent directory
    script_dir = Path(__file__).resolve().parent
    parent_dir = script_dir.parent

    # Discover all experiments (using absolute path to leaderboard dir)
    print("Discovering experiments in leaderboard directory...")
    experiments = discover_experiments(script_dir)
    if not experiments:
        print("No experiments found in leaderboard directory")
        return

    print(f"Found {len(experiments)} experiments:")
    for label in sorted(experiments.keys()):
        print(f"  - {label}")
    print()

    # Create output directory (absolute path)
    output_dir = script_dir / "leaderboard_comparison"
    output_dir.mkdir(exist_ok=True)
    print(f"Output directory: {output_dir}/\n")

    # Import visualization functions from parent directory
    print("Loading analysis modules...")
    sys.path.insert(0, str(parent_dir))
    from analyze_diplomacy_performance_v3_textured import (
        collect_timing_data_v3, create_stacked_bar_chart_v3,
        collect_move_type_data, create_move_type_chart,
        collect_error_data, create_error_chart
    )

    # Collect and generate timing analysis
    print("\nCollecting timing data (concurrent operations)...")
    timing_data = collect_timing_data_v3(experiments)
    if timing_data:
        create_stacked_bar_chart_v3(timing_data, output_dir / "phase_timing_comparison.png")
        print("  ✓ Generated phase_timing_comparison.png")

    # Collect and generate move type analysis
    print("\nCollecting move type data...")
    move_data = collect_move_type_data(experiments)
    if move_data:
        create_move_type_chart(move_data, output_dir / "move_type_comparison.png")
        print("  ✓ Generated move_type_comparison.png")

    # Collect and generate error analysis
    print("\nCollecting error data...")
    error_data = collect_error_data(experiments)
    if error_data:
        create_error_chart(error_data, output_dir / "error_comparison.png")
        print("  ✓ Generated error_comparison.png")

    # Import additional analysis functions
    print("\nGenerating additional visualizations...")
    from analyze_diplomacy_performance_v3_textured import (
        collect_france_scores,
        plot_france_scores_bar,
        plot_france_scores_box,
        plot_diplomatic_credit_heatmap,
        plot_relative_sentiment
    )

    # Reverse mapping for compatibility with existing functions
    path_to_label = {v: k for k, v in experiments.items()}

    # Generate remaining visualizations directly (no subprocess needed)
    try:
        # Collect France scores once and reuse
        print("  Collecting France game scores...")
        all_scores_df = collect_france_scores(path_to_label)
        if not all_scores_df.empty:
            print("  Generating score visualizations...")
            plot_france_scores_bar(all_scores_df, output_dir / "france_scores_bar.png")
            plot_france_scores_box(all_scores_df, output_dir / "france_scores_box.png")
            print("  ✓ Generated score visualizations")

        # Generate diplomatic heatmaps
        print("  Generating diplomatic heatmaps...")
        plot_diplomatic_credit_heatmap(path_to_label, "other_to_france", output_dir / "heatmap_other_to_france.png")
        plot_diplomatic_credit_heatmap(path_to_label, "france_to_other", output_dir / "heatmap_france_to_other.png")
        print("  ✓ Generated heatmaps")

        # Generate relative sentiment chart
        print("  Generating relative sentiment chart...")
        plot_relative_sentiment(path_to_label, output_dir / "relative_sentiment.png")
        print("  ✓ Generated relative sentiment chart")
        print("  ✓ Generated additional visualizations")
    except Exception as e:
        print(f"  ⚠ Warning: Some visualizations may have failed: {e}")

    # List generated files
    files = sorted(output_dir.glob("*.png"))
    print("\n=== Complete! ===")
    print(f"Generated {len(files)} visualizations in {output_dir}/:")
    for f in files:
        print(f"  - {f.name}")

    # Print summary by model family
    print("\n=== Experiments by Family ===")
    families = {
        'GPT-5': [],
        'GPT-OSS': [],
        'O-series': [],
        'Gemini': [],
        'Hermes': [],
        'Sonoma': [],
        'Other': []
    }
    for label in sorted(experiments.keys()):
        if 'gpt-5' in label.lower():
            families['GPT-5'].append(label)
        elif 'gpt-oss' in label.lower():
            families['GPT-OSS'].append(label)
        elif label.startswith('o3') or label.startswith('o4'):
            families['O-series'].append(label)
        elif 'gemini' in label.lower():
            families['Gemini'].append(label)
        elif 'hermes' in label.lower():
            families['Hermes'].append(label)
        elif 'sonoma' in label.lower():
            families['Sonoma'].append(label)
        else:
            families['Other'].append(label)
    for family, models in families.items():
        if models:
            print(f"\n{family}:")
            for model in models:
                print(f"  - {model}")

    print("\n✓ Leaderboard comparison complete!")


if __name__ == "__main__":
    main()