diff --git a/leaderboard/BENCHMARK_GUIDE.md b/leaderboard/BENCHMARK_GUIDE.md new file mode 100644 index 0000000..a28ecf6 --- /dev/null +++ b/leaderboard/BENCHMARK_GUIDE.md @@ -0,0 +1,74 @@ +# Diplomacy Benchmark Guide + +## Single Model Benchmark + +```bash +# Production: 20 games to 1925 +python run_benchmark.py --model_id "openai:gpt-4o" --friendly_name "gpt_4o" + +# Test: 3 games to 1901 +python run_benchmark.py --model_id "openai:gpt-4o" --friendly_name "test" --test + +# Baseline only +python run_benchmark.py --model_id "..." --friendly_name "..." --baseline-only + +# Aggressive only +python run_benchmark.py --model_id "..." --friendly_name "..." --aggressive-only +``` + +## Queue Multiple Models + +Edit `run_benchmark_queue.sh`: +```bash +MODELS=( + "model_id|friendly_name|baseline_only|aggressive_only" + "openai:gpt-4o|gpt_4o||" # Both modes + "gemini-2.5-flash|gemini|true|" # Baseline only + "claude-3.5-sonnet|claude||true" # Aggressive only +) +``` + +Run: +```bash +./run_benchmark_queue.sh +``` + +## Check Status + +```bash +# Queue progress +tail -f /tmp/benchmark_queue/queue.log + +# Individual model logs +tail -f /tmp/benchmark_queue/{friendly_name}_{baseline|aggressive}.log + +# Running games (check FRANCE progress) +ls -la results/{friendly_name}_{baseline|aggressive}/games/*/ +``` + +## Visualize Results + +```bash +# Generate comparison plots +python leaderboard/full_comparison.py + +# View plots +open leaderboard/*.png +``` + +## Key Locations + +| Item | Path | +|------|------| +| **Results** | `results/{friendly_name}_{baseline\|aggressive}/` | +| **Leaderboard Links** | `leaderboard/{friendly_name}-{baseline\|aggressive}` | +| **Queue Logs** | `/tmp/benchmark_queue/` | +| **Prompts Baseline** | `prompts_benchmark/` | +| **Prompts Aggressive** | `prompts_hold_reduction_v3/` | + +## Notes + +- Test model plays FRANCE (position 2, 0-indexed) +- Opponents: devstral-small +- Results auto-symlinked to leaderboard/ +- Benchmark handles retries, logging, error recovery \ No newline at end of file diff --git a/leaderboard/METHODOLOGY.md b/leaderboard/METHODOLOGY.md new file mode 100644 index 0000000..1ec4977 --- /dev/null +++ b/leaderboard/METHODOLOGY.md @@ -0,0 +1,146 @@ +# Diplomacy LLM Benchmark Methodology + +## Overview + +This benchmark evaluates Large Language Models (LLMs) on their ability to play the strategic board game Diplomacy, with a specific focus on measuring both **performance** (how well the model plays) and **steerability** (how responsive the model is to strategic prompt modifications). Each model plays as France against six identical opponent models (Mistral Devstral-Small) across multiple game iterations. 
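+
+As a concrete preview of the two measurements defined in the sections below, the following sketch shows how a single comparison could be assembled. It is illustrative only: the `GameOutcome` fields and the `game_score` function are invented for this example and are not the actual implementation in `experiment_runner/analysis/summary.py`.
+
+```python
+from dataclasses import dataclass
+from statistics import mean
+
+@dataclass
+class GameOutcome:
+    won_solo: bool = False         # reached 18+ supply centers
+    eliminated: bool = False
+    win_year: int = 0              # normalized year of the solo win (year - 1900)
+    elimination_year: int = 0      # normalized year of elimination
+    final_supply_centers: int = 0  # supply centers held at game end
+
+def game_score(g: GameOutcome, max_year: int = 25) -> int:
+    """DiploBench-style score for one game ('lost to a solo winner' case omitted)."""
+    if g.won_solo:
+        return max_year + (max_year - g.win_year) + 18  # faster wins score higher
+    if g.eliminated:
+        return g.elimination_year                       # later eliminations score higher
+    return max_year + g.final_supply_centers            # survivors rewarded for territory
+
+# Performance is the mean score per prompt variant; steerability is their difference.
+baseline   = [GameOutcome(final_supply_centers=11)]    # survives to 1925 with 11 SCs -> 36
+aggressive = [GameOutcome(won_solo=True, win_year=20)]  # solo win in 1920 -> 48
+print(mean(map(game_score, aggressive)) - mean(map(game_score, baseline)))  # 12
+```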
+ +## Performance Score Calculation + +### Game Score Formula + +The benchmark uses a modified DiploBench scoring system that rewards both survival and victory: + +- **Solo Winner**: `max_year + (max_year - win_year) + 18` + - Where `win_year` is when the power reached 18+ supply centers + - Rewards faster victories with bonus points + +- **Survivor** (active at game end): `max_year + final_supply_centers` + - Rewards territorial control for nations that survive to the end + +- **Eliminated**: `elimination_year` + - Year of elimination (relative to 1900) + +**Key parameters:** +- `max_year`: Maximum game year (default: 1925) minus 1900 = 25 +- All years are normalized by subtracting 1900 +- Supply centers range from 0-34 (total centers on the board) + +### Aggregation + +Performance is calculated as the **mean game score across all iterations** for the test model playing as France. In the production benchmark: +- 20 game iterations per experiment +- Games run to 1925 (25 years of gameplay) +- Parallel execution for efficiency + +### Why These Metrics Matter + +The game score formula captures: +- **Territorial expansion**: Higher supply center counts indicate successful negotiation and military strategy +- **Survival**: Avoiding elimination is critical in Diplomacy +- **Victory speed**: Faster solo victories earn bonus points +- **Competitive balance**: Scores are comparable across different game lengths + +## Steerability Score Calculation + +### Definition + +Steerability quantifies how much a model's behavior changes in response to modified strategic prompts. We compare two prompt variants: + +1. **Baseline prompts** (`prompts_benchmark`): Standard strategic guidance emphasizing balanced diplomacy +2. **Aggressive prompts** (`prompts_hold_reduction_v3`): Modified prompts encouraging more aggressive territorial expansion and reduced defensive holds + +### Calculation Method + +``` +Steerability = Performance_aggressive - Performance_baseline +``` + +Where performance is measured by mean game score across all iterations. + +### Interpretation + +- **Positive steerability** (+): Model performs better with aggressive prompts, indicating successful behavioral adaptation +- **Negative steerability** (−): Model performs worse with aggressive prompts, suggesting either: + - Failure to adapt to prompt modifications + - Over-aggressive behavior that undermines diplomatic strategy +- **Near-zero steerability** (~0): Model behavior is largely invariant to prompt changes + +### Why Steerability Matters + +Steerability is a critical but under-explored dimension of LLM capability: +- **Controllability**: Models should adapt their strategy when users provide different instructions +- **Prompt sensitivity**: Measures whether models genuinely understand strategic guidance vs. 
memorized patterns +- **Real-world utility**: Production systems need models that can be steered toward desired behaviors +- **Safety implications**: Models that can't be steered may be difficult to align or control + +## Experimental Setup + +### Game Configuration +- **Test position**: France (3rd position in 7-nation setup) +- **Opponent model**: Mistral Devstral-Small (all 6 opposing nations) +- **Max year**: 1925 (production) / 1901 (test mode) +- **Iterations**: 20 games per experiment (production) / 3 games (test mode) +- **Parallel workers**: 20 concurrent games (production) / 3 (test mode) + +### Prompt Variants +- **Baseline**: `ai_diplomacy/prompts/prompts_benchmark` + - Standard diplomatic and strategic guidance + - Balanced approach to offense and defense + +- **Aggressive**: `ai_diplomacy/prompts/prompts_hold_reduction_v3` + - Emphasis on territorial expansion + - Reduction in defensive hold orders + - More assertive negotiation stance + +### Test Model Configuration +- Models are inserted at position 2 (France) in the power order +- Order: Austria, England, **France**, Germany, Italy, Russia, Turkey +- All opponent positions use the same baseline opponent model + +## Data Collection + +### Per-Game Metrics +For each game iteration, the system records: +- Supply center progression by phase +- Order types and success rates (moves, holds, supports, convoys) +- Diplomatic messaging patterns +- LLM response statistics and errors +- Game outcome and final rankings + +### Aggregated Analysis +The `statistical_game_analysis` module produces: +- **Game-level CSV**: Per-game summaries with 88 metrics including game score, final supply centers, order statistics, and messaging patterns +- **Phase-level CSV**: Turn-by-turn state including supply centers, military units, relationships, and sentiment +- **Combined analysis**: Aggregated statistics across all game iterations + +### Performance Tracking +- Console logs for each benchmark run +- Metadata files documenting model IDs and configuration +- Symlinks in `leaderboard/` directory for experiment discovery +- Automated comparison visualizations + +## Limitations and Considerations + +### Statistical Validity +- Sample size: 20 games per condition provides reasonable statistical power but may not capture rare outcomes +- Opponent variance: Using a single opponent model reduces confounding factors but may not generalize to diverse opponents + +### Prompt Engineering +- Steerability measurement is sensitive to prompt design quality +- "Aggressive" may not be the optimal prompt modification for all models +- Some models may benefit from different strategic guidance + +### Game Complexity +- Diplomacy involves significant stochasticity from opponent behavior +- Alliance formation and betrayal introduce non-deterministic outcomes +- Long game duration (1901-1925) increases variance in results + +### Measurement Scope +- Benchmark focuses on win condition (supply centers) rather than other strategic dimensions +- Does not explicitly measure negotiation quality, deception capability, or long-term planning +- Performance as a single nation (France) may not generalize to other starting positions + +### Technical Constraints +- Model errors and API failures can affect game completion rates +- Longer generation times may indicate different reasoning patterns but are not directly scored +- Prompt formatting differences across model providers may introduce artifacts diff --git a/leaderboard/SCORING_EXPLAINED.md b/leaderboard/SCORING_EXPLAINED.md new 
file mode 100644 index 0000000..36c0b6a --- /dev/null +++ b/leaderboard/SCORING_EXPLAINED.md @@ -0,0 +1,217 @@ +# Scoring Methodology Explanation + +This document explains the two different scoring systems used in the AI Diplomacy benchmark and how they are applied across different metrics and visualizations. + +## Two Scoring Systems Used + +### 1. Raw Supply Center Count + +**What it is:** The simple count of supply centers a power controls at the end of the game. + +**How it's calculated:** +- Directly counts the number of supply centers (territories that can produce units) owned by a power when the game ends +- Range: 0-34 supply centers (theoretical maximum, though 18 is winning threshold) +- No bonuses or penalties applied + +**Example:** +- France ends game with 11 supply centers → Raw SC score = 11 +- France eliminated with 0 supply centers → Raw SC score = 0 +- France wins solo with 18+ supply centers → Raw SC score = 18-34 + +**When it's used:** +- Supplementary metric in both CSV files for transparency +- Useful for understanding territorial control independent of game outcome timing +- Easier to interpret for quick comparisons + +### 2. Custom Game Score (DiploBench-style) + +**What it is:** A sophisticated scoring system that rewards survival, victory speed, and penalizes early elimination. + +**How it's calculated:** + +The formula depends on the game outcome: + +#### Case 1: Solo Winner (18+ supply centers) +``` +score = max_year + (max_year - win_year) + 18 +``` +- `max_year`: Maximum game year (typically 1925) minus 1900 = 25 +- `win_year`: Year when 18 supply centers were reached minus 1900 +- The `(max_year - win_year)` term rewards faster victories +- The `+ 18` bonus represents the solo win achievement + +**Example:** +- Game ends in 1925 (max_year = 25) +- France wins solo in 1920 (win_year = 20) +- Score = 25 + (25 - 20) + 18 = 25 + 5 + 18 = **48 points** +- If France won in 1915 instead: 25 + 10 + 18 = **53 points** (faster win = higher score) + +#### Case 2: Survivor (active at game end, no solo winner) +``` +score = max_year + final_supply_centers +``` + +**Example:** +- Game ends in 1925 (max_year = 25) +- France survives with 11 supply centers +- Score = 25 + 11 = **36 points** +- If France had 8 supply centers: 25 + 8 = **33 points** + +#### Case 3: Eliminated (0 supply centers before game end) +``` +score = elimination_year +``` +- `elimination_year`: Year of elimination minus 1900 + +**Example:** +- France eliminated in 1910 +- Score = 1910 - 1900 = **10 points** +- If eliminated in 1905: score = **5 points** (earlier elimination = lower score) + +#### Case 4: Lost to Solo Winner (still had units but someone else won) +``` +score = win_year +``` +- Same as elimination year but marked differently + +**Key Properties:** +- **Winning range:** Typically 42-70+ points (varies by win speed) +- **Survival range:** Typically 28-42 points (25 + low SCs to 25 + 17 SCs) +- **Elimination range:** Typically 5-24 points (early elimination to late elimination) + +## Which Score is Used Where + +### France Bar Chart Visualization (`france_scores_bar.png`) + +**Score used:** Custom Game Score (DiploBench-style) + +**Rationale:** +- The bar chart is titled "Average Game Score" and uses the `game_score` column +- This provides a more nuanced view of performance that accounts for survival time and victory quality +- Better differentiates between "survived weakly" vs "eliminated late" vs "won fast" + +**Code reference:** 
`/Users/alxdfy/Documents/mldev/AI_Diplomacy/analyze_diplomacy_performance_v3_textured.py`, lines 595-596 + +### Overall Performance Leaderboard (`overall_performance.csv`) + +**Primary metric:** Custom Game Score (columns: `france_mean_score`, `france_median_score`) + +**Secondary metric:** Raw Supply Centers (columns: `raw_supply_centers_mean`, `raw_supply_centers_median`) + +**Why both are included:** +- **Game Score** is the primary ranking metric because it better captures overall gameplay quality +- **Raw Supply Centers** provides context and transparency about territorial control +- Together they give a complete picture: a model might have high territorial control but poor survival time, or vice versa + +**Example interpretation:** +- Model A: game_score = 45, raw_sc = 15 → Likely survived to 1925 with moderate territory +- Model B: game_score = 15, raw_sc = 8 → Likely eliminated early despite having reasonable territory at that point +- Model C: game_score = 55, raw_sc = 19 → Likely won solo or dominated late game + +### Steerability Leaderboard (`steerability.csv`) + +**Primary metric:** Custom Game Score difference (columns: `steerability_score`, `steerability_percentage`) + +**Secondary metric:** Raw Supply Center difference (columns: `steerability_score_raw`, `steerability_percentage_raw`) + +**Why both are included:** +- **Game Score steerability** measures the true impact of aggressive prompting on overall performance +- **Raw SC steerability** shows the pure territorial control difference +- Models can be steerable in different ways: + - High game score steerability + low raw SC steerability → Better survival/timing, not just territory + - High raw SC steerability + lower game score steerability → More territory but possibly worse timing + +**Example interpretation:** +- Model X: steerability_score = +15, steerability_score_raw = +10 + - Aggressive prompting adds 15 game score points and 10 supply centers + - The difference (15 vs 10) suggests better survival timing in addition to more territory + +- Model Y: steerability_score = -5, steerability_score_raw = -2 + - Aggressive prompting actually hurts performance (negative steerability) + - Loses 5 game score points and 2 supply centers + - Model performs better with neutral/baseline prompting + +## Why Both Scores Matter + +### Complementary Insights + +1. **Game Score** captures: + - Victory quality (how fast) + - Survival duration (eliminated when) + - Overall strategic success + - Risk-reward tradeoffs + +2. **Raw Supply Centers** captures: + - Territorial expansion ability + - Pure diplomatic/military success + - Easier to interpret and compare + - Independent of timing considerations + +### Research Value + +Having both metrics allows researchers to: +- **Identify interesting patterns:** A model might be excellent at gaining territory (high raw SC) but poor at survival (low game score) +- **Understand steerability mechanisms:** Does aggressive prompting lead to more territory, better timing, or both? 
+- **Compare fairly:** Different use cases might prioritize different aspects of performance +- **Validate results:** If both metrics agree, it strengthens confidence in the findings + +### Example Use Cases + +**Use Case 1: Deployment Decision** +- If you want a model for long games → prioritize `game_score` (measures endurance) +- If you want a model for territorial expansion → consider `raw_supply_centers` (measures conquest) + +**Use Case 2: Steerability Analysis** +- High `steerability_score` but low `steerability_score_raw` → Model becomes more strategic with timing +- Similar values → Steerability primarily affects territorial control +- Negative values → Model responds poorly to aggressive prompting + +## Column Definitions + +### overall_performance.csv + +| Column | Definition | +|--------|------------| +| `model_name` | Base name of the model being benchmarked | +| `best_variant` | Which variant (baseline or aggressive) performed better | +| `france_mean_score` | Average Custom Game Score across all games (primary ranking metric) | +| `france_median_score` | Median Custom Game Score across all games (robust to outliers) | +| `raw_supply_centers_mean` | Average raw supply center count at game end | +| `raw_supply_centers_median` | Median raw supply center count at game end | +| `france_win_rate` | Percentage of games won (18+ supply centers) | +| `total_games` | Number of games played | +| `avg_phase_time_minutes` | Average time per game phase in minutes | +| `error_rate` | Percentage of API/LLM errors during gameplay | + +**Sorting:** Descending by `france_mean_score` (higher is better) + +### steerability.csv + +| Column | Definition | +|--------|------------| +| `model_name` | Base name of the model being analyzed | +| `baseline_mean_score` | Average Custom Game Score for baseline (neutral) prompting | +| `aggressive_mean_score` | Average Custom Game Score for aggressive prompting | +| `baseline_raw_supply_centers` | Average raw supply centers for baseline prompting | +| `aggressive_raw_supply_centers` | Average raw supply centers for aggressive prompting | +| `steerability_score` | Difference in Custom Game Score (aggressive - baseline) | +| `steerability_percentage` | Percentage change in Custom Game Score ((aggressive - baseline) / baseline * 100) | +| `steerability_score_raw` | Difference in raw supply centers (aggressive - baseline) | +| `steerability_percentage_raw` | Percentage change in raw supply centers ((aggressive - baseline) / baseline * 100) | +| `direction` | "positive" if aggressive improves performance, "negative" if it hurts | +| `baseline_games` | Number of baseline games played | +| `aggressive_games` | Number of aggressive games played | + +**Sorting:** Descending by `steerability_score` (higher positive values = more steerable toward aggression) + +## Summary + +- **Primary Metric:** Custom Game Score (DiploBench-style) - used for all rankings and bar chart +- **Secondary Metric:** Raw Supply Centers - provided for transparency and complementary analysis +- **France Bar Chart:** Uses Custom Game Score +- **Steerability:** Measures both metrics to understand different aspects of prompt influence +- **Both metrics together** provide a complete picture of model performance and behavior + +For questions about the scoring implementation, see: +- Game score calculation: `/Users/alxdfy/Documents/mldev/AI_Diplomacy/experiment_runner/analysis/summary.py` (lines 77-118) +- Data collection: 
`/Users/alxdfy/Documents/mldev/AI_Diplomacy/analyze_diplomacy_performance_v3_textured.py` (lines 577-611)
diff --git a/leaderboard/full_comparison.py b/leaderboard/full_comparison.py
new file mode 100755
index 0000000..fd7e8de
--- /dev/null
+++ b/leaderboard/full_comparison.py
@@ -0,0 +1,203 @@
+#!/usr/bin/env python3
+"""
+Auto-discovery leaderboard comparison script.
+
+Automatically discovers all experiments in the leaderboard/ directory,
+groups them by model name (baseline vs aggressive variants), and generates
+comprehensive comparison visualizations.
+
+Naming convention: {model_name}-baseline and {model_name}-aggressive
+Example: gpt_5_medium-baseline, gpt_5_medium-aggressive
+"""
+
+import json
+import subprocess
+from pathlib import Path
+import sys
+
+def discover_experiments(leaderboard_dir="."):
+    """
+    Discover all experiments in the leaderboard directory.
+
+    Returns:
+        dict: Mapping of display labels to absolute paths
+    """
+    leaderboard_path = Path(leaderboard_dir)
+
+    if not leaderboard_path.exists():
+        print(f"Error: {leaderboard_dir} directory not found")
+        sys.exit(1)
+
+    experiments = {}
+
+    # Scan for all directories/symlinks in leaderboard folder
+    for item in sorted(leaderboard_path.iterdir()):
+        # Skip the script itself, output directory, temp files, and log files
+        if item.name in ['full_comparison.py', 'leaderboard_comparison', 'temp_leaderboard_paths.json'] or item.name.endswith('.log'):
+            continue
+
+        if item.is_dir() or item.is_symlink():
+            # Get the name and resolve symlink if needed
+            name = item.name
+            resolved_path = item.resolve()
+
+            # Create display label: strip the variant suffix and replace underscores with hyphens
+            # e.g., gpt_5_medium-baseline -> gpt-5-medium
+            # e.g., sonoma_sky-aggressive -> sonoma-sky-aggressive
+
+            if '-baseline' in name:
+                model_name = name.replace('-baseline', '').replace('_', '-')
+                display_label = f"{model_name}"
+            elif '-aggressive' in name:
+                model_name = name.replace('-aggressive', '').replace('_', '-')
+                display_label = f"{model_name}-aggressive"
+            else:
+                # Fallback for any non-standard naming
+                display_label = name.replace('_', '-')
+
+            experiments[display_label] = str(resolved_path)
+
+    return experiments
+
+def main():
+    print("=== Leaderboard Auto-Discovery Comparison ===\n")
+
+    # Get absolute paths for script location and parent directory
+    script_dir = Path(__file__).resolve().parent
+    parent_dir = script_dir.parent
+
+    # Discover all experiments (using absolute path to leaderboard dir)
+    print("Discovering experiments in leaderboard directory...")
+    experiments = discover_experiments(script_dir)
+
+    if not experiments:
+        print("No experiments found in leaderboard directory")
+        return
+
+    print(f"Found {len(experiments)} experiments:")
+    for label in sorted(experiments.keys()):
+        print(f"  - {label}")
+    print()
+
+    # Create output directory (absolute path)
+    output_dir = script_dir / "leaderboard_comparison"
+    output_dir.mkdir(exist_ok=True)
+    print(f"Output directory: {output_dir}/\n")
+
+    # Import visualization functions from parent directory
+    print("Loading analysis modules...")
+    sys.path.insert(0, str(parent_dir))
+    from analyze_diplomacy_performance_v3_textured import (
+        collect_timing_data_v3, create_stacked_bar_chart_v3,
+        collect_move_type_data, create_move_type_chart,
+        collect_error_data, create_error_chart
+    )
+
+    # Collect and generate timing analysis
+    print("\nCollecting timing data (concurrent operations)...")
+    timing_data = collect_timing_data_v3(experiments)
+    if timing_data:
create_stacked_bar_chart_v3(timing_data, output_dir / "phase_timing_comparison.png") + print(" ✓ Generated phase_timing_comparison.png") + + # Collect and generate move type analysis + print("\nCollecting move type data...") + move_data = collect_move_type_data(experiments) + if move_data: + create_move_type_chart(move_data, output_dir / "move_type_comparison.png") + print(" ✓ Generated move_type_comparison.png") + + # Collect and generate error analysis + print("\nCollecting error data...") + error_data = collect_error_data(experiments) + if error_data: + create_error_chart(error_data, output_dir / "error_comparison.png") + print(" ✓ Generated error_comparison.png") + + # Import additional analysis functions + print("\nGenerating additional visualizations...") + from analyze_diplomacy_performance_v3_textured import ( + collect_france_scores, + plot_france_scores_bar, + plot_france_scores_box, + plot_diplomatic_credit_heatmap, + plot_relative_sentiment + ) + + # Reverse mapping for compatibility with existing functions + path_to_label = {v: k for k, v in experiments.items()} + + # Generate remaining visualizations directly (no subprocess needed) + try: + # Collect France scores once and reuse + print(" Collecting France game scores...") + all_scores_df = collect_france_scores(path_to_label) + + if not all_scores_df.empty: + print(" Generating score visualizations...") + plot_france_scores_bar(all_scores_df, output_dir / "france_scores_bar.png") + plot_france_scores_box(all_scores_df, output_dir / "france_scores_box.png") + print(" ✓ Generated score visualizations") + + # Generate diplomatic heatmaps + print(" Generating diplomatic heatmaps...") + plot_diplomatic_credit_heatmap(path_to_label, "other_to_france", output_dir / "heatmap_other_to_france.png") + plot_diplomatic_credit_heatmap(path_to_label, "france_to_other", output_dir / "heatmap_france_to_other.png") + print(" ✓ Generated heatmaps") + + # Generate relative sentiment chart + print(" Generating relative sentiment chart...") + plot_relative_sentiment(path_to_label, output_dir / "relative_sentiment.png") + print(" ✓ Generated relative sentiment chart") + + print(" ✓ Generated additional visualizations") + + except Exception as e: + print(f" ⚠ Warning: Some visualizations may have failed: {e}") + + # List generated files + files = sorted(output_dir.glob("*.png")) + print(f"\n=== Complete! ===") + print(f"Generated {len(files)} visualizations in {output_dir}/:") + for f in files: + print(f" - {f.name}") + + # Print summary by model family + print("\n=== Experiments by Family ===") + + families = { + 'GPT-5': [], + 'GPT-OSS': [], + 'O-series': [], + 'Gemini': [], + 'Hermes': [], + 'Sonoma': [], + 'Other': [] + } + + for label in sorted(experiments.keys()): + if 'gpt-5' in label.lower(): + families['GPT-5'].append(label) + elif 'gpt-oss' in label.lower(): + families['GPT-OSS'].append(label) + elif label.startswith('o3') or label.startswith('o4'): + families['O-series'].append(label) + elif 'gemini' in label.lower(): + families['Gemini'].append(label) + elif 'hermes' in label.lower(): + families['Hermes'].append(label) + elif 'sonoma' in label.lower(): + families['Sonoma'].append(label) + else: + families['Other'].append(label) + + for family, models in families.items(): + if models: + print(f"\n{family}:") + for model in models: + print(f" - {model}") + + print("\n✓ Leaderboard comparison complete!") + +if __name__ == "__main__": + main()
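+
+# Usage sketch (command taken from BENCHMARK_GUIDE.md; run from the repository root):
+#   python leaderboard/full_comparison.py
+# Discovers every {model_name}-baseline / {model_name}-aggressive entry next to
+# this script and writes the comparison PNGs to leaderboard/leaderboard_comparison/.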