Add leaderboard visualization and documentation files

AlxAI committed 2025-11-17 23:01:25 -05:00
parent aeda029a59
commit 0e66c19b15
4 changed files with 640 additions and 0 deletions

@@ -0,0 +1,74 @@
# Diplomacy Benchmark Guide
## Single Model Benchmark
```bash
# Production: 20 games to 1925
python run_benchmark.py --model_id "openai:gpt-4o" --friendly_name "gpt_4o"
# Test: 3 games to 1901
python run_benchmark.py --model_id "openai:gpt-4o" --friendly_name "test" --test
# Baseline only
python run_benchmark.py --model_id "..." --friendly_name "..." --baseline-only
# Aggressive only
python run_benchmark.py --model_id "..." --friendly_name "..." --aggressive-only
```
## Queue Multiple Models
Edit `run_benchmark_queue.sh`:
```bash
MODELS=(
"model_id|friendly_name|baseline_only|aggressive_only"
"openai:gpt-4o|gpt_4o||" # Both modes
"gemini-2.5-flash|gemini|true|" # Baseline only
"claude-3.5-sonnet|claude||true" # Aggressive only
)
```
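The queue script itself is bash, but the pipe-delimited entry format is easy to picture in code. A minimal Python sketch of how one entry could be parsed (the function name and the `== "true"` convention are assumptions based on the header comment, not the script's actual code):

```python
def parse_queue_entry(entry: str) -> dict:
    # Hypothetical parser mirroring the documented field order:
    # model_id|friendly_name|baseline_only|aggressive_only
    model_id, friendly_name, baseline_only, aggressive_only = entry.split("|")
    return {
        "model_id": model_id,
        "friendly_name": friendly_name,
        "baseline_only": baseline_only == "true",
        "aggressive_only": aggressive_only == "true",
    }

# Entries from the example above: empty fields mean "run both modes"
assert parse_queue_entry("openai:gpt-4o|gpt_4o||")["baseline_only"] is False
assert parse_queue_entry("gemini-2.5-flash|gemini|true|")["baseline_only"] is True
assert parse_queue_entry("claude-3.5-sonnet|claude||true")["aggressive_only"] is True
```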
Run:
```bash
./run_benchmark_queue.sh
```
## Check Status
```bash
# Queue progress
tail -f /tmp/benchmark_queue/queue.log
# Individual model logs
tail -f /tmp/benchmark_queue/{friendly_name}_{baseline|aggressive}.log
# Running games (check FRANCE progress)
ls -la results/{friendly_name}_{baseline|aggressive}/games/*/
```
## Visualize Results
```bash
# Generate comparison plots
python leaderboard/full_comparison.py
# View plots
open leaderboard/*.png
```
## Key Locations
| Item | Path |
|------|------|
| **Results** | `results/{friendly_name}_{baseline\|aggressive}/` |
| **Leaderboard Links** | `leaderboard/{friendly_name}-{baseline\|aggressive}` |
| **Queue Logs** | `/tmp/benchmark_queue/` |
| **Prompts Baseline** | `prompts_benchmark/` |
| **Prompts Aggressive** | `prompts_hold_reduction_v3/` |
## Notes
- Test model plays FRANCE (position 2, 0-indexed)
- Opponents: devstral-small
- Results auto-symlinked to leaderboard/
- Benchmark handles retries, logging, error recovery

leaderboard/METHODOLOGY.md Normal file
@@ -0,0 +1,146 @@
# Diplomacy LLM Benchmark Methodology
## Overview
This benchmark evaluates Large Language Models (LLMs) on their ability to play the strategic board game Diplomacy, with a specific focus on measuring both **performance** (how well the model plays) and **steerability** (how responsive the model is to strategic prompt modifications). Each model plays as France against six identical opponent models (Mistral Devstral-Small) across multiple game iterations.
## Performance Score Calculation
### Game Score Formula
The benchmark uses a modified DiploBench scoring system that rewards both survival and victory:
- **Solo Winner**: `max_year + (max_year - win_year) + 18`
- Where `win_year` is when the power reached 18+ supply centers
- Rewards faster victories with bonus points
- **Survivor** (active at game end): `max_year + final_supply_centers`
- Rewards territorial control for nations that survive to the end
- **Eliminated**: `elimination_year`
- Year of elimination (relative to 1900)
**Key parameters:**
- `max_year`: Maximum game year (default: 1925) minus 1900 = 25
- All years are normalized by subtracting 1900
- Supply centers range from 0-34 (total centers on the board)
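Putting the three cases together, here is an illustrative Python sketch of the formula (the function name and signature are ours, not the benchmark's actual implementation; years are pre-normalized by subtracting 1900):

```python
def game_score(max_year: int, outcome: str, *, win_year: int = 0,
               final_scs: int = 0, elim_year: int = 0) -> int:
    """Sketch of the modified DiploBench score. Years are (calendar year - 1900)."""
    if outcome == "solo_win":
        # Base years + speed bonus + 18 for the solo victory
        return max_year + (max_year - win_year) + 18
    if outcome == "survived":
        # Survivors are credited with their final territorial holdings
        return max_year + final_scs
    # Eliminated: score is simply how long the power lasted
    return elim_year

# With max_year = 25 (game runs to 1925):
assert game_score(25, "solo_win", win_year=20) == 48   # 25 + 5 + 18
assert game_score(25, "survived", final_scs=11) == 36  # 25 + 11
assert game_score(25, "eliminated", elim_year=10) == 10
```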
### Aggregation
Performance is calculated as the **mean game score across all iterations** for the test model playing as France. In the production benchmark:
- 20 game iterations per experiment
- Games run to 1925 (25 years of gameplay)
- Parallel execution for efficiency
### Why These Metrics Matter
The game score formula captures:
- **Territorial expansion**: Higher supply center counts indicate successful negotiation and military strategy
- **Survival**: Avoiding elimination is critical in Diplomacy
- **Victory speed**: Faster solo victories earn bonus points
- **Competitive balance**: Scores are comparable across different game lengths
## Steerability Score Calculation
### Definition
Steerability quantifies how much a model's behavior changes in response to modified strategic prompts. We compare two prompt variants:
1. **Baseline prompts** (`prompts_benchmark`): Standard strategic guidance emphasizing balanced diplomacy
2. **Aggressive prompts** (`prompts_hold_reduction_v3`): Modified prompts encouraging more aggressive territorial expansion and reduced defensive holds
### Calculation Method
```
Steerability = Performance_aggressive - Performance_baseline
```
Where performance is measured by mean game score across all iterations.
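In code form, the calculation is a single difference of means. A sketch (function and variable names are illustrative):

```python
from statistics import mean

def steerability(aggressive_scores, baseline_scores):
    """Difference of mean game scores: positive means aggressive prompts helped."""
    return mean(aggressive_scores) - mean(baseline_scores)

# Example: aggressive prompting lifts the mean game score by 3 points
assert steerability([40, 36, 38], [35, 37, 33]) == 3
```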
### Interpretation
- **Positive steerability** (+): Model performs better with aggressive prompts, indicating successful behavioral adaptation
- **Negative steerability** (−): Model performs worse with aggressive prompts, suggesting either:
- Failure to adapt to prompt modifications
- Over-aggressive behavior that undermines diplomatic strategy
- **Near-zero steerability** (~0): Model behavior is largely invariant to prompt changes
### Why Steerability Matters
Steerability is a critical but under-explored dimension of LLM capability:
- **Controllability**: Models should adapt their strategy when users provide different instructions
- **Prompt sensitivity**: Measures whether models genuinely understand strategic guidance vs. memorized patterns
- **Real-world utility**: Production systems need models that can be steered toward desired behaviors
- **Safety implications**: Models that can't be steered may be difficult to align or control
## Experimental Setup
### Game Configuration
- **Test position**: France (3rd position in 7-nation setup)
- **Opponent model**: Mistral Devstral-Small (all 6 opposing nations)
- **Max year**: 1925 (production) / 1901 (test mode)
- **Iterations**: 20 games per experiment (production) / 3 games (test mode)
- **Parallel workers**: 20 concurrent games (production) / 3 (test mode)
### Prompt Variants
- **Baseline**: `ai_diplomacy/prompts/prompts_benchmark`
- Standard diplomatic and strategic guidance
- Balanced approach to offense and defense
- **Aggressive**: `ai_diplomacy/prompts/prompts_hold_reduction_v3`
- Emphasis on territorial expansion
- Reduction in defensive hold orders
- More assertive negotiation stance
### Test Model Configuration
- Models are inserted at position 2 (France) in the power order
- Order: Austria, England, **France**, Germany, Italy, Russia, Turkey
- All opponent positions use the same baseline opponent model
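One way to picture this assignment (the list and function names here are illustrative, not the repo's actual code):

```python
POWER_ORDER = ["AUSTRIA", "ENGLAND", "FRANCE", "GERMANY", "ITALY", "RUSSIA", "TURKEY"]
TEST_POSITION = 2  # 0-indexed slot for FRANCE

def assign_models(test_model: str, opponent_model: str) -> dict:
    # The test model occupies France; every other power gets the opponent model
    return {power: (test_model if i == TEST_POSITION else opponent_model)
            for i, power in enumerate(POWER_ORDER)}

assignments = assign_models("openai:gpt-4o", "devstral-small")
assert assignments["FRANCE"] == "openai:gpt-4o"
assert assignments["GERMANY"] == "devstral-small"
```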
## Data Collection
### Per-Game Metrics
For each game iteration, the system records:
- Supply center progression by phase
- Order types and success rates (moves, holds, supports, convoys)
- Diplomatic messaging patterns
- LLM response statistics and errors
- Game outcome and final rankings
### Aggregated Analysis
The `statistical_game_analysis` module produces:
- **Game-level CSV**: Per-game summaries with 88 metrics including game score, final supply centers, order statistics, and messaging patterns
- **Phase-level CSV**: Turn-by-turn state including supply centers, military units, relationships, and sentiment
- **Combined analysis**: Aggregated statistics across all game iterations
### Performance Tracking
- Console logs for each benchmark run
- Metadata files documenting model IDs and configuration
- Symlinks in `leaderboard/` directory for experiment discovery
- Automated comparison visualizations
## Limitations and Considerations
### Statistical Validity
- Sample size: 20 games per condition provides reasonable statistical power but may not capture rare outcomes
- Opponent variance: Using a single opponent model reduces confounding factors but may not generalize to diverse opponents
### Prompt Engineering
- Steerability measurement is sensitive to prompt design quality
- "Aggressive" may not be the optimal prompt modification for all models
- Some models may benefit from different strategic guidance
### Game Complexity
- Diplomacy involves significant stochasticity from opponent behavior
- Alliance formation and betrayal introduce non-deterministic outcomes
- Long game duration (1901-1925) increases variance in results
### Measurement Scope
- Benchmark focuses on win condition (supply centers) rather than other strategic dimensions
- Does not explicitly measure negotiation quality, deception capability, or long-term planning
- Performance as a single nation (France) may not generalize to other starting positions
### Technical Constraints
- Model errors and API failures can affect game completion rates
- Longer generation times may indicate different reasoning patterns but are not directly scored
- Prompt formatting differences across model providers may introduce artifacts

@@ -0,0 +1,217 @@
# Scoring Methodology Explanation
This document explains the two different scoring systems used in the AI Diplomacy benchmark and how they are applied across different metrics and visualizations.
## Two Scoring Systems Used
### 1. Raw Supply Center Count
**What it is:** The simple count of supply centers a power controls at the end of the game.
**How it's calculated:**
- Directly counts the number of supply centers (territories that can produce units) owned by a power when the game ends
- Range: 0-34 supply centers (theoretical maximum, though 18 is winning threshold)
- No bonuses or penalties applied
**Example:**
- France ends game with 11 supply centers → Raw SC score = 11
- France eliminated with 0 supply centers → Raw SC score = 0
- France wins solo with 18+ supply centers → Raw SC score = 18-34
**When it's used:**
- Supplementary metric in both CSV files for transparency
- Useful for understanding territorial control independent of game outcome timing
- Easier to interpret for quick comparisons
### 2. Custom Game Score (DiploBench-style)
**What it is:** A sophisticated scoring system that rewards survival, victory speed, and penalizes early elimination.
**How it's calculated:**
The formula depends on the game outcome:
#### Case 1: Solo Winner (18+ supply centers)
```
score = max_year + (max_year - win_year) + 18
```
- `max_year`: Maximum game year (typically 1925) minus 1900 = 25
- `win_year`: Year when 18 supply centers were reached minus 1900
- The `(max_year - win_year)` term rewards faster victories
- The `+ 18` bonus represents the solo win achievement
**Example:**
- Game ends in 1925 (max_year = 25)
- France wins solo in 1920 (win_year = 20)
- Score = 25 + (25 - 20) + 18 = 25 + 5 + 18 = **48 points**
- If France won in 1915 instead: 25 + 10 + 18 = **53 points** (faster win = higher score)
#### Case 2: Survivor (active at game end, no solo winner)
```
score = max_year + final_supply_centers
```
**Example:**
- Game ends in 1925 (max_year = 25)
- France survives with 11 supply centers
- Score = 25 + 11 = **36 points**
- If France had 8 supply centers: 25 + 8 = **33 points**
#### Case 3: Eliminated (0 supply centers before game end)
```
score = elimination_year
```
- `elimination_year`: Year of elimination minus 1900
**Example:**
- France eliminated in 1910
- Score = 1910 - 1900 = **10 points**
- If eliminated in 1905: score = **5 points** (earlier elimination = lower score)
#### Case 4: Lost to Solo Winner (still had units but someone else won)
```
score = win_year
```
- The losing power's score is the year of the solo victory (minus 1900); the value resembles an elimination-year score but the outcome is tracked separately
**Key Properties:**
- **Winning range:** Typically 42-70+ points (varies by win speed)
- **Survival range:** Typically 28-42 points (25 + low SCs to 25 + 17 SCs)
- **Elimination range:** Typically 5-24 points (early elimination to late elimination)
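The four cases can be collected into one function; the sketch below mirrors the formulas above and checks them against the worked examples (this is our illustration, not the actual code in `experiment_runner/analysis/summary.py`):

```python
def game_score(outcome: str, max_year: int = 25, *, win_year: int = 0,
               final_scs: int = 0, elim_year: int = 0) -> int:
    """Sketch of the DiploBench-style score. Years are (calendar year - 1900)."""
    if outcome == "solo_win":
        return max_year + (max_year - win_year) + 18
    if outcome == "survived":
        return max_year + final_scs
    if outcome == "eliminated":
        return elim_year
    if outcome == "lost_to_solo":
        return win_year  # year someone else won solo
    raise ValueError(f"unknown outcome: {outcome}")

# Worked examples from above (max_year = 25):
assert game_score("solo_win", win_year=20) == 48
assert game_score("solo_win", win_year=15) == 53   # faster win = higher score
assert game_score("survived", final_scs=11) == 36
assert game_score("survived", final_scs=8) == 33
assert game_score("eliminated", elim_year=10) == 10
assert game_score("eliminated", elim_year=5) == 5
```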
## Which Score is Used Where
### France Bar Chart Visualization (`france_scores_bar.png`)
**Score used:** Custom Game Score (DiploBench-style)
**Rationale:**
- The bar chart is titled "Average Game Score" and uses the `game_score` column
- This provides a more nuanced view of performance that accounts for survival time and victory quality
- Better differentiates between "survived weakly" vs "eliminated late" vs "won fast"
**Code reference:** `/Users/alxdfy/Documents/mldev/AI_Diplomacy/analyze_diplomacy_performance_v3_textured.py`, lines 595-596
### Overall Performance Leaderboard (`overall_performance.csv`)
**Primary metric:** Custom Game Score (columns: `france_mean_score`, `france_median_score`)
**Secondary metric:** Raw Supply Centers (columns: `raw_supply_centers_mean`, `raw_supply_centers_median`)
**Why both are included:**
- **Game Score** is the primary ranking metric because it better captures overall gameplay quality
- **Raw Supply Centers** provides context and transparency about territorial control
- Together they give a complete picture: a model might have high territorial control but poor survival time, or vice versa
**Example interpretation:**
- Model A: game_score = 45, raw_sc = 15 → Likely survived to 1925 with moderate territory
- Model B: game_score = 15, raw_sc = 8 → Likely eliminated early despite having reasonable territory at that point
- Model C: game_score = 55, raw_sc = 19 → Likely won solo or dominated late game
### Steerability Leaderboard (`steerability.csv`)
**Primary metric:** Custom Game Score difference (columns: `steerability_score`, `steerability_percentage`)
**Secondary metric:** Raw Supply Center difference (columns: `steerability_score_raw`, `steerability_percentage_raw`)
**Why both are included:**
- **Game Score steerability** measures the true impact of aggressive prompting on overall performance
- **Raw SC steerability** shows the pure territorial control difference
- Models can be steerable in different ways:
- High game score steerability + low raw SC steerability → Better survival/timing, not just territory
- High raw SC steerability + lower game score steerability → More territory but possibly worse timing
**Example interpretation:**
- Model X: steerability_score = +15, steerability_score_raw = +10
- Aggressive prompting adds 15 game score points and 10 supply centers
- The difference (15 vs 10) suggests better survival timing in addition to more territory
- Model Y: steerability_score = -5, steerability_score_raw = -2
- Aggressive prompting actually hurts performance (negative steerability)
- Loses 5 game score points and 2 supply centers
- Model performs better with neutral/baseline prompting
## Why Both Scores Matter
### Complementary Insights
1. **Game Score** captures:
- Victory quality (how fast)
- Survival duration (eliminated when)
- Overall strategic success
- Risk-reward tradeoffs
2. **Raw Supply Centers** captures:
- Territorial expansion ability
- Pure diplomatic/military success
- Easier to interpret and compare
- Independent of timing considerations
### Research Value
Having both metrics allows researchers to:
- **Identify interesting patterns:** A model might be excellent at gaining territory (high raw SC) but poor at survival (low game score)
- **Understand steerability mechanisms:** Does aggressive prompting lead to more territory, better timing, or both?
- **Compare fairly:** Different use cases might prioritize different aspects of performance
- **Validate results:** If both metrics agree, it strengthens confidence in the findings
### Example Use Cases
**Use Case 1: Deployment Decision**
- If you want a model for long games → prioritize `game_score` (measures endurance)
- If you want a model for territorial expansion → consider `raw_supply_centers` (measures conquest)
**Use Case 2: Steerability Analysis**
- High `steerability_score` but low `steerability_score_raw` → Model becomes more strategic with timing
- Similar values → Steerability primarily affects territorial control
- Negative values → Model responds poorly to aggressive prompting
## Column Definitions
### overall_performance.csv
| Column | Definition |
|--------|------------|
| `model_name` | Base name of the model being benchmarked |
| `best_variant` | Which variant (baseline or aggressive) performed better |
| `france_mean_score` | Average Custom Game Score across all games (primary ranking metric) |
| `france_median_score` | Median Custom Game Score across all games (robust to outliers) |
| `raw_supply_centers_mean` | Average raw supply center count at game end |
| `raw_supply_centers_median` | Median raw supply center count at game end |
| `france_win_rate` | Percentage of games won (18+ supply centers) |
| `total_games` | Number of games played |
| `avg_phase_time_minutes` | Average time per game phase in minutes |
| `error_rate` | Percentage of API/LLM errors during gameplay |
**Sorting:** Descending by `france_mean_score` (higher is better)
### steerability.csv
| Column | Definition |
|--------|------------|
| `model_name` | Base name of the model being analyzed |
| `baseline_mean_score` | Average Custom Game Score for baseline (neutral) prompting |
| `aggressive_mean_score` | Average Custom Game Score for aggressive prompting |
| `baseline_raw_supply_centers` | Average raw supply centers for baseline prompting |
| `aggressive_raw_supply_centers` | Average raw supply centers for aggressive prompting |
| `steerability_score` | Difference in Custom Game Score (aggressive - baseline) |
| `steerability_percentage` | Percentage change in Custom Game Score ((aggressive - baseline) / baseline * 100) |
| `steerability_score_raw` | Difference in raw supply centers (aggressive - baseline) |
| `steerability_percentage_raw` | Percentage change in raw supply centers ((aggressive - baseline) / baseline * 100) |
| `direction` | "positive" if aggressive improves performance, "negative" if it hurts |
| `baseline_games` | Number of baseline games played |
| `aggressive_games` | Number of aggressive games played |
**Sorting:** Descending by `steerability_score` (higher positive values = more steerable toward aggression)
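A sketch of how the core steerability columns could be derived from the per-variant means (column names match the table above; the helper function itself is hypothetical):

```python
def steerability_row(baseline_mean: float, aggressive_mean: float) -> dict:
    # Difference and percentage change, aggressive relative to baseline
    diff = aggressive_mean - baseline_mean
    return {
        "steerability_score": diff,
        "steerability_percentage": diff / baseline_mean * 100,
        "direction": "positive" if diff > 0 else "negative",
    }

row = steerability_row(baseline_mean=30.0, aggressive_mean=33.0)
assert row["steerability_score"] == 3.0
assert row["steerability_percentage"] == 10.0
assert row["direction"] == "positive"
```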
## Summary
- **Primary Metric:** Custom Game Score (DiploBench-style) - used for all rankings and bar chart
- **Secondary Metric:** Raw Supply Centers - provided for transparency and complementary analysis
- **France Bar Chart:** Uses Custom Game Score
- **Steerability:** Measures both metrics to understand different aspects of prompt influence
- **Both metrics together** provide a complete picture of model performance and behavior
For questions about the scoring implementation, see:
- Game score calculation: `/Users/alxdfy/Documents/mldev/AI_Diplomacy/experiment_runner/analysis/summary.py` (lines 77-118)
- Data collection: `/Users/alxdfy/Documents/mldev/AI_Diplomacy/analyze_diplomacy_performance_v3_textured.py` (lines 577-611)

leaderboard/full_comparison.py Executable file
@@ -0,0 +1,203 @@
#!/usr/bin/env python3
"""
Auto-discovery leaderboard comparison script.

Automatically discovers all experiments in the leaderboard/ directory,
groups them by model name (baseline vs aggressive variants), and generates
comprehensive comparison visualizations.

Naming convention: {model_name}-baseline and {model_name}-aggressive
Example: gpt_5_medium-baseline, gpt_5_medium-aggressive
"""
from pathlib import Path
import sys


def discover_experiments(leaderboard_dir="."):
    """
    Discover all experiments in the leaderboard directory.

    Returns:
        dict: Mapping of display labels to absolute paths
    """
    leaderboard_path = Path(leaderboard_dir)
    if not leaderboard_path.exists():
        print(f"Error: {leaderboard_dir} directory not found")
        sys.exit(1)

    experiments = {}
    # Scan for all directories/symlinks in leaderboard folder
    for item in sorted(leaderboard_path.iterdir()):
        # Skip the script itself, output directory, temp files, and log files
        if item.name in ['full_comparison.py', 'leaderboard_comparison', 'temp_leaderboard_paths.json'] or item.name.endswith('.log'):
            continue
        if item.is_dir() or item.is_symlink():
            # Get the name and resolve symlink if needed
            name = item.name
            resolved_path = item.resolve()
            # Create display label: strip the variant suffix and replace
            # underscores with hyphens; baseline variants keep the bare model
            # name, aggressive variants keep an explicit "-aggressive" suffix.
            # e.g., gpt_5_medium-baseline  -> gpt-5-medium
            # e.g., sonoma_sky-aggressive  -> sonoma-sky-aggressive
            if '-baseline' in name:
                model_name = name.replace('-baseline', '').replace('_', '-')
                display_label = f"{model_name}"
            elif '-aggressive' in name:
                model_name = name.replace('-aggressive', '').replace('_', '-')
                display_label = f"{model_name}-aggressive"
            else:
                # Fallback for any non-standard naming
                display_label = name.replace('_', '-')
            experiments[display_label] = str(resolved_path)
    return experiments


def main():
    print("=== Leaderboard Auto-Discovery Comparison ===\n")

    # Get absolute paths for script location and parent directory
    script_dir = Path(__file__).resolve().parent
    parent_dir = script_dir.parent

    # Discover all experiments (using absolute path to leaderboard dir)
    print("Discovering experiments in leaderboard directory...")
    experiments = discover_experiments(script_dir)
    if not experiments:
        print("No experiments found in leaderboard directory")
        return

    print(f"Found {len(experiments)} experiments:")
    for label in sorted(experiments.keys()):
        print(f"  - {label}")
    print()

    # Create output directory (absolute path)
    output_dir = script_dir / "leaderboard_comparison"
    output_dir.mkdir(exist_ok=True)
    print(f"Output directory: {output_dir}/\n")

    # Import visualization functions from parent directory
    print("Loading analysis modules...")
    sys.path.insert(0, str(parent_dir))
    from analyze_diplomacy_performance_v3_textured import (
        collect_timing_data_v3, create_stacked_bar_chart_v3,
        collect_move_type_data, create_move_type_chart,
        collect_error_data, create_error_chart
    )

    # Collect and generate timing analysis
    print("\nCollecting timing data (concurrent operations)...")
    timing_data = collect_timing_data_v3(experiments)
    if timing_data:
        create_stacked_bar_chart_v3(timing_data, output_dir / "phase_timing_comparison.png")
        print("  ✓ Generated phase_timing_comparison.png")

    # Collect and generate move type analysis
    print("\nCollecting move type data...")
    move_data = collect_move_type_data(experiments)
    if move_data:
        create_move_type_chart(move_data, output_dir / "move_type_comparison.png")
        print("  ✓ Generated move_type_comparison.png")

    # Collect and generate error analysis
    print("\nCollecting error data...")
    error_data = collect_error_data(experiments)
    if error_data:
        create_error_chart(error_data, output_dir / "error_comparison.png")
        print("  ✓ Generated error_comparison.png")

    # Import additional analysis functions
    print("\nGenerating additional visualizations...")
    from analyze_diplomacy_performance_v3_textured import (
        collect_france_scores,
        plot_france_scores_bar,
        plot_france_scores_box,
        plot_diplomatic_credit_heatmap,
        plot_relative_sentiment
    )

    # Reverse mapping for compatibility with existing functions
    path_to_label = {v: k for k, v in experiments.items()}

    # Generate remaining visualizations directly (no subprocess needed)
    try:
        # Collect France scores once and reuse
        print("  Collecting France game scores...")
        all_scores_df = collect_france_scores(path_to_label)
        if not all_scores_df.empty:
            print("  Generating score visualizations...")
            plot_france_scores_bar(all_scores_df, output_dir / "france_scores_bar.png")
            plot_france_scores_box(all_scores_df, output_dir / "france_scores_box.png")
            print("  ✓ Generated score visualizations")

        # Generate diplomatic heatmaps
        print("  Generating diplomatic heatmaps...")
        plot_diplomatic_credit_heatmap(path_to_label, "other_to_france", output_dir / "heatmap_other_to_france.png")
        plot_diplomatic_credit_heatmap(path_to_label, "france_to_other", output_dir / "heatmap_france_to_other.png")
        print("  ✓ Generated heatmaps")

        # Generate relative sentiment chart
        print("  Generating relative sentiment chart...")
        plot_relative_sentiment(path_to_label, output_dir / "relative_sentiment.png")
        print("  ✓ Generated relative sentiment chart")
        print("  ✓ Generated additional visualizations")
    except Exception as e:
        print(f"  ⚠ Warning: Some visualizations may have failed: {e}")

    # List generated files
    files = sorted(output_dir.glob("*.png"))
    print("\n=== Complete! ===")
    print(f"Generated {len(files)} visualizations in {output_dir}/:")
    for f in files:
        print(f"  - {f.name}")

    # Print summary by model family
    print("\n=== Experiments by Family ===")
    families = {
        'GPT-5': [],
        'GPT-OSS': [],
        'O-series': [],
        'Gemini': [],
        'Hermes': [],
        'Sonoma': [],
        'Other': []
    }
    for label in sorted(experiments.keys()):
        if 'gpt-5' in label.lower():
            families['GPT-5'].append(label)
        elif 'gpt-oss' in label.lower():
            families['GPT-OSS'].append(label)
        elif label.startswith('o3') or label.startswith('o4'):
            families['O-series'].append(label)
        elif 'gemini' in label.lower():
            families['Gemini'].append(label)
        elif 'hermes' in label.lower():
            families['Hermes'].append(label)
        elif 'sonoma' in label.lower():
            families['Sonoma'].append(label)
        else:
            families['Other'].append(label)
    for family, models in families.items():
        if models:
            print(f"\n{family}:")
            for model in models:
                print(f"  - {model}")

    print("\n✓ Leaderboard comparison complete!")


if __name__ == "__main__":
    main()