mirror of https://github.com/GoodStartLabs/AI_Diplomacy.git
synced 2026-04-19 12:58:09 +00:00

Add leaderboard visualization and documentation files

parent aeda029a59
commit 0e66c19b15

4 changed files with 640 additions and 0 deletions
74	leaderboard/BENCHMARK_GUIDE.md	Normal file

@@ -0,0 +1,74 @@
# Diplomacy Benchmark Guide

## Single Model Benchmark

```bash
# Production: 20 games to 1925
python run_benchmark.py --model_id "openai:gpt-4o" --friendly_name "gpt_4o"

# Test: 3 games to 1901
python run_benchmark.py --model_id "openai:gpt-4o" --friendly_name "test" --test

# Baseline only
python run_benchmark.py --model_id "..." --friendly_name "..." --baseline-only

# Aggressive only
python run_benchmark.py --model_id "..." --friendly_name "..." --aggressive-only
```

## Queue Multiple Models

Edit `run_benchmark_queue.sh`:

```bash
MODELS=(
    "model_id|friendly_name|baseline_only|aggressive_only"
    "openai:gpt-4o|gpt_4o||"          # Both modes
    "gemini-2.5-flash|gemini|true|"   # Baseline only
    "claude-3.5-sonnet|claude||true"  # Aggressive only
)
```
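Each queue entry packs four pipe-delimited fields. As a sketch (assuming the queue script splits entries on `|` via `IFS`; the exact parsing and variable names inside `run_benchmark_queue.sh` may differ), one entry could be unpacked like this:

```shell
# Hypothetical parsing of one queue entry; the field names mirror the
# header entry above, but the real script may name them differently.
entry="gemini-2.5-flash|gemini|true|"
IFS='|' read -r model_id friendly_name baseline_only aggressive_only <<< "$entry"
echo "model: $model_id, name: $friendly_name, baseline_only: $baseline_only"
```

An empty field (as in the trailing `|`) simply leaves that variable empty, which is how "both modes" entries work.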
Run:

```bash
./run_benchmark_queue.sh
```

## Check Status

```bash
# Queue progress
tail -f /tmp/benchmark_queue/queue.log

# Individual model logs
tail -f /tmp/benchmark_queue/{friendly_name}_{baseline|aggressive}.log

# Running games (check FRANCE progress)
ls -la results/{friendly_name}_{baseline|aggressive}/games/*/
```

## Visualize Results

```bash
# Generate comparison plots
python leaderboard/full_comparison.py

# View plots
open leaderboard/*.png
```

## Key Locations

| Item | Path |
|------|------|
| **Results** | `results/{friendly_name}_{baseline\|aggressive}/` |
| **Leaderboard Links** | `leaderboard/{friendly_name}-{baseline\|aggressive}` |
| **Queue Logs** | `/tmp/benchmark_queue/` |
| **Prompts Baseline** | `prompts_benchmark/` |
| **Prompts Aggressive** | `prompts_hold_reduction_v3/` |

## Notes

- Test model plays FRANCE (position 2, 0-indexed)
- Opponents: devstral-small
- Results auto-symlinked to leaderboard/
- Benchmark handles retries, logging, error recovery
146	leaderboard/METHODOLOGY.md	Normal file

@@ -0,0 +1,146 @@
# Diplomacy LLM Benchmark Methodology

## Overview

This benchmark evaluates Large Language Models (LLMs) on their ability to play the strategic board game Diplomacy, with a specific focus on measuring both **performance** (how well the model plays) and **steerability** (how responsive the model is to strategic prompt modifications). Each model plays as France against six identical opponent models (Mistral Devstral-Small) across multiple game iterations.

## Performance Score Calculation

### Game Score Formula

The benchmark uses a modified DiploBench scoring system that rewards both survival and victory:

- **Solo Winner**: `max_year + (max_year - win_year) + 18`
  - Where `win_year` is when the power reached 18+ supply centers
  - Rewards faster victories with bonus points

- **Survivor** (active at game end): `max_year + final_supply_centers`
  - Rewards territorial control for nations that survive to the end

- **Eliminated**: `elimination_year`
  - Year of elimination (relative to 1900)

**Key parameters:**

- `max_year`: Maximum game year (default: 1925) minus 1900 = 25
- All years are normalized by subtracting 1900
- Supply centers range from 0-34 (the total number of centers on the board)
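The rules above can be sketched in Python (an illustrative restatement of the formula, not the benchmark's actual code):

```python
# Illustrative sketch of the game-score formula described above; the
# benchmark's real implementation may differ in structure.
def game_score(outcome, max_year=25, win_year=None,
               final_supply_centers=None, elimination_year=None):
    """All year arguments are normalized (calendar year minus 1900)."""
    if outcome == "solo_win":
        return max_year + (max_year - win_year) + 18
    if outcome == "survived":
        return max_year + final_supply_centers
    if outcome == "eliminated":
        return elimination_year
    raise ValueError(f"unknown outcome: {outcome!r}")

print(game_score("solo_win", win_year=20))              # 48
print(game_score("survived", final_supply_centers=11))  # 36
print(game_score("eliminated", elimination_year=10))    # 10
```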
### Aggregation

Performance is calculated as the **mean game score across all iterations** for the test model playing as France. In the production benchmark:

- 20 game iterations per experiment
- Games run to 1925 (25 years of gameplay)
- Parallel execution for efficiency

### Why These Metrics Matter

The game score formula captures:

- **Territorial expansion**: Higher supply center counts indicate successful negotiation and military strategy
- **Survival**: Avoiding elimination is critical in Diplomacy
- **Victory speed**: Faster solo victories earn bonus points
- **Competitive balance**: Scores are comparable across different game lengths

## Steerability Score Calculation

### Definition

Steerability quantifies how much a model's behavior changes in response to modified strategic prompts. We compare two prompt variants:

1. **Baseline prompts** (`prompts_benchmark`): Standard strategic guidance emphasizing balanced diplomacy
2. **Aggressive prompts** (`prompts_hold_reduction_v3`): Modified prompts encouraging more aggressive territorial expansion and reduced defensive holds

### Calculation Method

```
Steerability = Performance_aggressive - Performance_baseline
```

where performance is measured by mean game score across all iterations.
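Concretely, with hypothetical per-game score lists for the two variants, the steerability score is just a difference of means:

```python
from statistics import mean

# Hypothetical per-game France scores for the two prompt variants
# (illustrative values, not benchmark results).
baseline_scores = [36, 33, 10, 48, 25]
aggressive_scores = [41, 36, 15, 48, 30]

steerability = mean(aggressive_scores) - mean(baseline_scores)
print(f"steerability = {steerability:+.1f}")  # steerability = +3.6
```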
### Interpretation

- **Positive steerability** (+): The model performs better with aggressive prompts, indicating successful behavioral adaptation
- **Negative steerability** (−): The model performs worse with aggressive prompts, suggesting either:
  - Failure to adapt to prompt modifications
  - Over-aggressive behavior that undermines diplomatic strategy
- **Near-zero steerability** (~0): Model behavior is largely invariant to prompt changes

### Why Steerability Matters

Steerability is a critical but under-explored dimension of LLM capability:

- **Controllability**: Models should adapt their strategy when users provide different instructions
- **Prompt sensitivity**: Measures whether models genuinely understand strategic guidance rather than reproducing memorized patterns
- **Real-world utility**: Production systems need models that can be steered toward desired behaviors
- **Safety implications**: Models that can't be steered may be difficult to align or control

## Experimental Setup

### Game Configuration

- **Test position**: France (3rd position in the 7-nation setup)
- **Opponent model**: Mistral Devstral-Small (all 6 opposing nations)
- **Max year**: 1925 (production) / 1901 (test mode)
- **Iterations**: 20 games per experiment (production) / 3 games (test mode)
- **Parallel workers**: 20 concurrent games (production) / 3 (test mode)

### Prompt Variants

- **Baseline**: `ai_diplomacy/prompts/prompts_benchmark`
  - Standard diplomatic and strategic guidance
  - Balanced approach to offense and defense

- **Aggressive**: `ai_diplomacy/prompts/prompts_hold_reduction_v3`
  - Emphasis on territorial expansion
  - Reduction in defensive hold orders
  - More assertive negotiation stance

### Test Model Configuration

- Models are inserted at position 2 (0-indexed) in the power order, i.e. France
- Order: Austria, England, **France**, Germany, Italy, Russia, Turkey
- All opponent positions use the same baseline opponent model

## Data Collection

### Per-Game Metrics

For each game iteration, the system records:

- Supply center progression by phase
- Order types and success rates (moves, holds, supports, convoys)
- Diplomatic messaging patterns
- LLM response statistics and errors
- Game outcome and final rankings

### Aggregated Analysis

The `statistical_game_analysis` module produces:

- **Game-level CSV**: Per-game summaries with 88 metrics, including game score, final supply centers, order statistics, and messaging patterns
- **Phase-level CSV**: Turn-by-turn state including supply centers, military units, relationships, and sentiment
- **Combined analysis**: Aggregated statistics across all game iterations

### Performance Tracking

- Console logs for each benchmark run
- Metadata files documenting model IDs and configuration
- Symlinks in the `leaderboard/` directory for experiment discovery
- Automated comparison visualizations

## Limitations and Considerations

### Statistical Validity

- Sample size: 20 games per condition provides reasonable statistical power but may not capture rare outcomes
- Opponent variance: Using a single opponent model reduces confounding factors but may not generalize to diverse opponents

### Prompt Engineering

- Steerability measurement is sensitive to prompt design quality
- "Aggressive" may not be the optimal prompt modification for all models
- Some models may benefit from different strategic guidance

### Game Complexity

- Diplomacy involves significant stochasticity from opponent behavior
- Alliance formation and betrayal introduce non-deterministic outcomes
- Long game duration (1901-1925) increases variance in results

### Measurement Scope

- The benchmark focuses on the win condition (supply centers) rather than other strategic dimensions
- It does not explicitly measure negotiation quality, deception capability, or long-term planning
- Performance as a single nation (France) may not generalize to other starting positions

### Technical Constraints

- Model errors and API failures can affect game completion rates
- Longer generation times may indicate different reasoning patterns but are not directly scored
- Prompt formatting differences across model providers may introduce artifacts
217	leaderboard/SCORING_EXPLAINED.md	Normal file

@@ -0,0 +1,217 @@
# Scoring Methodology Explanation

This document explains the two scoring systems used in the AI Diplomacy benchmark and how they are applied across the different metrics and visualizations.

## Two Scoring Systems Used

### 1. Raw Supply Center Count

**What it is:** The simple count of supply centers a power controls at the end of the game.

**How it's calculated:**

- Directly counts the number of supply centers (territories that can produce units) owned by a power when the game ends
- Range: 0-34 supply centers (the theoretical maximum; 18 is the winning threshold)
- No bonuses or penalties applied

**Example:**

- France ends the game with 11 supply centers → Raw SC score = 11
- France eliminated with 0 supply centers → Raw SC score = 0
- France wins solo with 18+ supply centers → Raw SC score = 18-34

**When it's used:**

- Supplementary metric in both CSV files for transparency
- Useful for understanding territorial control independent of game outcome timing
- Easier to interpret for quick comparisons

### 2. Custom Game Score (DiploBench-style)

**What it is:** A scoring system that rewards survival and victory speed and penalizes early elimination.

**How it's calculated:**

The formula depends on the game outcome:

#### Case 1: Solo Winner (18+ supply centers)

```
score = max_year + (max_year - win_year) + 18
```

- `max_year`: Maximum game year (typically 1925) minus 1900 = 25
- `win_year`: Year when 18 supply centers were reached, minus 1900
- The `(max_year - win_year)` term rewards faster victories
- The `+ 18` bonus represents the solo win achievement

**Example:**

- Game ends in 1925 (max_year = 25)
- France wins solo in 1920 (win_year = 20)
- Score = 25 + (25 - 20) + 18 = 25 + 5 + 18 = **48 points**
- If France won in 1915 instead: 25 + 10 + 18 = **53 points** (faster win = higher score)

#### Case 2: Survivor (active at game end, no solo winner)

```
score = max_year + final_supply_centers
```

**Example:**

- Game ends in 1925 (max_year = 25)
- France survives with 11 supply centers
- Score = 25 + 11 = **36 points**
- If France had 8 supply centers: 25 + 8 = **33 points**

#### Case 3: Eliminated (0 supply centers before game end)

```
score = elimination_year
```

- `elimination_year`: Year of elimination minus 1900

**Example:**

- France eliminated in 1910
- Score = 1910 - 1900 = **10 points**
- If eliminated in 1905: score = **5 points** (earlier elimination = lower score)

#### Case 4: Lost to Solo Winner (still had units, but someone else won)

```
score = win_year
```

- Same value as an elimination in the winning year, but marked differently

**Key Properties:**

- **Winning range:** Typically 42-70+ points (varies by win speed)
- **Survival range:** Typically 28-42 points (25 + few SCs up to 25 + 17 SCs)
- **Elimination range:** Typically 5-24 points (early elimination to late elimination)
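The four cases can be collected into one sketch (illustrative only; argument names are assumptions, and the benchmark's actual implementation lives in `experiment_runner/analysis/summary.py`):

```python
# Illustrative scorer covering the four cases above. All year arguments
# are normalized (calendar year minus 1900).
def game_score(max_year=25, win_year=None, lost_to_solo_year=None,
               elimination_year=None, final_supply_centers=0):
    if win_year is not None:            # Case 1: solo winner
        return max_year + (max_year - win_year) + 18
    if lost_to_solo_year is not None:   # Case 4: lost to a solo winner
        return lost_to_solo_year
    if elimination_year is not None:    # Case 3: eliminated
        return elimination_year
    return max_year + final_supply_centers  # Case 2: survivor

print(game_score(win_year=20))              # 48 (solo win in 1920)
print(game_score(win_year=15))              # 53 (faster win, higher score)
print(game_score(final_supply_centers=11))  # 36 (survivor with 11 centers)
print(game_score(elimination_year=10))      # 10 (eliminated in 1910)
```

The printed values reproduce the worked examples in the text.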
## Which Score is Used Where

### France Bar Chart Visualization (`france_scores_bar.png`)

**Score used:** Custom Game Score (DiploBench-style)

**Rationale:**

- The bar chart is titled "Average Game Score" and uses the `game_score` column
- This provides a more nuanced view of performance that accounts for survival time and victory quality
- It better differentiates between "survived weakly", "eliminated late", and "won fast"

**Code reference:** `/Users/alxdfy/Documents/mldev/AI_Diplomacy/analyze_diplomacy_performance_v3_textured.py`, lines 595-596

### Overall Performance Leaderboard (`overall_performance.csv`)

**Primary metric:** Custom Game Score (columns: `france_mean_score`, `france_median_score`)

**Secondary metric:** Raw Supply Centers (columns: `raw_supply_centers_mean`, `raw_supply_centers_median`)

**Why both are included:**

- **Game Score** is the primary ranking metric because it better captures overall gameplay quality
- **Raw Supply Centers** provides context and transparency about territorial control
- Together they give a complete picture: a model might have high territorial control but poor survival time, or vice versa

**Example interpretation:**

- Model A: game_score = 45, raw_sc = 15 → Likely survived to 1925 with moderate territory
- Model B: game_score = 15, raw_sc = 8 → Likely eliminated early despite having reasonable territory at that point
- Model C: game_score = 55, raw_sc = 19 → Likely won solo or dominated the late game

### Steerability Leaderboard (`steerability.csv`)

**Primary metric:** Custom Game Score difference (columns: `steerability_score`, `steerability_percentage`)

**Secondary metric:** Raw Supply Center difference (columns: `steerability_score_raw`, `steerability_percentage_raw`)

**Why both are included:**

- **Game Score steerability** measures the true impact of aggressive prompting on overall performance
- **Raw SC steerability** shows the pure territorial control difference
- Models can be steerable in different ways:
  - High game score steerability + low raw SC steerability → Better survival/timing, not just territory
  - High raw SC steerability + lower game score steerability → More territory but possibly worse timing

**Example interpretation:**

- Model X: steerability_score = +15, steerability_score_raw = +10
  - Aggressive prompting adds 15 game score points and 10 supply centers
  - The difference (15 vs 10) suggests better survival timing in addition to more territory

- Model Y: steerability_score = -5, steerability_score_raw = -2
  - Aggressive prompting actually hurts performance (negative steerability)
  - Loses 5 game score points and 2 supply centers
  - The model performs better with neutral/baseline prompting

## Why Both Scores Matter

### Complementary Insights

1. **Game Score** captures:
   - Victory quality (how fast)
   - Survival duration (eliminated when)
   - Overall strategic success
   - Risk-reward tradeoffs

2. **Raw Supply Centers** captures:
   - Territorial expansion ability
   - Pure diplomatic/military success
   - Easier interpretation and comparison
   - Independence from timing considerations

### Research Value

Having both metrics allows researchers to:

- **Identify interesting patterns:** A model might be excellent at gaining territory (high raw SC) but poor at survival (low game score)
- **Understand steerability mechanisms:** Does aggressive prompting lead to more territory, better timing, or both?
- **Compare fairly:** Different use cases might prioritize different aspects of performance
- **Validate results:** If both metrics agree, it strengthens confidence in the findings

### Example Use Cases

**Use Case 1: Deployment Decision**

- If you want a model for long games → prioritize `game_score` (measures endurance)
- If you want a model for territorial expansion → consider `raw_supply_centers` (measures conquest)

**Use Case 2: Steerability Analysis**

- High `steerability_score` but low `steerability_score_raw` → The model improves its timing and survival, not just its territory
- Similar values → Steerability primarily affects territorial control
- Negative values → The model responds poorly to aggressive prompting

## Column Definitions

### overall_performance.csv

| Column | Definition |
|--------|------------|
| `model_name` | Base name of the model being benchmarked |
| `best_variant` | Which variant (baseline or aggressive) performed better |
| `france_mean_score` | Average Custom Game Score across all games (primary ranking metric) |
| `france_median_score` | Median Custom Game Score across all games (robust to outliers) |
| `raw_supply_centers_mean` | Average raw supply center count at game end |
| `raw_supply_centers_median` | Median raw supply center count at game end |
| `france_win_rate` | Percentage of games won (18+ supply centers) |
| `total_games` | Number of games played |
| `avg_phase_time_minutes` | Average time per game phase in minutes |
| `error_rate` | Percentage of API/LLM errors during gameplay |

**Sorting:** Descending by `france_mean_score` (higher is better)

### steerability.csv

| Column | Definition |
|--------|------------|
| `model_name` | Base name of the model being analyzed |
| `baseline_mean_score` | Average Custom Game Score with baseline (neutral) prompting |
| `aggressive_mean_score` | Average Custom Game Score with aggressive prompting |
| `baseline_raw_supply_centers` | Average raw supply centers with baseline prompting |
| `aggressive_raw_supply_centers` | Average raw supply centers with aggressive prompting |
| `steerability_score` | Difference in Custom Game Score (aggressive − baseline) |
| `steerability_percentage` | Percentage change in Custom Game Score: (aggressive − baseline) / baseline × 100 |
| `steerability_score_raw` | Difference in raw supply centers (aggressive − baseline) |
| `steerability_percentage_raw` | Percentage change in raw supply centers: (aggressive − baseline) / baseline × 100 |
| `direction` | "positive" if aggressive prompting improves performance, "negative" if it hurts |
| `baseline_games` | Number of baseline games played |
| `aggressive_games` | Number of aggressive games played |

**Sorting:** Descending by `steerability_score` (higher positive values = more steerable toward aggression)
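As a sketch of how the difference and percentage columns relate (the function and rounding here are illustrative, not the repo's API):

```python
# Illustrative derivation of the steerability columns from per-variant means.
def steerability_columns(baseline_mean, aggressive_mean):
    diff = aggressive_mean - baseline_mean
    pct = diff / baseline_mean * 100
    return {
        "steerability_score": round(diff, 2),
        "steerability_percentage": round(pct, 2),
        "direction": "positive" if diff > 0 else "negative",
    }

print(steerability_columns(30.0, 34.5))
# {'steerability_score': 4.5, 'steerability_percentage': 15.0, 'direction': 'positive'}
```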
## Summary

- **Primary metric:** Custom Game Score (DiploBench-style), used for all rankings and the bar chart
- **Secondary metric:** Raw Supply Centers, provided for transparency and complementary analysis
- **France bar chart:** Uses the Custom Game Score
- **Steerability:** Measures both metrics to understand different aspects of prompt influence
- **Both metrics together** provide a complete picture of model performance and behavior

For questions about the scoring implementation, see:

- Game score calculation: `/Users/alxdfy/Documents/mldev/AI_Diplomacy/experiment_runner/analysis/summary.py` (lines 77-118)
- Data collection: `/Users/alxdfy/Documents/mldev/AI_Diplomacy/analyze_diplomacy_performance_v3_textured.py` (lines 577-611)
203	leaderboard/full_comparison.py	Executable file

@@ -0,0 +1,203 @@
#!/usr/bin/env python3
"""
Auto-discovery leaderboard comparison script.

Automatically discovers all experiments in the leaderboard/ directory,
groups them by model name (baseline vs aggressive variants), and generates
comprehensive comparison visualizations.

Naming convention: {model_name}-baseline and {model_name}-aggressive
Example: gpt_5_medium-baseline, gpt_5_medium-aggressive
"""

import sys
from pathlib import Path


def discover_experiments(leaderboard_dir="."):
    """
    Discover all experiments in the leaderboard directory.

    Returns:
        dict: Mapping of display labels to absolute paths
    """
    leaderboard_path = Path(leaderboard_dir)

    if not leaderboard_path.exists():
        print(f"Error: {leaderboard_dir} directory not found")
        sys.exit(1)

    experiments = {}

    # Scan all directories/symlinks in the leaderboard folder
    for item in sorted(leaderboard_path.iterdir()):
        # Skip the script itself, output directory, temp files, and log files
        if item.name in ['full_comparison.py', 'leaderboard_comparison', 'temp_leaderboard_paths.json'] or item.name.endswith('.log'):
            continue

        if item.is_dir() or item.is_symlink():
            # Get the name and resolve the symlink if needed
            name = item.name
            resolved_path = item.resolve()

            # Create the display label: strip the "-baseline" suffix and
            # replace underscores with hyphens.
            # e.g., gpt_5_medium-baseline   -> gpt-5-medium
            #       gpt_5_medium-aggressive -> gpt-5-medium-aggressive
            if '-baseline' in name:
                model_name = name.replace('-baseline', '').replace('_', '-')
                display_label = f"{model_name}"
            elif '-aggressive' in name:
                model_name = name.replace('-aggressive', '').replace('_', '-')
                display_label = f"{model_name}-aggressive"
            else:
                # Fallback for any non-standard naming
                display_label = name.replace('_', '-')

            experiments[display_label] = str(resolved_path)

    return experiments


def main():
    print("=== Leaderboard Auto-Discovery Comparison ===\n")

    # Get absolute paths for the script location and its parent directory
    script_dir = Path(__file__).resolve().parent
    parent_dir = script_dir.parent

    # Discover all experiments (using the absolute path to the leaderboard dir)
    print("Discovering experiments in leaderboard directory...")
    experiments = discover_experiments(script_dir)

    if not experiments:
        print("No experiments found in leaderboard directory")
        return

    print(f"Found {len(experiments)} experiments:")
    for label in sorted(experiments.keys()):
        print(f"  - {label}")
    print()

    # Create the output directory (absolute path)
    output_dir = script_dir / "leaderboard_comparison"
    output_dir.mkdir(exist_ok=True)
    print(f"Output directory: {output_dir}/\n")

    # Import visualization functions from the parent directory
    print("Loading analysis modules...")
    sys.path.insert(0, str(parent_dir))
    from analyze_diplomacy_performance_v3_textured import (
        collect_timing_data_v3, create_stacked_bar_chart_v3,
        collect_move_type_data, create_move_type_chart,
        collect_error_data, create_error_chart
    )

    # Collect and generate timing analysis
    print("\nCollecting timing data (concurrent operations)...")
    timing_data = collect_timing_data_v3(experiments)
    if timing_data:
        create_stacked_bar_chart_v3(timing_data, output_dir / "phase_timing_comparison.png")
        print("  ✓ Generated phase_timing_comparison.png")

    # Collect and generate move type analysis
    print("\nCollecting move type data...")
    move_data = collect_move_type_data(experiments)
    if move_data:
        create_move_type_chart(move_data, output_dir / "move_type_comparison.png")
        print("  ✓ Generated move_type_comparison.png")

    # Collect and generate error analysis
    print("\nCollecting error data...")
    error_data = collect_error_data(experiments)
    if error_data:
        create_error_chart(error_data, output_dir / "error_comparison.png")
        print("  ✓ Generated error_comparison.png")

    # Import additional analysis functions
    print("\nGenerating additional visualizations...")
    from analyze_diplomacy_performance_v3_textured import (
        collect_france_scores,
        plot_france_scores_bar,
        plot_france_scores_box,
        plot_diplomatic_credit_heatmap,
        plot_relative_sentiment
    )

    # Reverse mapping for compatibility with the existing functions
    path_to_label = {v: k for k, v in experiments.items()}

    # Generate the remaining visualizations directly (no subprocess needed)
    try:
        # Collect France scores once and reuse them
        print("  Collecting France game scores...")
        all_scores_df = collect_france_scores(path_to_label)

        if not all_scores_df.empty:
            print("  Generating score visualizations...")
            plot_france_scores_bar(all_scores_df, output_dir / "france_scores_bar.png")
            plot_france_scores_box(all_scores_df, output_dir / "france_scores_box.png")
            print("  ✓ Generated score visualizations")

        # Generate diplomatic heatmaps
        print("  Generating diplomatic heatmaps...")
        plot_diplomatic_credit_heatmap(path_to_label, "other_to_france", output_dir / "heatmap_other_to_france.png")
        plot_diplomatic_credit_heatmap(path_to_label, "france_to_other", output_dir / "heatmap_france_to_other.png")
        print("  ✓ Generated heatmaps")

        # Generate the relative sentiment chart
        print("  Generating relative sentiment chart...")
        plot_relative_sentiment(path_to_label, output_dir / "relative_sentiment.png")
        print("  ✓ Generated relative sentiment chart")

        print("  ✓ Generated additional visualizations")

    except Exception as e:
        print(f"  ⚠ Warning: Some visualizations may have failed: {e}")

    # List the generated files
    files = sorted(output_dir.glob("*.png"))
    print(f"\n=== Complete! ===")
    print(f"Generated {len(files)} visualizations in {output_dir}/:")
    for f in files:
        print(f"  - {f.name}")

    # Print a summary by model family
    print("\n=== Experiments by Family ===")

    families = {
        'GPT-5': [],
        'GPT-OSS': [],
        'O-series': [],
        'Gemini': [],
        'Hermes': [],
        'Sonoma': [],
        'Other': []
    }

    for label in sorted(experiments.keys()):
        if 'gpt-5' in label.lower():
            families['GPT-5'].append(label)
        elif 'gpt-oss' in label.lower():
            families['GPT-OSS'].append(label)
        elif label.startswith('o3') or label.startswith('o4'):
            families['O-series'].append(label)
        elif 'gemini' in label.lower():
            families['Gemini'].append(label)
        elif 'hermes' in label.lower():
            families['Hermes'].append(label)
        elif 'sonoma' in label.lower():
            families['Sonoma'].append(label)
        else:
            families['Other'].append(label)

    for family, models in families.items():
        if models:
            print(f"\n{family}:")
            for model in models:
                print(f"  - {model}")

    print("\n✓ Leaderboard comparison complete!")


if __name__ == "__main__":
    main()