Merge pull request #42 from EveryInc/experiment-runner

Experiment runner
2026-04-19 12:58:09 +00:00 · 2025-06-22 09:34:55 -07:00 · 2025-06-22 09:34:55 -07:00 · 617ccdb1a0
commit 617ccdb1a0
parent 4bc5774248 840c6b0ad9
16 changed files with 943 additions and 80 deletions
--- a/README.md
+++ b/README.md
@ -236,13 +236,92 @@ python lm_game.py --max_year 1910 --planning_phase --num_negotiation_rounds 2
 # Custom model assignment (order: AUSTRIA, ENGLAND, FRANCE, GERMANY, ITALY, RUSSIA, TURKEY)
 python lm_game.py --models "claude-3-5-sonnet-20241022,gpt-4o,claude-3-5-sonnet-20241022,gpt-4o,claude-3-5-sonnet-20241022,gpt-4o,claude-3-5-sonnet-20241022"
 # Output to specific file
 python lm_game.py --output results/my_game.json
 # Run until game completion or specific year
 python lm_game.py --num_negotiation_rounds 2 --planning_phase
 # Write all artefacts to a chosen directory (auto-resumes if it already exists)
 python lm_game.py --run_dir results/game_run_001
 # Resume an interrupted game from a specific phase
 python lm_game.py --run_dir results/game_run_001 --resume_from_phase S1902M
 # Critical-state analysis: resume from an existing run but save new results elsewhere
 python lm_game.py \
  --run_dir results/game_run_001 \
  --critical_state_analysis_dir results/critical_analysis_001 \
  --resume_from_phase F1903M
 # End the simulation after a particular phase regardless of remaining years
 python lm_game.py --run_dir results/game_run_002 --end_at_phase F1905M
 # Set the global max_tokens generation limit
 python lm_game.py --run_dir results/game_run_003 --max_tokens 8000
 # Per-model token limits (AU,EN,FR,GE,IT,RU,TR)
 python lm_game.py --run_dir results/game_run_004 \
  --max_tokens_per_model "8000,8000,16000,8000,8000,16000,8000"
 # Use a custom prompts directory
 python lm_game.py --run_dir results/game_run_005 --prompts_dir ./prompts/my_variants
 ```
 ### Running Batch Experiments with **`experiment_runner.py`**
 `experiment_runner.py` is a lightweight orchestrator: it spins up many `lm_game.py` runs in parallel, gathers their artefacts under one *experiment directory*, and then executes the analysis modules you specify.
 All flags that belong to **`lm_game.py`** can be passed straight through; the runner validates them and forwards them unchanged to every game instance.
 ---
 #### Examples
 ```bash
 # Run 10 independent games (iterations) in parallel, using a custom prompts dir
 # and a single model (GPT-4o) for all seven powers.
 python3 experiment_runner.py \
    --experiment_dir "results/exp001" \
    --iterations 10 \
    --parallel 10 \
    --max_year 1905 \
    --num_negotiation_rounds 0 \
    --prompts_dir "ai_diplomacy/prompts" \
    --models "gpt-4o,gpt-4o,gpt-4o,gpt-4o,gpt-4o,gpt-4o,gpt-4o"
 # Critical-state analysis: resume every run from W1901A (taken from an existing
 # base run) and stop after S1902M.  Two analysis modules will be executed:
 #  • summary         → aggregated results & scores
 #  • critical_state  → before/after snapshots around the critical phase
 python3 experiment_runner.py \
    --experiment_dir "results/exp002" \
    --iterations 10 \
    --parallel 10 \
    --resume_from_phase W1901A \
    --end_at_phase S1902M \
    --num_negotiation_rounds 0 \
    --critical_state_base_run "results/test1" \
    --prompts_dir "ai_diplomacy/prompts" \
    --analysis_modules "summary,critical_state" \
    --models "gpt-4o,gpt-4o,gpt-4o,gpt-4o,gpt-4o,gpt-4o,gpt-4o"
 ```
 *(Any other `lm_game.py` flags—`--planning_phase`, `--max_tokens`, etc.—can be added exactly where you’d use them on a single-game run.)*
 ---
 #### Experiment-runner–specific arguments
 | Flag                              | Type / Default             | Description                                                                                                                                                                              |
 | --------------------------------- | -------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `--experiment_dir` **(required)** | `Path`                     | Root folder for the experiment; sub-folders `runs/` and `analysis/` are managed automatically. Re-running with the same directory will **resume** existing runs and regenerate analysis. |
 | `--iterations`                    | `int`, default `1`         | How many individual games to launch for this experiment.                                                                                                                                 |
 | `--parallel`                      | `int`, default `1`         | Max number of games to execute concurrently (uses a process pool).                                                                                                                       |
 | `--analysis_modules`              | `str`, default `"summary"` | Comma-separated list of analysis modules to run after all games finish. Modules are imported from `experiment_runner.analysis.<name>` and must provide `run(experiment_dir, ctx)`.       |
 | `--critical_state_base_run`       | `Path`, optional           | Path to an **existing** `run_dir` produced by a previous `lm_game` run. Each iteration resumes from that snapshot; new artefacts are written under the current `experiment_dir`.         |
 | `--seed_base`                     | `int`, default `42`        | Base random seed. Run *i* receives seed = `seed_base + i`, enabling reproducible batches.                                                                                                |
 *(All other command-line flags belong to `lm_game.py` and are forwarded unchanged.)*
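A minimal sketch of how the pass-through and seeding described above could work, assuming the runner separates its own flags with `argparse.parse_known_args`. The function name `plan_runs` and the `run_00000` folder naming are hypothetical and not taken from `experiment_runner.py`; only the flag names, the `runs/` sub-folder, and the `seed_base + i` rule come from the text above.

```python
import argparse
from pathlib import Path


def plan_runs(argv):
    """Split runner flags from pass-through flags and plan one command per iteration."""
    # allow_abbrev=False keeps a forwarded flag from being mistaken for a
    # shortened runner flag.
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument("--experiment_dir", type=Path, required=True)
    parser.add_argument("--iterations", type=int, default=1)
    parser.add_argument("--seed_base", type=int, default=42)
    # parse_known_args hands back unrecognised flags untouched, which is what
    # lets lm_game.py options pass straight through.
    args, forwarded = parser.parse_known_args(argv)

    plans = []
    for i in range(args.iterations):
        run_dir = args.experiment_dir / "runs" / f"run_{i:05d}"  # hypothetical naming
        cmd = ["python", "lm_game.py", "--run_dir", str(run_dir), *forwarded]
        plans.append({"seed": args.seed_base + i, "cmd": cmd})
    return plans


plans = plan_runs([
    "--experiment_dir", "results/exp001",
    "--iterations", "3",
    "--max_year", "1905",
    "--num_negotiation_rounds", "0",
])
for plan in plans:
    print(plan["seed"], " ".join(plan["cmd"]))
```

Each planned command could then be handed to a process pool capped at `--parallel` workers. How the seed reaches the game process is left open here because the text does not say.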
 ### Environment Setup
 Create a `.env` file with your API keys:
--- a/ai_diplomacy/agent.py
+++ b/ai_diplomacy/agent.py
@ -22,13 +22,17 @@ ALL_POWERS = frozenset({"AUSTRIA", "ENGLAND", "FRANCE", "GERMANY", "ITALY", "RUS
 ALLOWED_RELATIONSHIPS = ["Enemy", "Unfriendly", "Neutral", "Friendly", "Ally"]
 # == New: Helper function to load prompt files reliably ==
-def _load_prompt_file(filename: str) -> Optional[str]:
+def _load_prompt_file(filename: str, prompts_dir: Optional[str] = None) -> Optional[str]:
    """Loads a prompt template from the prompts directory."""
    try:
-        # Construct path relative to this file's location
-        current_dir = os.path.dirname(os.path.abspath(__file__))
-        prompts_dir = os.path.join(current_dir, 'prompts')
-        filepath = os.path.join(prompts_dir, filename)
+        if prompts_dir:
+            filepath = os.path.join(prompts_dir, filename)
+        else:
+            # Construct path relative to this file's location
+            current_dir = os.path.dirname(os.path.abspath(__file__))
+            default_prompts_dir = os.path.join(current_dir, 'prompts')
+            filepath = os.path.join(default_prompts_dir, filename)
        with open(filepath, 'r', encoding='utf-8') as f:
            return f.read()
    except FileNotFoundError:
@ -50,6 +54,7 @@ class DiplomacyAgent:
        client: BaseModelClient, 
        initial_goals: Optional[List[str]] = None,
        initial_relationships: Optional[Dict[str, str]] = None,
+        prompts_dir: Optional[str] = None,
    ):
        """
        Initializes the DiplomacyAgent.
@ -60,12 +65,14 @@ class DiplomacyAgent:
            initial_goals: An optional list of initial strategic goals.
            initial_relationships: An optional dictionary mapping other power names to 
                                     relationship statuses (e.g., 'ALLY', 'ENEMY', 'NEUTRAL').
+            prompts_dir: Optional path to the prompts directory.
        """
        if power_name not in ALL_POWERS:
            raise ValueError(f"Invalid power name: {power_name}. Must be one of {ALL_POWERS}")
        self.power_name: str = power_name
        self.client: BaseModelClient = client
+        self.prompts_dir: Optional[str] = prompts_dir
        # Initialize goals as empty list, will be populated by initialize_agent_state
        self.goals: List[str] = initial_goals if initial_goals is not None else [] 
        # Initialize relationships to Neutral if not provided
@ -85,16 +92,21 @@ class DiplomacyAgent:
        # Get the directory containing the current file (agent.py)
        current_dir = os.path.dirname(os.path.abspath(__file__))
        # Construct path relative to the current file's directory
-        prompts_dir = os.path.join(current_dir, "prompts") 
-        power_prompt_filename = os.path.join(prompts_dir, f"{power_name.lower()}_system_prompt.txt")
-        default_prompt_filename = os.path.join(prompts_dir, "system_prompt.txt")
-        system_prompt_content = load_prompt(power_prompt_filename)
+        default_prompts_path = os.path.join(current_dir, "prompts") 
+        power_prompt_filename = f"{power_name.lower()}_system_prompt.txt"
+        default_prompt_filename = "system_prompt.txt"
+        # Use the provided prompts_dir if available, otherwise use the default
+        prompts_path_to_use = self.prompts_dir if self.prompts_dir else default_prompts_path
+        power_prompt_filepath = os.path.join(prompts_path_to_use, power_prompt_filename)
+        default_prompt_filepath = os.path.join(prompts_path_to_use, default_prompt_filename)
+        system_prompt_content = load_prompt(power_prompt_filepath, prompts_dir=self.prompts_dir)
        if not system_prompt_content:
-            logger.warning(f"Power-specific prompt '{power_prompt_filename}' not found or empty. Loading default system prompt.")
-            # system_prompt_content = load_prompt("system_prompt.txt")
-            system_prompt_content = load_prompt(default_prompt_filename)
+            logger.warning(f"Power-specific prompt '{power_prompt_filepath}' not found or empty. Loading default system prompt.")
+            system_prompt_content = load_prompt(default_prompt_filepath, prompts_dir=self.prompts_dir)
        else:
             logger.info(f"Loaded power-specific system prompt for {power_name}.")
        # ----------------------------------------------------
@ -416,7 +428,7 @@ class DiplomacyAgent:
        try:
            # Load the template file but safely preprocess it first
-            prompt_template_content = _load_prompt_file('negotiation_diary_prompt.txt')
+            prompt_template_content = _load_prompt_file('negotiation_diary_prompt.txt', prompts_dir=self.prompts_dir)
            if not prompt_template_content:
                logger.error(f"[{self.power_name}] Could not load negotiation_diary_prompt.txt. Skipping diary entry.")
                success_status = "Failure: Prompt file not loaded"
@ -611,7 +623,7 @@ class DiplomacyAgent:
        logger.info(f"[{self.power_name}] Generating order diary entry for {game.current_short_phase}...")
        # Load the template but we'll use it carefully with string interpolation
-        prompt_template = _load_prompt_file('order_diary_prompt.txt')
+        prompt_template = _load_prompt_file('order_diary_prompt.txt', prompts_dir=self.prompts_dir)
        if not prompt_template:
            logger.error(f"[{self.power_name}] Could not load order_diary_prompt.txt. Skipping diary entry.")
            return
@ -756,7 +768,7 @@ class DiplomacyAgent:
        logger.info(f"[{self.power_name}] Generating phase result diary entry for {game.current_short_phase}...")
        # Load the template
-        prompt_template = _load_prompt_file('phase_result_diary_prompt.txt')
+        prompt_template = _load_prompt_file('phase_result_diary_prompt.txt', prompts_dir=self.prompts_dir)
        if not prompt_template:
            logger.error(f"[{self.power_name}] Could not load phase_result_diary_prompt.txt. Skipping diary entry.")
            return
@ -859,7 +871,7 @@ class DiplomacyAgent:
        try:
            # 1. Construct the prompt using the dedicated state update prompt file
-            prompt_template = _load_prompt_file('state_update_prompt.txt')
+            prompt_template = _load_prompt_file('state_update_prompt.txt', prompts_dir=self.prompts_dir)
            if not prompt_template:
                 logger.error(f"[{power_name}] Could not load state_update_prompt.txt. Skipping state update.")
                 return
@ -893,6 +905,7 @@ class DiplomacyAgent:
                agent_goals=self.goals,
                agent_relationships=self.relationships,
                agent_private_diary=formatted_diary, # Pass formatted diary
+                prompts_dir=self.prompts_dir,
            )
            # Add previous phase summary to the information provided to the LLM
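One point that may deserve a second look in the system-prompt hunk of this file: `load_prompt` is given a path that has already been joined with the prompts directory, and it is also given `prompts_dir=self.prompts_dir`. If `load_prompt` joins the two again, as the `utils.py` hunk in this diff suggests, a relative custom directory would appear twice in the final path. The sketch below reproduces only the join logic; `resolve` and the `/pkg/ai_diplomacy/prompts` default are stand-ins, and `posixpath` is used so the result does not depend on the platform.

```python
import posixpath


def resolve(filename, prompts_dir=None, default_dir="/pkg/ai_diplomacy/prompts"):
    """Stand-in for the path join inside load_prompt (default_dir is hypothetical)."""
    base = prompts_dir if prompts_dir else default_dir
    return posixpath.join(base, filename)


custom = "prompts/my_variants"
prejoined = posixpath.join(custom, "france_system_prompt.txt")

# Passing an already-joined relative path repeats the directory.
doubled = resolve(prejoined, prompts_dir=custom)
print(doubled)

# Passing the bare filename gives the intended path.
intended = resolve("france_system_prompt.txt", prompts_dir=custom)
print(intended)

# With an absolute directory the second join discards the first component,
# so the problem would stay hidden.
absolute = "/data/prompts"
hidden = resolve(posixpath.join(absolute, "france_system_prompt.txt"), prompts_dir=absolute)
print(hidden)
```

If that reading is right, passing the bare filename together with `prompts_dir` would avoid the repeated directory for relative values such as `./prompts/my_variants`.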
--- a/ai_diplomacy/clients.py
+++ b/ai_diplomacy/clients.py
@ -44,10 +44,11 @@ class BaseModelClient:
      - get_conversation_reply(power_name, conversation_so_far, game_phase) -> str
    """
-    def __init__(self, model_name: str):
+    def __init__(self, model_name: str, prompts_dir: Optional[str] = None):
        self.model_name = model_name
+        self.prompts_dir = prompts_dir
        # Load a default initially, can be overwritten by set_system_prompt
-        self.system_prompt = load_prompt("system_prompt.txt") 
+        self.system_prompt = load_prompt("system_prompt.txt", prompts_dir=self.prompts_dir) 
        self.max_tokens = 16000  # default unless overridden
    def set_system_prompt(self, content: str):
@ -97,6 +98,7 @@ class BaseModelClient:
            agent_goals=agent_goals,
            agent_relationships=agent_relationships,
            agent_private_diary_str=agent_private_diary_str,
+            prompts_dir=self.prompts_dir,
        )
        raw_response = ""
@ -423,7 +425,7 @@ class BaseModelClient:
        agent_private_diary_str: Optional[str] = None, # Added
    ) -> str:
-        instructions = load_prompt("planning_instructions.txt")
+        instructions = load_prompt("planning_instructions.txt", prompts_dir=self.prompts_dir)
        context = self.build_context_prompt(
            game,
@ -434,6 +436,7 @@ class BaseModelClient:
            agent_goals=agent_goals,
            agent_relationships=agent_relationships,
            agent_private_diary=agent_private_diary_str, # Pass diary string
+            prompts_dir=self.prompts_dir,
        )
        return context + "\n\n" + instructions
@ -451,7 +454,7 @@ class BaseModelClient:
        agent_relationships: Optional[Dict[str, str]] = None,
        agent_private_diary_str: Optional[str] = None, # Added
    ) -> str:
-        instructions = load_prompt("conversation_instructions.txt")
+        instructions = load_prompt("conversation_instructions.txt", prompts_dir=self.prompts_dir)
        context = build_context_prompt(
            game,
@ -462,6 +465,7 @@ class BaseModelClient:
            agent_goals=agent_goals,
            agent_relationships=agent_relationships,
            agent_private_diary=agent_private_diary_str, # Pass diary string
+            prompts_dir=self.prompts_dir,
        )
        # Get recent messages targeting this power to prioritize responses
@ -699,7 +703,7 @@ class BaseModelClient:
        """
        logger.info(f"Client generating strategic plan for {power_name}...")
-        planning_instructions = load_prompt("planning_instructions.txt")
+        planning_instructions = load_prompt("planning_instructions.txt", prompts_dir=self.prompts_dir)
        if not planning_instructions:
            logger.error("Could not load planning_instructions.txt! Cannot generate plan.")
            return "Error: Planning instructions not found."
@ -718,6 +722,7 @@ class BaseModelClient:
            agent_goals=agent_goals,
            agent_relationships=agent_relationships,
            agent_private_diary=agent_private_diary_str, # Pass diary string
+            prompts_dir=self.prompts_dir,
        )
        full_prompt = f"{context_prompt}\n\n{planning_instructions}"
@ -772,8 +777,8 @@ class OpenAIClient(BaseModelClient):
    For 'o3-mini', 'gpt-4o', or other OpenAI model calls.
    """
-    def __init__(self, model_name: str):
+    def __init__(self, model_name: str, prompts_dir: Optional[str] = None):
-        super().__init__(model_name)
+        super().__init__(model_name, prompts_dir=prompts_dir)
        self.client = AsyncOpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    async def generate_response(self, prompt: str, temperature: float = 0.0, inject_random_seed: bool = True) -> str:
@ -819,8 +824,8 @@ class ClaudeClient(BaseModelClient):
    For 'claude-3-5-sonnet-20241022', 'claude-3-5-haiku-20241022', etc.
    """
-    def __init__(self, model_name: str):
+    def __init__(self, model_name: str, prompts_dir: Optional[str] = None):
-        super().__init__(model_name)
+        super().__init__(model_name, prompts_dir=prompts_dir)
        self.client = AsyncAnthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
    async def generate_response(self, prompt: str, temperature: float = 0.0, inject_random_seed: bool = True) -> str:
@ -861,8 +866,8 @@ class GeminiClient(BaseModelClient):
    For 'gemini-1.5-flash' or other Google Generative AI models.
    """
-    def __init__(self, model_name: str):
+    def __init__(self, model_name: str, prompts_dir: Optional[str] = None):
-        super().__init__(model_name)
+        super().__init__(model_name, prompts_dir=prompts_dir)
        # Configure and get the model (corrected initialization)
        api_key = os.environ.get("GEMINI_API_KEY")
        if not api_key:
@ -905,8 +910,8 @@ class DeepSeekClient(BaseModelClient):
    For DeepSeek R1 'deepseek-reasoner'
    """
-    def __init__(self, model_name: str):
+    def __init__(self, model_name: str, prompts_dir: Optional[str] = None):
-        super().__init__(model_name)
+        super().__init__(model_name, prompts_dir=prompts_dir)
        self.api_key = os.environ.get("DEEPSEEK_API_KEY")
        self.client = AsyncDeepSeekOpenAI(
            api_key=self.api_key, 
@ -961,8 +966,8 @@ class OpenAIResponsesClient(BaseModelClient):
    This client makes direct HTTP requests to the v1/responses endpoint.
    """
-    def __init__(self, model_name: str):
+    def __init__(self, model_name: str, prompts_dir: Optional[str] = None):
-        super().__init__(model_name)
+        super().__init__(model_name, prompts_dir=prompts_dir)
        self.api_key = os.environ.get("OPENAI_API_KEY")
        if not self.api_key:
            raise ValueError("OPENAI_API_KEY environment variable is required")
@ -1068,14 +1073,14 @@ class OpenRouterClient(BaseModelClient):
    For OpenRouter models, with default being 'openrouter/quasar-alpha'
    """
-    def __init__(self, model_name: str = "openrouter/quasar-alpha"):
+    def __init__(self, model_name: str = "openrouter/quasar-alpha", prompts_dir: Optional[str] = None):
        # Allow specifying just the model identifier or the full path
        if not model_name.startswith("openrouter/") and "/" not in model_name:
            model_name = f"openrouter/{model_name}"
        if model_name.startswith("openrouter-"):
            model_name = model_name.replace("openrouter-", "")
-        super().__init__(model_name)
+        super().__init__(model_name, prompts_dir=prompts_dir)
        self.api_key = os.environ.get("OPENROUTER_API_KEY")
        if not self.api_key:
            raise ValueError("OPENROUTER_API_KEY environment variable is required")
@ -1146,8 +1151,8 @@ class TogetherAIClient(BaseModelClient):
    Model names should be passed without the 'together-' prefix.
    """
-    def __init__(self, model_name: str):
+    def __init__(self, model_name: str, prompts_dir: Optional[str] = None):
-        super().__init__(model_name)  # model_name here is the actual Together AI model identifier
+        super().__init__(model_name, prompts_dir=prompts_dir)  # model_name here is the actual Together AI model identifier
        self.api_key = os.environ.get("TOGETHER_API_KEY")
        if not self.api_key:
            raise ValueError("TOGETHER_API_KEY environment variable is required for TogetherAIClient")
@ -1198,12 +1203,13 @@ class TogetherAIClient(BaseModelClient):
 ##############################################################################
-def load_model_client(model_id: str) -> BaseModelClient:
+def load_model_client(model_id: str, prompts_dir: Optional[str] = None) -> BaseModelClient:
    """
    Returns the appropriate LLM client for a given model_id string.
    Args:
        model_id: The model identifier
+        prompts_dir: Optional path to the prompts directory.
    Example usage:
       client = load_model_client("claude-3-5-sonnet-20241022")
@ -1213,23 +1219,23 @@ def load_model_client(model_id: str) -> BaseModelClient:
    # Check for o3-pro model specifically - it needs the Responses API
    if lower_id == "o3-pro":
-        return OpenAIResponsesClient(model_id)
+        return OpenAIResponsesClient(model_id, prompts_dir=prompts_dir)
    # Check for OpenRouter first to handle prefixed models like openrouter-deepseek
    elif model_id.startswith("together-"):
        actual_model_name = model_id.split("together-", 1)[1]
        logger.info(f"Loading TogetherAI client for model: {actual_model_name} (original ID: {model_id})")
-        return TogetherAIClient(actual_model_name)
+        return TogetherAIClient(actual_model_name, prompts_dir=prompts_dir)
    elif "openrouter" in model_id.lower() or "/" in model_id: # More general check for OpenRouterClient(model_id)
-        return OpenRouterClient(model_id)
+        return OpenRouterClient(model_id, prompts_dir=prompts_dir)
    elif "claude" in lower_id:
-        return ClaudeClient(model_id)
+        return ClaudeClient(model_id, prompts_dir=prompts_dir)
    elif "gemini" in lower_id:
-        return GeminiClient(model_id)
+        return GeminiClient(model_id, prompts_dir=prompts_dir)
    elif "deepseek" in lower_id:
-        return DeepSeekClient(model_id)
+        return DeepSeekClient(model_id, prompts_dir=prompts_dir)
    else:
        # Default to OpenAI (for models like o3-mini, gpt-4o, etc.)
-        return OpenAIClient(model_id)
+        return OpenAIClient(model_id, prompts_dir=prompts_dir)
 ##############################################################################
@ -1249,4 +1255,4 @@ def get_visible_messages_for_power(conversation_messages, power_name):
            or msg["recipient"] == power_name
        ):
            visible.append(msg)
-    return visible  # already in chronological order if appended that way
+    return visible  # already in chronological order if appended that way
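The branch order in `load_model_client` decides which client a given ID reaches, so it may help to see it in isolation. The sketch below copies only the conditions from the hunk above and returns a label instead of constructing a client; `pick_client_kind` and the label strings are hypothetical names used for illustration.

```python
def pick_client_kind(model_id: str) -> str:
    """Return a label for the client class the dispatch above would choose."""
    lower_id = model_id.lower()
    if lower_id == "o3-pro":
        return "openai-responses"
    elif model_id.startswith("together-"):
        return "together"
    elif "openrouter" in lower_id or "/" in model_id:
        return "openrouter"
    elif "claude" in lower_id:
        return "claude"
    elif "gemini" in lower_id:
        return "gemini"
    elif "deepseek" in lower_id:
        return "deepseek"
    else:
        return "openai"


examples = [
    "gpt-4o",
    "o3-pro",
    "claude-3-5-sonnet-20241022",
    "deepseek-reasoner",
    "deepseek/deepseek-r1",             # contains "/", so the OpenRouter branch wins
    "together-meta-llama/Llama-3-70b",  # the prefix check runs before the "/" check
]
for model_id in examples:
    print(f"{model_id} -> {pick_client_kind(model_id)}")
```

Because every branch now passes `prompts_dir` through, the chosen client class does not affect which prompt directory is used.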
--- a/ai_diplomacy/diary_logic.py
+++ b/ai_diplomacy/diary_logic.py
@ -1,7 +1,7 @@
 # ai_diplomacy/diary_logic.py
 import logging
 import re
-from typing import TYPE_CHECKING
+from typing import TYPE_CHECKING, Optional
 from .utils import run_llm_and_log, log_llm_response
@ -11,13 +11,17 @@ if TYPE_CHECKING:
 logger = logging.getLogger(__name__)
-def _load_prompt_file(filename: str) -> str | None:
+def _load_prompt_file(filename: str, prompts_dir: Optional[str] = None) -> str | None:
    """A local copy of the helper from agent.py to avoid circular imports."""
    import os
    try:
-        current_dir = os.path.dirname(os.path.abspath(__file__))
-        prompts_dir = os.path.join(current_dir, 'prompts')
-        filepath = os.path.join(prompts_dir, filename)
+        if prompts_dir:
+            filepath = os.path.join(prompts_dir, filename)
+        else:
+            current_dir = os.path.dirname(os.path.abspath(__file__))
+            default_prompts_dir = os.path.join(current_dir, 'prompts')
+            filepath = os.path.join(default_prompts_dir, filename)
        with open(filepath, 'r', encoding='utf-8') as f:
            return f.read()
    except Exception as e:
@ -29,6 +33,7 @@ async def run_diary_consolidation(
    game: "Game",
    log_file_path: str,
    entries_to_keep_unsummarized: int = 15,
+    prompts_dir: Optional[str] = None,
 ):
    """
    Consolidate older diary entries while keeping recent ones.
@ -93,7 +98,7 @@ async def run_diary_consolidation(
        )
        return
-    prompt_template = _load_prompt_file("diary_consolidation_prompt.txt")
+    prompt_template = _load_prompt_file("diary_consolidation_prompt.txt", prompts_dir=prompts_dir)
    if not prompt_template:
        logger.error(
            f"[{agent.power_name}] diary_consolidation_prompt.txt missing — aborting"
--- a/ai_diplomacy/game_logic.py
+++ b/ai_diplomacy/game_logic.py
@ -31,16 +31,17 @@ def serialize_agent(agent: DiplomacyAgent) -> dict:
        "private_diary": agent.private_diary,
    }
-def deserialize_agent(agent_data: dict) -> DiplomacyAgent:
+def deserialize_agent(agent_data: dict, prompts_dir: Optional[str] = None) -> DiplomacyAgent:
    """Recreates an agent object from a dictionary."""
-    client = load_model_client(agent_data["model_id"])
+    client = load_model_client(agent_data["model_id"], prompts_dir=prompts_dir)
    client.max_tokens = agent_data.get("max_tokens", 16000) # Default for older saves
    agent = DiplomacyAgent(
        power_name=agent_data["power_name"],
        client=client,
        initial_goals=agent_data.get("goals", []),
-        initial_relationships=agent_data.get("relationships", None)
+        initial_relationships=agent_data.get("relationships", None),
+        prompts_dir=prompts_dir
    )
    # Restore the diary.
    agent.full_private_diary = agent_data.get("full_private_diary", [])
@ -179,7 +180,7 @@ def load_game_state(
    run_dir: str,
    game_file_name: str,
    resume_from_phase: Optional[str] = None
-) -> Tuple[Game, Dict[str, DiplomacyAgent], GameHistory, Namespace]:
+) -> Tuple[Game, Dict[str, DiplomacyAgent], GameHistory, Optional[Namespace]]:
    """Loads and reconstructs the game state from a saved game file."""
    game_file_path = os.path.join(run_dir, game_file_name)
    if not os.path.exists(game_file_path):
@ -189,6 +190,14 @@ def load_game_state(
    with open(game_file_path, 'r') as f:
        saved_game_data = json.load(f)
+    # Find the latest config saved in the file
+    run_config = None
+    if saved_game_data.get("phases"):
+        for phase in reversed(saved_game_data["phases"]):
+            if "config" in phase:
+                run_config = Namespace(**phase["config"])
+                logger.info(f"Loaded run configuration from phase {phase['name']}.")
+                break
    # If resuming, find the specified phase and truncate the data after it
    if resume_from_phase:
@ -231,8 +240,6 @@ def load_game_state(
        logger.info("No previous phases found. Initializing fresh agents and history.")
        agents = {} # Will be created by the main loop
        game_history = GameHistory()
-        # No config to load, will use current run's args
-        run_config = None 
    else:
        # We save the game state up to & including the current (uncompleted) phase.
        # So we need to grab the agent state from the previous (completed) phase.
@ -245,8 +252,9 @@ def load_game_state(
        agents = {}
        if 'state_agents' in last_phase_data:
            logger.info("Rebuilding agents from saved state...")
+            prompts_dir_from_config = run_config.prompts_dir if run_config and hasattr(run_config, 'prompts_dir') else None
            for power_name, agent_data in last_phase_data['state_agents'].items():
-                agents[power_name] = deserialize_agent(agent_data)
+                agents[power_name] = deserialize_agent(agent_data, prompts_dir=prompts_dir_from_config)
            logger.info(f"Rebuilt {len(agents)} agents.")
        else:
            raise ValueError("Cannot resume: 'state_agents' key not found in the last phase of the save file.")
@ -267,7 +275,7 @@ def load_game_state(
        logger.info("Game history rebuilt.")
-    return game, agents, game_history
+    return game, agents, game_history, run_config
 async def initialize_new_game(
@ -308,12 +316,12 @@ async def initialize_new_game(
    for power_name, model_id in game.power_model_map.items():
        if not game.powers[power_name].is_eliminated():
            try:
-                client = load_model_client(model_id)
+                client = load_model_client(model_id, prompts_dir=args.prompts_dir)
                client.max_tokens = model_max_tokens[power_name]
-                agent = DiplomacyAgent(power_name=power_name, client=client)
+                agent = DiplomacyAgent(power_name=power_name, client=client, prompts_dir=args.prompts_dir)
                agents[power_name] = agent
                logger.info(f"Preparing initialization task for {power_name} with model {model_id}")
-                initialization_tasks.append(initialize_agent_state_ext(agent, game, game_history, llm_log_file_path))
+                initialization_tasks.append(initialize_agent_state_ext(agent, game, game_history, llm_log_file_path, prompts_dir=args.prompts_dir))
            except Exception as e:
                logger.error(f"Failed to create agent or client for {power_name} with model {model_id}: {e}", exc_info=True)
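The config lookup added to `load_game_state` scans the saved phases from newest to oldest and keeps the first `config` it meets. A self-contained sketch of that scan follows; `latest_run_config` and the sample save data are made up for illustration, while the `phases` and `config` keys and the `Namespace` conversion follow the hunk above.

```python
from argparse import Namespace
from typing import Optional


def latest_run_config(saved_game_data: dict) -> Optional[Namespace]:
    """Return the most recent per-phase config, or None if no phase stored one."""
    for phase in reversed(saved_game_data.get("phases") or []):
        if "config" in phase:
            return Namespace(**phase["config"])
    return None


# Hypothetical save file: only some phases carry a config entry.
saved = {
    "phases": [
        {"name": "S1901M", "config": {"prompts_dir": "prompts/a", "max_year": 1905}},
        {"name": "F1901M"},
        {"name": "W1901A", "config": {"prompts_dir": "prompts/b", "max_year": 1910}},
        {"name": "S1902M"},
    ]
}

cfg = latest_run_config(saved)
# getattr with a default covers both a missing config and an older save
# whose config has no prompts_dir field.
prompts_dir = getattr(cfg, "prompts_dir", None)
print(prompts_dir)

empty_cfg = latest_run_config({"phases": []})
print(getattr(empty_cfg, "prompts_dir", None))
```

One consequence worth noting: agents rebuilt on resume take `prompts_dir` from the saved config, not from the current command line.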
--- a/ai_diplomacy/initialization.py
+++ b/ai_diplomacy/initialization.py
@ -1,6 +1,7 @@
 # ai_diplomacy/initialization.py
 import logging
 import json
+from typing import Optional
 # Forward declaration for type hinting, actual imports in function if complex
 if False: # TYPE_CHECKING
@ -18,7 +19,8 @@ async def initialize_agent_state_ext(
    agent: 'DiplomacyAgent', 
    game: 'Game', 
    game_history: 'GameHistory', 
-    log_file_path: str
+    log_file_path: str,
+    prompts_dir: Optional[str] = None,
 ):
    """Uses the LLM to set initial goals and relationships for the agent."""
    power_name = agent.power_name
@ -56,7 +58,8 @@ async def initialize_agent_state_ext(
            game_history=game_history, 
            agent_goals=None, 
            agent_relationships=None, 
-            agent_private_diary=formatted_diary, 
+            agent_private_diary=formatted_diary,
+            prompts_dir=prompts_dir,
        )
        full_prompt = initial_prompt + "\n\n" + context
--- a/ai_diplomacy/prompt_constructor.py
+++ b/ai_diplomacy/prompt_constructor.py
@ -23,6 +23,7 @@ def build_context_prompt(
    agent_goals: Optional[List[str]] = None,
    agent_relationships: Optional[Dict[str, str]] = None,
    agent_private_diary: Optional[str] = None,
+    prompts_dir: Optional[str] = None,
 ) -> str:
    """Builds the detailed context part of the prompt.
@ -35,11 +36,12 @@ def build_context_prompt(
        agent_goals: Optional list of agent's goals.
        agent_relationships: Optional dictionary of agent's relationships with other powers.
        agent_private_diary: Optional string of agent's private diary.
+        prompts_dir: Optional path to the prompts directory.
    Returns:
        A string containing the formatted context.
    """
-    context_template = load_prompt("context_prompt.txt")
+    context_template = load_prompt("context_prompt.txt", prompts_dir=prompts_dir)
    # === Agent State Debug Logging ===
    if agent_goals:
@ -112,6 +114,7 @@ def construct_order_generation_prompt(
    agent_goals: Optional[List[str]] = None,
    agent_relationships: Optional[Dict[str, str]] = None,
    agent_private_diary_str: Optional[str] = None,
+    prompts_dir: Optional[str] = None,
 ) -> str:
    """Constructs the final prompt for order generation.
@ -125,13 +128,14 @@ def construct_order_generation_prompt(
        agent_goals: Optional list of agent's goals.
        agent_relationships: Optional dictionary of agent's relationships with other powers.
        agent_private_diary_str: Optional string of agent's private diary.
+        prompts_dir: Optional path to the prompts directory.
    Returns:
        A string containing the complete prompt for the LLM.
    """
    # Load prompts
-    _ = load_prompt("few_shot_example.txt") # Loaded but not used, as per original logic
+    _ = load_prompt("few_shot_example.txt", prompts_dir=prompts_dir) # Loaded but not used, as per original logic
-    instructions = load_prompt("order_instructions.txt")
+    instructions = load_prompt("order_instructions.txt", prompts_dir=prompts_dir)
    # Build the context prompt
    context = build_context_prompt(
@ -143,7 +147,8 @@ def construct_order_generation_prompt(
        agent_goals=agent_goals,
        agent_relationships=agent_relationships,
        agent_private_diary=agent_private_diary_str,
        prompts_dir=prompts_dir,
    )
    final_prompt = system_prompt + "\n\n" + context + "\n\n" + instructions
-    return final_prompt
+    return final_prompt
--- a/ai_diplomacy/utils.py
+++ b/ai_diplomacy/utils.py
@ -293,11 +293,14 @@ def normalize_and_compare_orders(
 # Helper to load prompt text from file relative to the expected 'prompts' dir
-def load_prompt(filename: str) -> str:
+def load_prompt(filename: str, prompts_dir: Optional[str] = None) -> str:
    """Helper to load prompt text from file"""
-    # Assuming execution from the root or that the path resolves correctly
+    if prompts_dir:
-    # Consider using absolute paths or pkg_resources if needed for robustness
+        prompt_path = os.path.join(prompts_dir, filename)
-    prompt_path = os.path.join(os.path.dirname(__file__), 'prompts', filename)
+    else:
        # Default behavior: relative to this file's location in the 'prompts' subdir
        prompt_path = os.path.join(os.path.dirname(__file__), 'prompts', filename)
    try:
        with open(prompt_path, "r", encoding='utf-8') as f: # Added encoding
            return f.read().strip()
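Review note on the `load_prompt` change above: a minimal sketch of the lookup rule, assuming nothing beyond what the hunk shows. `DEFAULT_PROMPTS_DIR`, `resolve_prompt_path` and `load_prompt_demo` are hypothetical names used only for illustration; the real function resolves its default relative to `utils.py`. Because the patch tests `if prompts_dir:`, an empty string falls back to the default directory just like `None`.

```python
import os
import tempfile
from typing import Optional

# Hypothetical stand-in for the packaged ai_diplomacy/prompts directory.
DEFAULT_PROMPTS_DIR = tempfile.mkdtemp(prefix="default_prompts_")


def resolve_prompt_path(filename: str, prompts_dir: Optional[str] = None) -> str:
    """Same branch as the patched load_prompt: a truthy prompts_dir wins."""
    base = prompts_dir if prompts_dir else DEFAULT_PROMPTS_DIR
    return os.path.join(base, filename)


def load_prompt_demo(filename: str, prompts_dir: Optional[str] = None) -> str:
    """Read and strip a prompt file from the resolved location."""
    with open(resolve_prompt_path(filename, prompts_dir), "r", encoding="utf-8") as f:
        return f.read().strip()


if __name__ == "__main__":
    custom_dir = tempfile.mkdtemp(prefix="custom_prompts_")
    for folder, text in ((DEFAULT_PROMPTS_DIR, "default\n"), (custom_dir, "variant\n")):
        with open(os.path.join(folder, "context_prompt.txt"), "w", encoding="utf-8") as f:
            f.write(text)
    print(load_prompt_demo("context_prompt.txt"))              # default
    print(load_prompt_demo("context_prompt.txt", custom_dir))  # variant
```

A custom directory replaces the default wholesale in this scheme, so it has to contain every prompt file the game loads; there is no per-file fallback.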
--- a/experiment_runner.py
+++ b/experiment_runner.py
@ -0,0 +1,474 @@
 #!/usr/bin/env python3
 """
 Experiment orchestration for Diplomacy self-play.
 Launches many `lm_game` runs in parallel, captures their artefacts,
 and executes a pluggable post-analysis pipeline.
 Run  `python experiment_runner.py --help`  for CLI details.
 """
 from __future__ import annotations
 import argparse
 import collections
 import concurrent.futures
 import importlib
 import json
 import logging
 import math
 import os
 import shutil
 import subprocess
 import sys
 import textwrap
 import time
 import multiprocessing as mp
 from datetime import datetime
 from pathlib import Path
 from types import SimpleNamespace
 from typing import Iterable, List
 # --------------------------------------------------------------------------- #
 #  Logging                                                                    #
 # --------------------------------------------------------------------------- #
 LOG_FMT = "%(asctime)s [%(levelname)s] %(name)s - %(message)s"
 logging.basicConfig(level=logging.INFO, format=LOG_FMT, datefmt="%H:%M:%S")
 log = logging.getLogger("experiment_runner")
 # ────────────────────────────────────────────────────────────────────────────
 #  Flag definitions – full, un-shortened help strings                        #
 # ────────────────────────────────────────────────────────────────────────────
 def _add_experiment_flags(p: argparse.ArgumentParser) -> None:
    p.add_argument(
        "--experiment_dir",
        type=Path,
        required=True,
        help=(
            "Directory that will hold all experiment artefacts. "
            "A 'runs/' sub-folder is created for individual game runs and an "
            "'analysis/' folder for aggregated outputs.  Must be writable."
        ),
    )
    p.add_argument(
        "--iterations",
        type=int,
        default=1,
        help=(
            "Number of lm_game instances to launch for this experiment.  "
            "Each instance gets its own sub-directory under runs/."
        ),
    )
    p.add_argument(
        "--parallel",
        type=int,
        default=1,
        help=(
            "Maximum number of game instances to run concurrently.  "
            "Uses a ProcessPoolExecutor under the hood."
        ),
    )
    p.add_argument(
        "--analysis_modules",
        type=str,
        default="summary",
        help=(
            "Comma-separated list of analysis module names to execute after all "
            "runs finish.  Modules are imported from "
            "'experiment_runner.analysis.<name>' and must expose "
            "run(experiment_dir: Path, ctx: dict)."
        ),
    )
    p.add_argument(
        "--critical_state_base_run",
        type=Path,
        default=None,
        help=(
            "Path to an *existing* run directory produced by a previous lm_game "
            "execution.  When supplied, every iteration resumes from that "
            "snapshot using lm_game's --critical_state_analysis_dir mechanism."
        ),
    )
    p.add_argument(
        "--seed_base",
        type=int,
        default=42,
        help=(
            "Base RNG seed.  Run i will receive seed = seed_base + i.  "
            "Forwarded to lm_game as --seed.  lm_game currently declares only a "
            "placeholder --seed_base flag and ignores the value, so runs are "
            "not deterministic yet."
        ),
    )
 def _add_lm_game_flags(p: argparse.ArgumentParser) -> None:
    # ---- all flags copied verbatim from lm_game.parse_arguments() ----
    p.add_argument(
        "--resume_from_phase",
        type=str,
        default="",
        help=(
            "Phase to resume from (e.g., 'S1902M'). Requires --run_dir. "
            "IMPORTANT: This option clears any existing phase results ahead of "
            "& including the specified resume phase."
        ),
    )
    p.add_argument(
        "--end_at_phase",
        type=str,
        default="",
        help="Phase to end the simulation after (e.g., 'F1905M').",
    )
    p.add_argument(
        "--max_year",
        type=int,
        default=1910,  # Increased default in lm_game
        help="Maximum year to simulate. The game will stop once this year is reached.",
    )
    p.add_argument(
        "--num_negotiation_rounds",
        type=int,
        default=0,
        help="Number of negotiation rounds per phase.",
    )
    p.add_argument(
        "--models",
        type=str,
        default="",
        help=(
            "Comma-separated list of model names to assign to powers in order. "
            "The order is: AUSTRIA, ENGLAND, FRANCE, GERMANY, ITALY, RUSSIA, TURKEY."
        ),
    )
    p.add_argument(
        "--planning_phase",
        action="store_true",
        help="Enable the planning phase for each power to set strategic directives.",
    )
    p.add_argument(
        "--max_tokens",
        type=int,
        default=16000,
        help="Maximum number of new tokens to generate per LLM call (default: 16000).",
    )
    p.add_argument(
        "--max_tokens_per_model",
        type=str,
        default="",
        help=(
            "Comma-separated list of 7 token limits (in order: AUSTRIA, ENGLAND, "
            "FRANCE, GERMANY, ITALY, RUSSIA, TURKEY). Overrides --max_tokens."
        ),
    )
    p.add_argument(
        "--prompts_dir",
        type=str,
        default=None,
        help=(
            "Path to the directory containing prompt files. "
            "Defaults to the packaged prompts directory."
        ),
    )
 # ────────────────────────────────────────────────────────────────────────────
 #  One combined parser for banner printing                                    #
 # ────────────────────────────────────────────────────────────────────────────
 def _build_full_parser() -> argparse.ArgumentParser:
    fp = argparse.ArgumentParser(
        prog="experiment_runner.py",
        formatter_class=lambda prog: argparse.RawTextHelpFormatter(
            prog, max_help_position=45
        ),
        description=(
            "Batch-runner for Diplomacy self-play experiments.  "
            "All lm_game flags are accepted here as-is; they are validated "
            "before any game runs start."
        ),
    )
    _add_experiment_flags(fp)
    _add_lm_game_flags(fp)
    return fp
 # ────────────────────────────────────────────────────────────────────────────
 #  Robust parsing that always shows *full* help on error                      #
 # ────────────────────────────────────────────────────────────────────────────
 def _parse_cli() -> tuple[argparse.Namespace, list[str], argparse.Namespace]:
    full_parser = _build_full_parser()
    # Show full banner when no args
    if len(sys.argv) == 1:
        full_parser.print_help(sys.stderr)
        sys.exit(2)
    # Show full banner on explicit help
    if any(tok in ("-h", "--help") for tok in sys.argv[1:]):
        full_parser.print_help(sys.stderr)
        sys.exit(0)
    # Sub-parsers for separating experiment vs game flags
    class _ErrParser(argparse.ArgumentParser):
        def error(self, msg):
            full_parser.print_help(sys.stderr)
            self.exit(2, f"{self.prog}: error: {msg}\n")
    exp_parser  = _ErrParser(add_help=False)
    game_parser = _ErrParser(add_help=False)
    _add_experiment_flags(exp_parser)
    _add_lm_game_flags(game_parser)
    # Split argv tokens by flag ownership
    argv = sys.argv[1:]
    exp_flag_set = {opt for a in exp_parser._actions for opt in a.option_strings}
    exp_tok, game_tok, i = [], [], 0
    while i < len(argv):
        tok = argv[i]
        if tok in exp_flag_set:
            exp_tok.append(tok)
            action = exp_parser._option_string_actions[tok]
            needs_val = (
                action.nargs is None                         # default: exactly one value
                or (isinstance(action.nargs, int) and action.nargs > 0)
                or action.nargs in ("+", "*", "?")           # variable-length cases
            )
            if needs_val and i + 1 < len(argv):
                exp_tok.append(argv[i + 1])
                i += 2
            else:                                            # store_true / store_false, or flag is the last token
                i += 1                                       # (argparse reports the missing value)
        else:
            game_tok.append(tok)
            i += 1
    exp_args  = exp_parser.parse_args(exp_tok)
    game_args = game_parser.parse_args(game_tok)
    return exp_args, game_tok, game_args
 # --------------------------------------------------------------------------- #
 #  Helpers                                                                    #
 # --------------------------------------------------------------------------- #
 _RunInfo = collections.namedtuple(
    "_RunInfo", "index run_dir seed cmd_line returncode elapsed_s"
 )
 def _mk_run_dir(exp_dir: Path, idx: int) -> Path:
    run_dir = exp_dir / "runs" / f"run_{idx:05d}"
    # Just ensure it exists; don't raise if it already does.
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
 def _dump_seed(seed: int, run_dir: Path) -> None:
    seed_file = run_dir / "seed.txt"
    if not seed_file.exists():
        seed_file.write_text(str(seed))
 def _build_cmd(
    lm_game_script: Path,
    base_cli: List[str],
    run_dir: Path,
    seed: int,
    critical_base: Path | None,
    resume_from_phase: str,
 ) -> List[str]:
    """
    Returns a list suitable for subprocess.run([...]).
    """
    cmd = [sys.executable, str(lm_game_script)]
    # Forward user CLI
    cmd.extend(base_cli)
    # Per-run mandatory overrides
    cmd.extend(["--run_dir", str(run_dir)])
    cmd.extend(["--seed", str(seed)])  # lm_game declares --seed_base (currently unused); --seed reaches it only if argparse prefix matching applies
    # Critical-state mode
    if critical_base:
        cmd.extend([
            "--critical_state_analysis_dir", str(run_dir),
            "--run_dir", str(critical_base)  # base run dir (already completed)
        ])
        if resume_from_phase:
            cmd.extend(["--resume_from_phase", resume_from_phase])
    return cmd
 def _launch_one(args) -> _RunInfo:
    """
    Worker executed inside a ProcessPool; runs one game via subprocess.
    """
    (
        idx,
        lm_game_script,
        base_cli,
        run_dir,
        seed,
        critical_base,
        resume_phase,
    ) = args
    cmd = _build_cmd(
        lm_game_script, base_cli, run_dir, seed, critical_base, resume_phase
    )
    start = time.perf_counter()
    log.debug("Run %05d: CMD = %s", idx, " ".join(cmd))
    # Write out full command for traceability
    (run_dir / "command.txt").write_text(" ".join(cmd))
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            check=False,
        )
        (run_dir / "console.log").write_text(result.stdout)
        rc = result.returncode
    except Exception as exc:  # noqa: broad-except
        (run_dir / "console.log").write_text(f"Exception launching run:\n{exc}\n")
        rc = 1
    elapsed = time.perf_counter() - start
    return _RunInfo(idx, run_dir, seed, " ".join(cmd), rc, elapsed)
 def _load_analysis_fns(module_names: Iterable[str]):
    """
    Dynamically import analysis modules.
    Each module must expose `run(experiment_dir: Path, ctx: dict)`.
    """
    for name in module_names:
        mod_name = f"experiment_runner.analysis.{name.strip()}"
        try:
            mod = importlib.import_module(mod_name)
        except ModuleNotFoundError as e:
            log.warning("Analysis module %s not found (%s) – skipping", mod_name, e)
            continue
        if not hasattr(mod, "run"):
            log.warning("%s has no `run()` function – skipping", mod_name)
            continue
        yield mod.run
 # --------------------------------------------------------------------------- #
 #  Main driver                                                                #
 # --------------------------------------------------------------------------- #
 def main() -> None:
    exp_args, leftover_cli, game_args = _parse_cli()
    exp_dir: Path = exp_args.experiment_dir.expanduser().resolve()
    if exp_dir.exists():
        log.info("Appending to existing experiment: %s", exp_dir)
    exp_dir.mkdir(parents=True, exist_ok=True)
    # Persist experiment-level config
    cfg_path = exp_dir / "config.json"
    if not cfg_path.exists():                     # ← new guard
        with cfg_path.open("w", encoding="utf-8") as fh:
            json.dump(
                {"experiment": vars(exp_args),
                "lm_game": vars(game_args),
                "forwarded_cli": leftover_cli},
                fh, indent=2, default=str,
            )
        log.info("Config saved to %s", cfg_path)
    else:
        log.info("Config already exists – leaving unchanged")
    # ------------------------------------------------------------------ #
    #  Launch games                                                      #
    # ------------------------------------------------------------------ #
    lm_game_script = Path(__file__).parent / "lm_game.py"
    if not lm_game_script.exists():
        log.error("lm_game.py not found at %s – abort", lm_game_script)
        sys.exit(1)
    run_args = []
    for i in range(exp_args.iterations):
        run_dir = _mk_run_dir(exp_dir, i)
        seed = exp_args.seed_base + i
        _dump_seed(seed, run_dir)
        run_args.append(
            (
                i, lm_game_script, leftover_cli, run_dir, seed,
                exp_args.critical_state_base_run,
                game_args.resume_from_phase,
            )
        )
    log.info(
        "Launching %d runs (max %d parallel, critical_state=%s)",
        exp_args.iterations,
        exp_args.parallel,
        bool(exp_args.critical_state_base_run),
    )
    runs_meta: list[_RunInfo] = []
    with concurrent.futures.ProcessPoolExecutor(
        max_workers=exp_args.parallel,
        mp_context=mp.get_context("spawn"),
    ) as pool:
        for res in pool.map(_launch_one, run_args):
            runs_meta.append(res)
            status = "OK" if res.returncode == 0 else f"RC={res.returncode}"
            log.info(
                "run_%05d finished in %.1fs %s", res.index, res.elapsed_s, status
            )
    # Persist per-run status summary
    summary_path = exp_dir / "runs_summary.json"
    with open(summary_path, "w", encoding="utf-8") as fh:
        json.dump([res._asdict() for res in runs_meta], fh, indent=2, default=str)
    log.info("Run summary written → %s", summary_path)
    # ------------------------------------------------------------------ #
    #  Post-analysis pipeline                                            #
    # ------------------------------------------------------------------ #
    mods = list(_load_analysis_fns(exp_args.analysis_modules.split(",")))
    if not mods:
        log.warning("No analysis modules loaded – done.")
        return
    analysis_root = exp_dir / "analysis"
    if analysis_root.exists():
        shutil.rmtree(analysis_root)      # ← wipes old outputs
    analysis_root.mkdir(exist_ok=True)
    # Collect common context
    ctx: dict = {
        "exp_dir": str(exp_dir),
        "runs": [str(r.run_dir) for r in runs_meta],
        "critical_state_base": str(exp_args.critical_state_base_run or ""),
        "resume_from_phase": game_args.resume_from_phase,
    }
    for fn in mods:
        name = fn.__module__.rsplit(".", 1)[-1]
        log.info("Running analysis module: %s", name)
        try:
            fn(exp_dir, ctx)
            log.info("✓ %s complete", name)
        except Exception as exc:  # noqa: broad-except
            log.exception("Analysis module %s failed: %s", name, exc)
    log.info("Experiment finished – artefacts in %s", exp_dir)
 if __name__ == "__main__":
    main()
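Review note on `_parse_cli` in the file above: the token splitter reads `argv[i + 1]` without a bounds check and does not recognise the `--flag=value` spelling. The following is a sketch of one way to handle both cases, not the runner's actual code. `split_argv` is a hypothetical helper and `--dry` is a hypothetical flag used only to show a valueless option; like the runner, the sketch relies on the private `_option_string_actions` attribute of `argparse.ArgumentParser`.

```python
import argparse
from typing import List, Tuple


def split_argv(argv: List[str], owner: argparse.ArgumentParser) -> Tuple[List[str], List[str]]:
    """Return (tokens owned by `owner`, all remaining tokens)."""
    actions = owner._option_string_actions  # private attribute, also used by the runner
    owned: List[str] = []
    rest: List[str] = []
    i = 0
    while i < len(argv):
        tok = argv[i]
        flag, has_inline_value, _ = tok.partition("=")
        if flag in actions:
            owned.append(tok)
            takes_value = actions[flag].nargs != 0  # store_true/store_false use nargs=0
            # Consume the following token as the value unless it was given inline
            # or the flag is the last token (argparse then reports the error).
            if takes_value and not has_inline_value and i + 1 < len(argv):
                owned.append(argv[i + 1])
                i += 1
        else:
            rest.append(tok)
        i += 1
    return owned, rest


if __name__ == "__main__":
    exp = argparse.ArgumentParser(add_help=False)
    exp.add_argument("--iterations", type=int, default=1)
    exp.add_argument("--parallel", type=int, default=1)
    exp.add_argument("--dry", action="store_true")  # hypothetical flag
    mine, theirs = split_argv(
        ["--iterations", "3", "--max_year", "1905", "--dry", "--parallel=2"], exp
    )
    print(mine)    # ['--iterations', '3', '--dry', '--parallel=2']
    print(theirs)  # ['--max_year', '1905']
```

A trailing `--iterations` with no value is passed through unchanged, so `parse_args` produces the usual "expected one argument" error instead of an `IndexError`.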
--- a/experiment_runner/__init__.py
+++ b/experiment_runner/__init__.py
--- a/experiment_runner/analysis/__init__.py
+++ b/experiment_runner/analysis/__init__.py
--- a/experiment_runner/analysis/critical_state.py
+++ b/experiment_runner/analysis/critical_state.py
@ -0,0 +1,60 @@
 """
 Extracts the board state *before* a critical phase (the phase named
 ctx["resume_from_phase"]) and *after* it (the last phase recorded in the run)
 for every critical-analysis run produced by experiment_runner.
 Each run must have:
 • lmvsgame.json   – containing a phase named ctx["resume_from_phase"]
 Outputs live in  analysis/critical_state/<run_name>_{before|after}.json
 """
 from __future__ import annotations
 import json
 import logging
 from pathlib import Path
 from typing import Dict, List
 log = logging.getLogger(__name__)
 def _phase_by_name(phases: List[dict], name: str) -> dict | None:
    for ph in phases:
        if ph["state"]["name"] == name:
            return ph
    return None
 def run(exp_dir: Path, ctx: Dict) -> None:
    resume_phase = ctx.get("resume_from_phase")
    if not resume_phase:
        log.info("critical_state: --resume_from_phase not supplied – skipping")
        return
    out_dir = exp_dir / "analysis" / "critical_state"
    out_dir.mkdir(parents=True, exist_ok=True)
    for run_dir in (exp_dir / "runs").iterdir():
        game_json = run_dir / "lmvsgame.json"
        if not game_json.exists():
            continue
        with game_json.open("r", encoding="utf-8") as fh:
            game = json.load(fh)
        phases = game.get("phases", [])
        if not phases:
            continue
        before = _phase_by_name(phases, resume_phase)
        after = phases[-1] if phases else None
        if before is None or after is None:
            log.warning("Run %s missing expected phases – skipped", run_dir.name)
            continue
        (out_dir / f"{run_dir.name}_before.json").write_text(
            json.dumps(before, indent=2)
        )
        (out_dir / f"{run_dir.name}_after.json").write_text(
            json.dumps(after, indent=2)
        )
    log.info("critical_state: snapshots written to %s", out_dir)
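Review note on the plug-in contract used by `critical_state.py` above: the runner imports `experiment_runner.analysis.<name>` and calls `run(exp_dir, ctx)`, where `ctx` carries `exp_dir`, `runs`, `critical_state_base` and `resume_from_phase`. The sketch below shows what a further module could look like under that contract. The module purpose and the output file `run_count.json` are hypothetical and not part of this PR.

```python
import json
import tempfile
from pathlib import Path


def run(exp_dir: Path, ctx: dict) -> None:
    """Count the runs that produced a game file and record the result."""
    out_dir = exp_dir / "analysis"
    out_dir.mkdir(parents=True, exist_ok=True)
    runs = ctx.get("runs", [])
    finished = [r for r in runs if (Path(r) / "lmvsgame.json").exists()]
    payload = {"total_runs": len(runs), "with_game_file": len(finished)}
    # run_count.json is a hypothetical output name chosen for this sketch.
    (out_dir / "run_count.json").write_text(json.dumps(payload, indent=2), encoding="utf-8")


if __name__ == "__main__":
    exp = Path(tempfile.mkdtemp())
    done, empty = exp / "runs" / "run_00000", exp / "runs" / "run_00001"
    done.mkdir(parents=True)
    empty.mkdir(parents=True)
    (done / "lmvsgame.json").write_text("{}", encoding="utf-8")
    run(exp, {"runs": [str(done), str(empty)]})
    print((exp / "analysis" / "run_count.json").read_text(encoding="utf-8"))
```

Saved as a file in `experiment_runner/analysis/`, such a module would presumably be selected by listing its name in `--analysis_modules`.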
--- a/experiment_runner/analysis/summary.py
+++ b/experiment_runner/analysis/summary.py
@ -0,0 +1,177 @@
 """
 Aggregates results across all runs and writes:
 • analysis/aggregated_results.csv
 • analysis/score_summary_by_power.csv
 • analysis/results_summary.png        (if matplotlib and seaborn are available)
 """
 from __future__ import annotations
 import json
 import logging
 import re
 from pathlib import Path
 from typing import Any, Dict, List
 import pandas as pd
 log = logging.getLogger(__name__)
 # --------------------------------------------------------------------------- #
 #  Helpers copied (and lightly cleaned) from run_games.py                     #
 # --------------------------------------------------------------------------- #
 def _extract_diplomacy_results(game_json: Path) -> Dict[str, Dict[str, Any]]:
    with game_json.open("r", encoding="utf-8") as fh:
        gd = json.load(fh)
    phases = gd.get("phases", [])
    if not phases:
        raise ValueError("no phases")
    first_state = phases[0]["state"]
    last_state = phases[-1]["state"]
    powers = (
        list(first_state.get("homes", {}).keys())
        or list(first_state.get("centers", {}).keys())
    )
    if not powers:
        raise ValueError("cannot determine powers")
    scs_to_win = 18
    solo_winner = next(
        (p for p, sc in last_state["centers"].items() if len(sc) >= scs_to_win), None
    )
    results: Dict[str, Dict[str, Any]] = {}
    for p in powers:
        sc_count = len(last_state["centers"].get(p, []))
        units = len(last_state["units"].get(p, []))
        if solo_winner:
            if p == solo_winner:
                outcome, cat = "Won solo", "Solo Win"
            else:
                outcome, cat = f"Lost to {solo_winner}", "Loss"
        elif sc_count == 0 and units == 0:
            outcome, cat = "Eliminated", "Eliminated"
        else:
            outcome, cat = "Ongoing/Draw", "Ongoing/Abandoned/Draw"
        results[p] = {
            "OutcomeCategory": cat,
            "StatusDetail": outcome,
            "SupplyCenters": sc_count,
            "LastPhase": last_state["name"],
        }
    return results
 # simplistic "Diplobench" scoring from previous discussion ------------------ #
 def _year(name: str) -> int | None:
    m = re.search(r"(\d{4})", name)
    return int(m.group(1)) if m else None
 def _score_game(game_json: Path) -> Dict[str, int]:
    with game_json.open("r", encoding="utf-8") as fh:
        game = json.load(fh)
    phases = game["phases"]
    if not phases:
        return {}
    start = _year(phases[0]["state"]["name"])
    end_year = _year(phases[-1]["state"]["name"]) or start
    max_turns = (end_year - start + 1) if start is not None else len(phases)
    last_state = phases[-1]["state"]
    solo_winner = next(
        (p for p, scs in last_state["centers"].items() if len(scs) >= 18), None
    )
    elim_turn: Dict[str, int | None] = {}
    for p in last_state["centers"].keys():
        e_turn = None
        for idx, ph in enumerate(phases):
            if not ph["state"]["centers"].get(p, []):
                yr = _year(ph["state"]["name"]) or 0
                e_turn = (yr - start + 1) if start is not None else idx + 1
                break
        elim_turn[p] = e_turn
    scores: Dict[str, int] = {}
    for p, scs in last_state["centers"].items():
        if p == solo_winner:
            win_turn = (end_year - start + 1) if start is not None else len(phases)
            scores[p] = max_turns + 17 + (max_turns - win_turn)
        elif solo_winner:
            # losers in a solo game
            yr_win = _year(phases[-1]["state"]["name"]) or end_year
            turn_win = (yr_win - start + 1) if start is not None else len(phases)
            scores[p] = turn_win
        else:
            if elim_turn[p] is None:
                scores[p] = max_turns + len(scs)
            else:
                scores[p] = elim_turn[p]
    return scores
 # --------------------------------------------------------------------------- #
 #  Public entry point                                                         #
 # --------------------------------------------------------------------------- #
 def run(exp_dir: Path, ctx: dict):  # pylint: disable=unused-argument
    analysis_dir = exp_dir / "analysis"
    analysis_dir.mkdir(exist_ok=True)
    rows: List[Dict[str, Any]] = []
    for run_dir in (exp_dir / "runs").iterdir():
        game_json = run_dir / "lmvsgame.json"
        if not game_json.exists():
            continue
        gid = run_dir.name
        try:
            res = _extract_diplomacy_results(game_json)
        except Exception as e:  # noqa: broad-except
            log.warning("Could not parse %s (%s)", game_json, e)
            continue
        scores = _score_game(game_json)
        for pwr, info in res.items():
            out = {"GameID": gid, "Power": pwr, **info, "Score": scores.get(pwr, None)}
            rows.append(out)
    if not rows:
        log.warning("summary: no parsable runs found")
        return
    df = pd.DataFrame(rows)
    out_csv = analysis_dir / "aggregated_results.csv"
    df.to_csv(out_csv, index=False)
    summary = (
        df.groupby("Power")["Score"]
        .agg(["mean", "median", "count"])
        .reset_index()
        .rename(columns={"count": "n"})
    )
    summary.to_csv(analysis_dir / "score_summary_by_power.csv", index=False)
    log.info("summary: wrote %s rows to %s", len(df), out_csv)
    # Optional charts
    try:
        import matplotlib.pyplot as plt
        import seaborn as sns
        sns.set_style("whitegrid")
        plt.figure(figsize=(10, 7))
        sns.boxplot(x="Power", y="SupplyCenters", data=df, palette="pastel")
        plt.title("Supply-center distribution")
        plt.savefig(analysis_dir / "results_summary.png", dpi=150)
        plt.close()
        log.info("summary: chart saved")
    except Exception as e:  # noqa: broad-except
        log.debug("Chart generation skipped (%s)", e)
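Review note on `_score_game` in the file above: a worked example of the scoring rule on two tiny synthetic games. This is a simplified sketch, assuming every phase name contains a four-digit year and using a flattened phase layout (`{"name", "centers"}`) rather than the real `lmvsgame.json` structure; `phase` and `score_phases` are hypothetical names. In the original, `win_turn` and `max_turns` are computed from the same expression, so the bonus term `max_turns - win_turn` appears to be zero in every case; the sketch therefore omits it.

```python
import re
from typing import Dict, List, Optional

SOLO_THRESHOLD = 18  # supply centers needed for a solo win, as in summary.py


def year_of(name: str) -> Optional[int]:
    m = re.search(r"(\d{4})", name)
    return int(m.group(1)) if m else None


def phase(name: str, **counts: int) -> dict:
    """Build a synthetic phase; each power holds `n` placeholder centers."""
    return {"name": name, "centers": {p: ["sc"] * n for p, n in counts.items()}}


def score_phases(phases: List[dict]) -> Dict[str, int]:
    start = year_of(phases[0]["name"])
    max_turns = year_of(phases[-1]["name"]) - start + 1
    final = phases[-1]["centers"]
    solo = next((p for p, c in final.items() if len(c) >= SOLO_THRESHOLD), None)
    scores: Dict[str, int] = {}
    for p, centers in final.items():
        if p == solo:
            scores[p] = max_turns + 17          # solo winner
        elif solo:
            scores[p] = max_turns               # everyone else in a solo game
        else:
            # First year in which the power held no centers, if any.
            elim = next(
                (year_of(ph["name"]) - start + 1 for ph in phases if not ph["centers"].get(p)),
                None,
            )
            scores[p] = max_turns + len(centers) if elim is None else elim
    return scores


draw_game = [
    phase("S1901M", FRANCE=3, ITALY=3),
    phase("F1901M", FRANCE=3, ITALY=2),
    phase("S1902M", FRANCE=4, ITALY=0),
    phase("F1903M", FRANCE=5, ITALY=0),
]
solo_game = [phase("S1901M", FRANCE=3, ITALY=3), phase("F1903M", FRANCE=18, ITALY=2)]

if __name__ == "__main__":
    print(score_phases(draw_game))  # {'FRANCE': 8, 'ITALY': 2}
    print(score_phases(solo_game))  # {'FRANCE': 20, 'ITALY': 3}
```

In the draw game the span is 1901 to 1903, so `max_turns` is 3: France survives with 5 centers and scores 3 + 5 = 8, while Italy first holds no centers in 1902 and scores its elimination turn, 2.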
--- a/lm_game.py
+++ b/lm_game.py
@ -55,6 +55,12 @@ def parse_arguments():
        default="",
        help="Directory for results. If it exists, the game resumes. If not, it's created. Defaults to a new timestamped directory.",
    )
    parser.add_argument(
        "--output",            # alias for back compatibility
        dest="run_dir",        # write to the same variable as --run_dir
        type=str,
        help=argparse.SUPPRESS # hides it from `--help`
    )
    parser.add_argument(
        "--critical_state_analysis_dir",
        type=str,
@ -105,12 +111,24 @@ def parse_arguments():
        default=16000,
        help="Maximum number of new tokens to generate per LLM call (default: 16000)."
    )
    parser.add_argument(
        "--seed_base",
        type=int,
        default=42,
        help="RNG seed placeholder for compatibility with experiment_runner. Currently unused."
    )
    parser.add_argument(
        "--max_tokens_per_model",
        type=str,
        default="",
        help="Comma-separated list of 7 token limits (in order: AUSTRIA, ENGLAND, FRANCE, GERMANY, ITALY, RUSSIA, TURKEY). Overrides --max_tokens."
    )
    parser.add_argument(
        "--prompts_dir",
        type=str,
        default=None,
        help="Path to the directory containing prompt files. Defaults to the packaged prompts directory."
    )
    return parser.parse_args()
@ -179,8 +197,17 @@ async def main():
    if is_resuming:
        try:
            # When resuming, we load the state and also the config from the last saved phase.
-            # We will continue to use THAT config, ignoring the current script's args.
+            game, agents, game_history, loaded_run_config = load_game_state(run_dir, game_file_name, args.resume_from_phase)
-            game, agents, game_history = load_game_state(run_dir, game_file_name, args.resume_from_phase)
+            if loaded_run_config:
                # Use the saved config, but allow current CLI args to override control-flow parameters
                run_config = loaded_run_config
                run_config.run_dir = args.run_dir
                run_config.critical_state_analysis_dir = args.critical_state_analysis_dir
                run_config.resume_from_phase = args.resume_from_phase
                run_config.end_at_phase = args.end_at_phase
                # If prompts_dir is specified now, it overrides the saved one.
                if args.prompts_dir is not None:
                    run_config.prompts_dir = args.prompts_dir
            logger.info(f"Successfully resumed game from phase: {game.get_current_phase()}.")
        except (FileNotFoundError, ValueError) as e:
            logger.error(f"Could not resume game: {e}. Starting a new game instead.")
@ -204,7 +231,7 @@ async def main():
        if year_int > run_config.max_year:
            logger.info(f"Reached max year {run_config.max_year}, stopping simulation.")
            break
-        if run_config.end_at_phase and game.phases and game.phases[-1].name == run_config.end_at_phase:
+        if run_config.end_at_phase and current_phase == run_config.end_at_phase:
            logger.info(f"Reached end phase {run_config.end_at_phase}, stopping simulation.")
            break
@ -289,7 +316,7 @@ async def main():
        # Diary Consolidation
        if current_short_phase.startswith("S") and current_short_phase.endswith("M"):
            consolidation_tasks = [
-                run_diary_consolidation(agent, game, llm_log_file_path)
+                run_diary_consolidation(agent, game, llm_log_file_path, prompts_dir=run_config.prompts_dir)
                for agent in agents.values() if not game.powers[agent.power_name].is_eliminated()
            ]
            if consolidation_tasks:
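Review note on the resume logic in `lm_game.py` above: the saved run config is reused, but control-flow fields come from the current command line, and `--prompts_dir` overrides the saved value only when it is given. A minimal sketch of that precedence rule follows, assuming both configs are simple attribute containers. `merge_resume_config` and `CONTROL_FLOW_FIELDS` are hypothetical names, and unlike the patch (which mutates the loaded config in place) the sketch returns a copy.

```python
from types import SimpleNamespace

# Fields the patch always takes from the current CLI when resuming.
CONTROL_FLOW_FIELDS = (
    "run_dir",
    "critical_state_analysis_dir",
    "resume_from_phase",
    "end_at_phase",
)


def merge_resume_config(saved: SimpleNamespace, cli: SimpleNamespace) -> SimpleNamespace:
    """Start from the saved config, then apply the CLI overrides."""
    merged = SimpleNamespace(**vars(saved))  # copy so `saved` stays untouched
    for field in CONTROL_FLOW_FIELDS:
        setattr(merged, field, getattr(cli, field))
    if cli.prompts_dir is not None:  # an explicit flag wins over the saved value
        merged.prompts_dir = cli.prompts_dir
    return merged


if __name__ == "__main__":
    saved = SimpleNamespace(
        run_dir="results/game_run_001", critical_state_analysis_dir="",
        resume_from_phase="", end_at_phase="", prompts_dir="prompts/saved", max_year=1910,
    )
    cli = SimpleNamespace(
        run_dir="results/game_run_001", critical_state_analysis_dir="",
        resume_from_phase="S1902M", end_at_phase="F1905M", prompts_dir=None, max_year=1901,
    )
    merged = merge_resume_config(saved, cli)
    print(merged.resume_from_phase, merged.prompts_dir, merged.max_year)
    # S1902M prompts/saved 1910
```

Settings outside the control-flow list, such as `max_year` here, keep their saved values, so changing them on the command line has no effect on a resumed run.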
--- a/requirements.txt
+++ b/requirements.txt
@ -17,3 +17,5 @@ anthropic
 google-genai
 json-repair
 together
 matplotlib
 seaborn
--- a/run.sh
+++ b/run.sh
@ -2,6 +2,7 @@
 python3 lm_game.py \
    --max_year 1901 \
-    --num_negotiation_rounds 0 \
+    --num_negotiation_rounds 1 \
    --models "openrouter-google/gemini-2.5-flash-lite-preview-06-17, openrouter-google/gemini-2.5-flash-lite-preview-06-17, openrouter-google/gemini-2.5-flash-lite-preview-06-17, openrouter-google/gemini-2.5-flash-lite-preview-06-17, openrouter-google/gemini-2.5-flash-lite-preview-06-17, openrouter-google/gemini-2.5-flash-lite-preview-06-17, openrouter-google/gemini-2.5-flash-lite-preview-06-17" \
-    --max_tokens_per_model 16000,16000,16000,16000,16000,16000,16000
+    --max_tokens_per_model 16000,16000,16000,16000,16000,16000,16000 \
    --prompts_dir "ai_diplomacy/prompts"