atropos/SLURM.md
2026-01-28 16:48:39 +01:00


## Using `ServerManager` with Slurm
The `ServerManager` class in `atroposlib` provides built-in support for discovering and managing inference servers distributed across nodes allocated by Slurm. Here's how to use it:

**Core Concept:**

The setup assumes you have a Slurm job allocation where:

1. One or more nodes are designated for your main "training" or orchestrator process (the script that initializes `ServerManager`).
2. The remaining nodes in the allocation are dedicated to running the LLM inference servers (e.g., SGLang, TGI, vLLM, etc., accessible via an OpenAI-compatible API).

**How `ServerManager` Detects Servers:**

When you initialize `ServerManager` with `slurm=True`:

1. It reads the `SLURM_JOB_NODELIST` environment variable and expands it into the hostnames of all allocated nodes by running `scontrol show hostnames` internally.
2. It reads the `NUM_TRAINING_NODES` environment variable. This crucial variable tells the manager how many nodes *at the beginning* of the nodelist are *reserved for the training/orchestrator process* and must **not** be treated as inference server nodes.
3. It iterates through the hostnames *after* the first `NUM_TRAINING_NODES`. These are assumed to be the inference nodes.
4. For each inference node, it constructs potential server URLs. By default, it assumes:
    * Servers run on consecutive ports starting from `9000` (`9000`, `9001`, `9002`, ...).
    * The number of server instances per node is `8 // INFER_TP` (integer division), where `INFER_TP` is another environment variable that defaults to 1 if not set, which implies 8 servers per node. Set `INFER_TP` to match your inference server's tensor parallelism configuration if applicable.
    * The URL format is `http://{node_hostname}:{port}/v1`.
5. It uses the *first* configuration object you pass in the `configs` list as a template (for settings like `timeout`, `num_max_requests_at_once`, etc.) and creates a specific `APIServerConfig` object for each discovered URL.
6. The `ServerManager` then load-balances requests across these automatically configured `OpenAIServer` instances.
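
The URL construction in steps 2 to 4 can be sketched in a few lines. This is an illustration of the documented rules only, not the code in `server_manager.py`; `build_inference_urls` and its `base_port` and `gpus_per_node` parameters are hypothetical names, and the hostnames are made up.

```python
def build_inference_urls(hostnames, num_training_nodes, infer_tp=1,
                         base_port=9000, gpus_per_node=8):
    """Return one base URL per expected inference server instance."""
    servers_per_node = gpus_per_node // infer_tp  # the documented 8 // INFER_TP
    urls = []
    # Skip the leading trainer/orchestrator nodes; the rest serve inference.
    for host in hostnames[num_training_nodes:]:
        for offset in range(servers_per_node):
            urls.append(f"http://{host}:{base_port + offset}/v1")
    return urls


# Example: 3 allocated nodes, 1 trainer node, tensor parallelism of 4.
hostnames = ["node01", "node02", "node03"]  # as printed by `scontrol show hostnames`
urls = build_inference_urls(hostnames, num_training_nodes=1, infer_tp=4)
print(urls)
# ['http://node02:9000/v1', 'http://node02:9001/v1',
#  'http://node03:9000/v1', 'http://node03:9001/v1']
```

With `INFER_TP` unset (treated as 1), the same call would yield 8 URLs per inference node, on ports 9000 through 9007.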

**Setup Steps:**

1. **Launch Inference Servers:** In your Slurm submission script (`sbatch`), launch your inference server instances on the designated inference nodes.
    * Ensure they listen on the correct hostname and the expected ports (9000, 9001, ...).
    * The number of instances per node should match the `8 // INFER_TP` logic. Adjust the port range or `INFER_TP` environment variable accordingly if your setup differs.
    * You might use `srun` to launch these processes on specific nodes.
2. **Set Environment Variables:** In the part of your Slurm script that launches your *main application* (the one using `ServerManager`):
    * `export NUM_TRAINING_NODES=<number_of_non_inference_nodes>` (e.g., `export NUM_TRAINING_NODES=1` if only the first node runs the main script).
    * `export INFER_TP=<your_tensor_parallel_size>` (Optional, defaults to 1. Set this if your inference servers use tensor parallelism and you run fewer than 8 instances per node).
3. **Initialize `ServerManager`:** In your Python script:
```python
from atroposlib.envs.server_handling.server_manager import (
    APIServerConfig,
    ServerBaseline,
    ServerManager,
)

# Provide at least one config object. It will be used as a template
# for Slurm-discovered servers if slurm=True.
# If you pass ServerBaseline, ensure NUM_TRAINING_NODES and potentially INFER_TP are set.
# If you pass a list of APIServerConfig, the first one is used as the template.
base_config = ServerBaseline(
    timeout=1200,
    # other baseline settings...
)

# OR
# base_config = APIServerConfig(
#     base_url="http://dummy",  # This URL is ignored when slurm=True finds nodes
#     api_key="dummy",
#     timeout=1200,
#     # other config settings...
# )

server_manager = ServerManager(
    configs=base_config,  # Or [base_config] if using APIServerConfig
    slurm=True,
)

# Now use server_manager.chat_completion(...) or server_manager.completion(...)
```
4. **Submit Slurm Job:** Submit your job ensuring the necessary nodes and resources (like GPUs for inference) are requested.
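
Before constructing `ServerManager`, the variables from step 2 can be read and sanity-checked so that a missing export fails early with a clear message. The sketch below is a hypothetical safeguard: `read_slurm_layout` is not part of `atroposlib`, and how `ServerManager` itself reacts to a missing `NUM_TRAINING_NODES` is not covered here.

```python
import os


def read_slurm_layout(env=os.environ):
    """Return (num_training_nodes, infer_tp) from the environment, validated."""
    if "NUM_TRAINING_NODES" not in env:
        raise RuntimeError("NUM_TRAINING_NODES must be exported when using slurm=True")
    num_training_nodes = int(env["NUM_TRAINING_NODES"])
    infer_tp = int(env.get("INFER_TP", "1"))  # documented default is 1
    if num_training_nodes < 0:
        raise ValueError("NUM_TRAINING_NODES must not be negative")
    if not 1 <= infer_tp <= 8:
        # 8 // INFER_TP must give at least one server per node.
        raise ValueError("INFER_TP must be between 1 and 8")
    return num_training_nodes, infer_tp


# Example with an explicit mapping instead of the real environment.
print(read_slurm_layout({"NUM_TRAINING_NODES": "1", "INFER_TP": "2"}))
```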

**Example Conceptual Slurm Script:**

```bash
#!/bin/bash
#SBATCH --nodes=5 # 1 trainer node + 4 inference nodes
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8 # Assuming 8 GPUs/node for inference
#SBATCH --job-name=atropos-rl

# Get allocated node hostnames
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
training_node=${nodes_array[0]}
inference_nodes=("${nodes_array[@]:1}") # Array of nodes from index 1 onwards
echo "Training Node: $training_node"
echo "Inference Nodes: ${inference_nodes[@]}"

# --- Launch Inference Servers (Example using srun, adapt for your server type) ---
TP_SIZE=1 # Example: Tensor Parallelism = 1
INSTANCES_PER_NODE=$((8 / TP_SIZE))
echo "Launching $INSTANCES_PER_NODE inference servers per node..."
for node in ${inference_nodes[@]}; do
    for i in $(seq 0 $((INSTANCES_PER_NODE - 1))); do
        port=$((9000 + i))
        gpu_id=$i # Basic GPU assignment for TP_SIZE=1, needs refinement for larger TP
        echo "Starting server on $node:$port (GPU $gpu_id)"
        srun --nodes=1 --ntasks=1 --gpus-per-task=1 --gpu-bind=map_gpu:$gpu_id --nodelist=$node \
            your_inference_server_launch_cmd --host 0.0.0.0 --port $port --tp $TP_SIZE [other_args] &
    done
done
echo "Waiting for servers to start..."
sleep 60 # Simple wait, consider a more robust check

# --- Launch W&B Watcher on each Inference Node ---
echo "Launching W&B watchers..."
# Assume the main API server runs on the training_node at default port 8000
TRAINER_API_ADDR="http://${training_node}:8000"
inference_node_index=0 # Start index for node_num
for node in ${inference_nodes[@]}; do
    echo "Starting watcher on $node (Node Index $inference_node_index)"
    srun --nodes=1 --ntasks=1 --nodelist=$node \
        python atroposlib/cli/inference_node_wandb_watcher.py \
        --api_addr $TRAINER_API_ADDR \
        --tp $TP_SIZE \
        --node_num $inference_node_index &
    inference_node_index=$((inference_node_index + 1))
done

# --- Launch Main Application on the Training Node ---
export NUM_TRAINING_NODES=1
export INFER_TP=$TP_SIZE
echo "Starting main application on $training_node..."
srun --nodes=1 --ntasks=1 --nodelist=$training_node \
    python your_main_atropos_script.py --some_arg=value
echo "Main application finished."

# Wait for background processes launched with '&'. Servers and watchers that
# run indefinitely keep the job alive until they exit or the job is cancelled.
wait
echo "Job finished."
```
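
The fixed `sleep 60` in the script can be replaced by a readiness check. A minimal sketch, assuming a plain TCP connection is an acceptable first signal; `port_is_open` and `wait_for_servers` are hypothetical helpers, and an open port does not guarantee that the model has finished loading, so a request to the server's own health endpoint may still be needed.

```python
import socket
import time


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_servers(hosts, ports, deadline_s=600.0, poll_s=5.0):
    """Block until every host:port accepts connections; return False on timeout."""
    pending = {(host, port) for host in hosts for port in ports}
    end = time.monotonic() + deadline_s
    while pending:
        # Keep only the endpoints that still refuse connections.
        pending = {hp for hp in pending if not port_is_open(*hp)}
        if not pending:
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(poll_s)
    return True
```

One way to use it would be to call `wait_for_servers(inference_hostnames, range(9000, 9000 + 8 // tp))` at the start of the main application, before `ServerManager` is created.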
**Important Notes:**
* This setup relies on the `scontrol` command being available in the environment where `ServerManager` is initialized.
* Ensure network connectivity and firewall rules allow the training node(s) to reach the inference nodes on ports 9000+.
* The logic assumes a specific port assignment (9000+) and server count based on `INFER_TP`. If your inference server setup differs (e.g., different ports, different discovery mechanism), you would need to modify `server_manager.py` or manually provide the correct list of `APIServerConfig` objects instead of relying on `slurm=True`.
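
When the 9000+ convention does not fit, the configs can be built by hand. A minimal sketch, assuming a hypothetical `layout` mapping from hostname to the ports actually in use; `manual_server_kwargs` is an illustrative helper rather than part of `atroposlib`, and the keyword names mirror the `APIServerConfig` example above.

```python
def manual_server_kwargs(layout, timeout=1200, api_key="dummy"):
    """Return one dict of APIServerConfig keyword arguments per server."""
    return [
        {"base_url": f"http://{host}:{port}/v1", "api_key": api_key, "timeout": timeout}
        for host, ports in layout.items()
        for port in ports
    ]


# Hypothetical layout: two servers on node-a, one on node-b, on non-default ports.
layout = {"node-a": [8100, 8101], "node-b": [8100]}
server_kwargs = manual_server_kwargs(layout)

# With atroposlib installed, these could be passed on roughly as follows
# (assuming slurm=False disables the automatic discovery):
#   configs = [APIServerConfig(**kw) for kw in server_kwargs]
#   server_manager = ServerManager(configs=configs, slurm=False)
```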
## Monitoring Inference Nodes with Weights & Biases
Atropos includes a utility, `inference-node-wandb-watcher` (the script `atroposlib/cli/inference_node_wandb_watcher.py`), designed to run on each inference node alongside the inference servers.

**Purpose:**

* **Health Monitoring:** Periodically checks the `/health_generate` endpoint of each local inference server instance (assuming ports 9000+).
* **W&B Logging:** Logs the health status (1 for healthy, 0 for unhealthy) of each server instance to a shared Weights & Biases run group. This allows you to visualize server uptime and availability directly in your W&B dashboard alongside your training metrics.
* **Step Synchronization:** Fetches the current training step from the main Atropos API server (`run-api`) to ensure W&B logs are correctly associated with training progress.
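
The health check itself amounts to one HTTP request per local port. The sketch below is not the watcher's actual code: `server_health` and `poll_local_servers` are hypothetical names, and treating only HTTP 200 as healthy is an assumption.

```python
import http.client


def server_health(port, host="localhost", timeout=5.0):
    """Return 1 if GET /health_generate answers with HTTP 200, otherwise 0."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/health_generate")
        return 1 if conn.getresponse().status == 200 else 0
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed reply all count as unhealthy.
        return 0
    finally:
        conn.close()


def poll_local_servers(ports):
    """Map each local port to its current health value (1 or 0)."""
    return {port: server_health(port) for port in ports}
```

Calling `poll_local_servers(range(9000, 9008))` on an inference node would give one 0/1 value per expected server, ready to be logged under a per-server metric name.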

**Integration into Slurm Script:**

You can launch this watcher on each inference node using `srun`, similarly to how the inference servers are launched. The example Slurm script above already contains this section; it must come **after** launching the inference servers and **before** launching the main application:
```bash
# --- Launch W&B Watcher on each Inference Node ---
echo "Launching W&B watchers..."
# Assume the main API server runs on the training_node at default port 8000
TRAINER_API_ADDR="http://${training_node}:8000"
inference_node_index=0 # Start index for node_num
for node in ${inference_nodes[@]}; do
    echo "Starting watcher on $node (Node Index $inference_node_index)"
    srun --nodes=1 --ntasks=1 --nodelist=$node \
        python atroposlib/cli/inference_node_wandb_watcher.py \
        --api_addr $TRAINER_API_ADDR \
        --tp $TP_SIZE \
        --node_num $inference_node_index &
    inference_node_index=$((inference_node_index + 1))
done
```
**Explanation of Arguments:**
* `--api_addr`: This is the address of the main Atropos API server (usually started with `run-api`). The script needs this to fetch W&B project/group info and the current training step. In the example, we construct it assuming the API runs on the `training_node` (first node in the allocation) at port `8000` (the default for `run-api`). **Ensure this port is correct and accessible from the inference nodes.**
* `--tp`: This should be the same tensor parallelism size (`TP_SIZE`) used when launching the inference servers. It tells the watcher how many server instances (ports 9000 to 9000 + `8 // TP_SIZE` - 1) to monitor on the local node.
* `--node_num`: A unique integer identifying this specific inference node within the Slurm job. This helps distinguish the metrics from different nodes in W&B (e.g., `server/server_health_0_0`, `server/server_health_1_0`). The example script assigns sequential indices starting from 0.
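
To make the relationship between `--tp`, `--node_num`, the monitored ports, and the metric names concrete, here is a small sketch. `watched_ports` and `metric_name` are hypothetical helpers, and the assumption that the first index in the metric name is the node number and the second is the server index is inferred from the examples above.

```python
def watched_ports(tp, base_port=9000, gpus_per_node=8):
    """Ports the watcher is expected to monitor on one node for a given --tp."""
    return [base_port + i for i in range(gpus_per_node // tp)]


def metric_name(node_num, server_index):
    """W&B key for one server, assuming node index first, server index second."""
    return f"server/server_health_{node_num}_{server_index}"


# Example: --tp 2 on the inference node started with --node_num 1.
ports = watched_ports(tp=2)
names = [metric_name(1, i) for i in range(len(ports))]
print(ports)  # [9000, 9001, 9002, 9003]
print(names[0])  # server/server_health_1_0
```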

**Important Notes:**

* Ensure the `run-api` server is running and accessible from the inference nodes.
* The `inference-node-wandb-watcher` script should be executable and accessible from the inference nodes.
* The example assumes the default port for the `run-api` server (8000). If your setup uses a different port, update the port in the `TRAINER_API_ADDR` construction in your Slurm script.