5.7 KiB
Server Handling
This module provides server abstraction layers for different LLM inference backends.
ManagedServer
For automatic token and logprob tracking, see the ManagedServer Guide.
Note: OpenAI endpoints do not support token IDs/logprobs required for ManagedServer. Set
ATROPOS_ALLOW_DUMMY_MANAGED_SERVER=1to use a placeholder implementation for testing/evaluation. See OpenAI Endpoint Limitations for details.
Normalized get_logprobs API
ManagedServer and supported server backends expose a normalized get_logprobs(...) interface so callers can consume a single schema:
prompt_tokensprompt_topk_token_idsprompt_topk_logprobs
Backends are expected to return real prompt top-k arrays ([pos][k]) matching this schema.
Tool Call Support
ManagedServer supports OpenAI-style tool calling via vLLM's tool parsers. Pass tool_parser at init:
server_manager = ServerManager(
configs=[APIServerConfig(...)],
tool_parser="hermes", # or llama3_json, mistral, deepseek_v3, qwen3_coder, etc.
)
async with server_manager.managed_server(tokenizer=tokenizer) as managed:
result = await managed.chat_completion(
messages=[{"role": "user", "content": "What's the weather?"}],
tools=[{
"type": "function",
"function": {"name": "get_weather", "parameters": {"type": "object", "properties": {"city": {"type": "string"}}}}
}],
tool_choice="auto", # "auto", "none", "required"
)
# Structured tool_calls in response
if result.choices[0].message.tool_calls:
print(result.choices[0].message.tool_calls)
# Nodes still have raw text with <tool_call> tags for training
nodes = managed.get_state()["nodes"]
Requires vllm installed. Without it, tool parsing is disabled with a warning — everything else still works.
OpenAI Proxy
Exposes ManagedServer as an OpenAI-compatible HTTP API for external tools (CLIs, GUIs, microservices).
Standalone
python -m atroposlib.envs.server_handling.managed_server_proxy \
--config servers.json --port 9100
servers.json:
{
"model_name": "Qwen/Qwen3-4B",
"servers": [
{"base_url": "http://gpu1:8000/v1", "server_type": "vllm"},
{"base_url": "http://gpu2:8000/v1", "server_type": "vllm"}
]
}
Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /sessions/create |
Create session. Optional base_url to pin to a server, tool_parser name. |
| POST | /{uuid}/v1/chat/completions |
OpenAI chat completions (with tools support). |
| POST | /{uuid}/v1/chat/completions/render |
Preview rendered prompt without generating. |
| GET | /{uuid}/nodes |
Get tracked tokens/logprobs/masks for training. |
| DELETE | /{uuid} |
Cleanup session. |
| GET | /sessions |
List active sessions. |
| GET | /servers |
List backend servers. |
| POST | /setup |
Push server config (used by ServerManager). |
| GET | /v1/models |
List models. |
| GET | /health |
Health check. |
Via ServerManager
server_manager = ServerManager(
configs=[APIServerConfig(...)],
proxy_url="http://localhost:9100", # auto-enables proxy mode
tool_parser="hermes",
)
# managed_server() now routes through the proxy
async with server_manager.managed_server(tokenizer=tokenizer) as managed:
result = await managed.chat_completion(messages=[...], tools=[...])
url = managed.get_url() # "http://localhost:9100/{uuid}/v1" — hand to external apps
nodes = await managed.fetch_state() # get tokens/logprobs
Or set ATROPOS_PROXY_URL=http://localhost:9100 env var instead of passing proxy_url.
Reasoning Model Support
The ReasoningConfig class enables support for reasoning/thinking models across different providers.
Provider Differences
| Feature | OpenAI | OpenRouter / Others |
|---|---|---|
| Format | {"reasoning_effort": "high"} |
{"reasoning": {"enabled": true, "effort": "high"}} |
| Effort Levels | none, minimal, low, medium, high, xhigh |
none, minimal, low, medium, high, xhigh |
| Max Tokens | Not supported | {"reasoning": {"max_tokens": 16000}} |
| Temperature | Must be 1.0 |
No restriction |
| Token Param | max_completion_tokens |
max_tokens |
Effort Level to Token Mapping
When providers don't support effort strings, effort levels map to approximate token budgets (based on 32k base):
| Effort | Tokens | Percentage |
|---|---|---|
| none | 1,024 | Minimum |
| minimal | 3,200 | ~10% |
| low | 6,400 | ~20% |
| medium | 16,000 | ~50% |
| high | 25,600 | ~80% |
| xhigh | 30,400 | ~95% |
Provider Token Limits
- OpenRouter: Caps Anthropic reasoning at 1,024-32,000 tokens (docs)
- Native Anthropic: Supports up to 128k extended thinking tokens
Usage
Reasoning is only injected for chat completions (not completions or logprobs API).
# Via environment config
config = BaseEnvConfig(
thinking_mode=True,
reasoning_effort="high",
max_reasoning_tokens=16000,
)
# Direct ReasoningConfig
reasoning_config = ReasoningConfig(
enabled=True,
effort="high",
max_tokens=16000,
)
Bypassing Reasoning Injection
Pass skip_reasoning=True to any chat completion call:
await server.chat_completion(messages=messages, skip_reasoning=True)
Important Constraints
- OpenRouter: Only accepts ONE of
effortormax_tokens, not both. When both specified, effort takes priority. - OpenAI: All effort levels are passed through directly.
- Auto-enable: Setting
effortormax_tokensautomatically enables reasoning mode.