mirror of https://github.com/NousResearch/atropos.git synced 2026-04-19 12:57:58 +00:00

History

Jai Suphavadeeprasit 1b8ff075c4 adding tests		2026-03-13 17:23:59 -04:00
..
MANAGED_SERVER.md	fixing comments	2026-03-03 23:16:05 -05:00
managed_server.py	clean log	2026-03-13 12:12:33 -04:00
managed_server_proxy.py	add tool call parsing based on vllm impl and an openai server endpoint	2026-03-02 23:17:13 -06:00
openai_server.py	changes	2026-03-13 16:58:37 -04:00
proxy_client.py	add tool call parsing based on vllm impl and an openai server endpoint	2026-03-02 23:17:13 -06:00
README.md	Merge conflict commit	2026-03-09 23:13:43 -05:00
server_baseline.py	managed_Server pass through and centralize sem logic	2026-03-05 15:46:33 -05:00
server_harness.py	fix tests	2025-10-29 10:55:10 -05:00
server_manager.py	clean logging	2026-03-13 12:38:52 -04:00
sglang_server.py	add tokenizer name config to set the vllm/sglang tokenizer to something different if needed	2026-02-09 15:26:29 -06:00
tool_call_translator.py	add env using the tool api stuff	2026-03-03 19:51:30 -06:00
trl_vllm_server.py	Add reasoning configuration support across server implementations	2026-01-05 23:20:01 +00:00
vllm_server.py	adding tests	2026-03-13 17:23:59 -04:00

README.md

Server Handling

This module provides server abstraction layers for different LLM inference backends.

ManagedServer

For automatic token and logprob tracking, see the ManagedServer Guide.

Note: OpenAI endpoints do not support token IDs/logprobs required for ManagedServer. Set ATROPOS_ALLOW_DUMMY_MANAGED_SERVER=1 to use a placeholder implementation for testing/evaluation. See OpenAI Endpoint Limitations for details.

Normalized `get_logprobs` API

ManagedServer and supported server backends expose a normalized get_logprobs(...) interface so callers can consume a single schema:

prompt_tokens
prompt_topk_token_ids
prompt_topk_logprobs

Backends are expected to return real prompt top-k arrays ([pos][k]) matching this schema.

Tool Call Support

ManagedServer supports OpenAI-style tool calling via vLLM's tool parsers. Pass tool_parser at init:

server_manager = ServerManager(
    configs=[APIServerConfig(...)],
    tool_parser="hermes",  # or llama3_json, mistral, deepseek_v3, qwen3_coder, etc.
)

async with server_manager.managed_server(tokenizer=tokenizer) as managed:
    result = await managed.chat_completion(
        messages=[{"role": "user", "content": "What's the weather?"}],
        tools=[{
            "type": "function",
            "function": {"name": "get_weather", "parameters": {"type": "object", "properties": {"city": {"type": "string"}}}}
        }],
        tool_choice="auto",  # "auto", "none", "required"
    )

    # Structured tool_calls in response
    if result.choices[0].message.tool_calls:
        print(result.choices[0].message.tool_calls)

    # Nodes still have raw text with <tool_call> tags for training
    nodes = managed.get_state()["nodes"]

Requires vllm installed. Without it, tool parsing is disabled with a warning — everything else still works.

OpenAI Proxy

Exposes ManagedServer as an OpenAI-compatible HTTP API for external tools (CLIs, GUIs, microservices).

Standalone

python -m atroposlib.envs.server_handling.managed_server_proxy \
    --config servers.json --port 9100

servers.json:

{
    "model_name": "Qwen/Qwen3-4B",
    "servers": [
        {"base_url": "http://gpu1:8000/v1", "server_type": "vllm"},
        {"base_url": "http://gpu2:8000/v1", "server_type": "vllm"}
    ]
}

Endpoints

Method	Path	Description
POST	`/sessions/create`	Create session. Optional `base_url` to pin to a server, `tool_parser` name.
POST	`/{uuid}/v1/chat/completions`	OpenAI chat completions (with tools support).
POST	`/{uuid}/v1/chat/completions/render`	Preview rendered prompt without generating.
GET	`/{uuid}/nodes`	Get tracked tokens/logprobs/masks for training.
DELETE	`/{uuid}`	Cleanup session.
GET	`/sessions`	List active sessions.
GET	`/servers`	List backend servers.
POST	`/setup`	Push server config (used by ServerManager).
GET	`/v1/models`	List models.
GET	`/health`	Health check.

Via ServerManager

server_manager = ServerManager(
    configs=[APIServerConfig(...)],
    proxy_url="http://localhost:9100",  # auto-enables proxy mode
    tool_parser="hermes",
)

# managed_server() now routes through the proxy
async with server_manager.managed_server(tokenizer=tokenizer) as managed:
    result = await managed.chat_completion(messages=[...], tools=[...])
    url = managed.get_url()  # "http://localhost:9100/{uuid}/v1" — hand to external apps
    nodes = await managed.fetch_state()  # get tokens/logprobs

Or set ATROPOS_PROXY_URL=http://localhost:9100 env var instead of passing proxy_url.

Reasoning Model Support

The ReasoningConfig class enables support for reasoning/thinking models across different providers.

Provider Differences

Feature	OpenAI	OpenRouter / Others
Format	`{"reasoning_effort": "high"}`	`{"reasoning": {"enabled": true, "effort": "high"}}`
Effort Levels	`none`, `minimal`, `low`, `medium`, `high`, `xhigh`	`none`, `minimal`, `low`, `medium`, `high`, `xhigh`
Max Tokens	Not supported	`{"reasoning": {"max_tokens": 16000}}`
Temperature	Must be `1.0`	No restriction
Token Param	`max_completion_tokens`	`max_tokens`

Effort Level to Token Mapping

When providers don't support effort strings, effort levels map to approximate token budgets (based on 32k base):

Effort	Tokens	Percentage
none	1,024	Minimum
minimal	3,200	~10%
low	6,400	~20%
medium	16,000	~50%
high	25,600	~80%
xhigh	30,400	~95%

Provider Token Limits

OpenRouter: Caps Anthropic reasoning at 1,024-32,000 tokens (docs)
Native Anthropic: Supports up to 128k extended thinking tokens

Usage

Reasoning is only injected for chat completions (not completions or logprobs API).

# Via environment config
config = BaseEnvConfig(
    thinking_mode=True,
    reasoning_effort="high",
    max_reasoning_tokens=16000,
)

# Direct ReasoningConfig
reasoning_config = ReasoningConfig(
    enabled=True,
    effort="high",
    max_tokens=16000,
)

Bypassing Reasoning Injection

Pass skip_reasoning=True to any chat completion call:

await server.chat_completion(messages=messages, skip_reasoning=True)

Important Constraints

OpenRouter: Only accepts ONE of effort or max_tokens, not both. When both specified, effort takes priority.
OpenAI: All effort levels are passed through directly.
Auto-enable: Setting effort or max_tokens automatically enables reasoning mode.