Anyway provides an OpenAI-compatible API server endpoint via the `anyd` command. To deploy a model across multiple machines, run `anyd` on each machine and make sure all machines can reach one another on the network; Anyway handles the orchestration automatically. The command follows the structure below.
```
anyd --model=<path/to/model.gguf> --model-ctx=<N> --oapi=<OAPI_port> [<node_addr2> <node_addr3> ...]
```

For example, on a three-node cluster (192.168.0.0, 192.168.0.1, 192.168.0.2), each node lists the addresses of the other two:

```shell
# On 192.168.0.0:
anyd --model=gpt-oss-120b-Q4_K_M-00001-of-00002.gguf --model-ctx=4096 --oapi=8080 192.168.0.1 192.168.0.2
# On 192.168.0.1:
anyd --model=gpt-oss-120b-Q4_K_M-00001-of-00002.gguf --model-ctx=4096 --oapi=8080 192.168.0.0 192.168.0.2
# On 192.168.0.2:
anyd --model=gpt-oss-120b-Q4_K_M-00001-of-00002.gguf --model-ctx=4096 --oapi=8080 192.168.0.0 192.168.0.1
```
| Parameter | Description | Example |
|---|---|---|
| `--model=FILE` | Path to your GGUF model file. Anyway supports Hugging Face models in GGUF format at every quantization level (you can find a lot of them here). If the model is stored in multiple GGUF files, only provide the first one. | `gpt-oss-120b-Q4_K_M-00001-of-00002.gguf` or `gpt-oss-20b-Q4_K_M.gguf` |
| `--model-ctx=LENGTH` | Context window size, in tokens. Larger values allow longer conversations but require more memory. | `4096` |
| `--oapi=[IP4:]PORT` | Port of the OpenAI-compatible API endpoint. Your applications connect to this port for AI inference (default: `0.0.0.0:PORT`). | `8080` or `127.0.0.1:8080` |
| `--log[=FILE]` (optional) | Print the logs on stderr, or in FILE if specified. | `--log` or `--log=logs.txt` |
| `--peer=[IP4:]PORT` (optional) | Listen for other peers' connections on IP4:PORT (default: `0.0.0.0:13060`). | `--peer=13060` or `--peer=localhost:13060` |
| `--mem=SIZE` (optional) | Use at most SIZE bytes of VRAM (default: the full VRAM capacity). | `--mem=3G` or `--mem=400M` |
| `[node addresses]` | IP addresses of the other nodes in your cluster. List all other nodes, but exclude the current node's own address. | `192.168.0.1 192.168.0.2` |
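To give a feel for the `--mem=SIZE` values above, the helper below shows how suffixes like `3G` or `400M` are commonly interpreted. This is an illustration only: it assumes binary units, and Anyway's exact parsing rule is not specified here.

```python
# Illustrative only: common interpretation of SIZE suffixes such as "3G" or
# "400M" (binary units assumed; not Anyway's documented parsing rule).
_UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

def parse_size(size):
    """Convert a SIZE string such as '3G', '400M', or '1024' to bytes."""
    size = size.strip().upper()
    if size and size[-1] in _UNITS:
        return int(size[:-1]) * _UNITS[size[-1]]
    return int(size)  # no suffix: plain byte count
```

So `--mem=3G` would cap VRAM usage at 3 GiB and `--mem=400M` at 400 MiB.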
Anyway provides an OpenAI-compatible API that allows you to integrate with existing applications and tools designed for OpenAI's API. The following endpoints are supported:
| Endpoint | Description | Supported Parameters |
|---|---|---|
| `v1/models` | Lists available models deployed on your Anyway cluster. Returns information about the model currently running on your nodes. | All mandatory parameters |
| `v1/chat/completions` | Generates chat completions for conversational AI applications. Accepts a series of messages and returns the model's response in chat format. | All mandatory parameters + `max_completion_tokens` + `stream` |
| `v1/embeddings` | Generates vector embeddings for text input. Useful for semantic search, similarity comparison, and other NLP tasks that require dense vector representations. | All mandatory parameters |
For endpoint specifications, refer to the official OpenAI API documentation.
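As an illustration of `v1/embeddings`, the sketch below requests embeddings using the standard OpenAI request shape and compares two texts with cosine similarity. The base URL and model name are taken from the deployment example in this document; `embed` and `cosine` are illustrative helpers, not part of Anyway.

```python
import json
import math
import urllib.request

ANYWAY_URL = "http://192.168.0.0:8080/v1"  # node address from the example above

def embed(texts, model="gpt-oss-120b-Q4_K_M"):
    """POST to v1/embeddings and return one vector per input text."""
    req = urllib.request.Request(
        f"{ANYWAY_URL}/embeddings",
        data=json.dumps({"model": model, "input": texts}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))
```

With a cluster running, `cosine(*embed(["a query", "a document"]))` would give a similarity score between the two texts.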
Example using the official OpenAI Python client:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.0.0:8080/v1",
    api_key="not_needed"
)

response = client.chat.completions.create(
    model="gpt-oss-120b-Q4_K_M",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
    max_tokens=100
)

print(response.choices[0].message.content)
```
The equivalent request with curl:

```shell
curl -L http://192.168.0.0:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-120b-Q4_K_M",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100,
    "stream": true
  }'
```
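With `"stream": true`, the completion arrives as server-sent events: one `data: {...}` line per chunk, terminated by `data: [DONE]` (the standard OpenAI framing). A minimal stdlib-only sketch of consuming such a stream, assuming that framing and the node address from the example above:

```python
import json
import urllib.request

def iter_stream_text(lines):
    """Yield the text deltas from an OpenAI-style SSE response body."""
    for raw in lines:
        line = raw.decode() if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank separator lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0].get("delta", {})
        text = delta.get("content")
        if text:
            yield text

def stream_chat(prompt, base_url="http://192.168.0.0:8080/v1"):
    """POST a streaming chat completion and print the reply as it arrives."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps({
            "model": "gpt-oss-120b-Q4_K_M",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 100,
            "stream": True,
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for text in iter_stream_text(resp):
            print(text, end="", flush=True)
    print()
```

The OpenAI Python client does this parsing for you when you pass `stream=True` to `chat.completions.create`.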
Requests may be redirected to another node of the cluster. With curl, use the `-L` flag so redirects are followed; the OpenAI Python client handles redirects automatically.
Anyway follows the standard OpenAI API error response protocol, with the following additional information.
`finish_reason` types for `v1/chat/completions`:

- Indicates that the request should be retried. This typically occurs when a node crashes; Anyway transparently performs failover and load balancing to ensure that the next request succeeds.
- Indicates insufficient memory. This occurs when Anyway attempts to load the requested model, but the combined available memory across all nodes is insufficient to load the model and its required context simultaneously.
- A general system error occurred.
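Because a crashed node surfaces as a retryable finish reason rather than a hard failure, a client can simply reissue the request. A generic retry sketch; the `is_retryable` predicate is an assumption to be filled in with a check against the actual `finish_reason` value your deployment returns:

```python
import time

def with_retry(call, is_retryable, attempts=3, backoff=0.5):
    """Invoke `call`, reissuing it while `is_retryable(result)` is true.

    Waits with exponential backoff between attempts; returns the last result.
    """
    result = call()
    for attempt in range(1, attempts):
        if not is_retryable(result):
            break
        time.sleep(backoff * (2 ** (attempt - 1)))  # 0.5s, 1s, 2s, ...
        result = call()
    return result
```

For example, `call` could wrap `client.chat.completions.create(...)` and `is_retryable` could inspect `response.choices[0].finish_reason`.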