Python SDK

The Tinfer Python SDK lets you manage local tinfer-server processes and make API calls programmatically.

pip install tinfer-ai

Quick Example

from tinfer import Server, chat

# Start server and chat in one block
with Server("model.gguf", port=8080, n_gpu_layers=-1) as s:
    response = chat("What is artificial intelligence?")
    print(response)
# Server automatically stops when the block exits

Server Class

The Server class manages the tinfer-server process lifecycle.

Constructor

from tinfer import Server

server = Server(
    model_path="path/to/model.gguf",  # Required: path to GGUF file
    port=8080,                         # Server port (default: 8080)
    host="127.0.0.1",                  # Bind address (default: 127.0.0.1)
    n_gpu_layers=None,                 # GPU layers (-1 = all, None = auto)
    ctx_size=None,                     # Context window size
    n_parallel=None,                   # Parallel request slots
    extra_args=None                    # Additional CLI flags as list
)
| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `model_path` | str | Path to GGUF model file | required |
| `port` | int | Port for the server | 8080 |
| `host` | str | Host/IP to bind to | 127.0.0.1 |
| `n_gpu_layers` | int | Layers to offload to GPU (-1 = all) | None (auto) |
| `ctx_size` | int | Context window size | None (model default) |
| `n_parallel` | int | Number of parallel slots | None (auto) |
| `extra_args` | list | Additional command-line flags | [] |

Methods

start(timeout=30)

Start the server and wait until it's healthy.

server = Server("model.gguf", port=8080)
server.start(timeout=30)  # Waits up to 30 seconds
print(f"Server ready at {server.base_url}")

Warning

Raises RuntimeError if the server fails to start or doesn't become healthy within the timeout.
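Internally, start() presumably polls the server until it reports healthy or the timeout elapses. A minimal sketch of that retry loop, with hypothetical names (this is not the SDK's actual implementation; `probe` stands in for an HTTP check against the server's health endpoint):

```python
import time

def wait_until_healthy(probe, timeout=30, interval=0.25):
    """Poll `probe()` until it returns True or `timeout` seconds pass.

    `probe` is any zero-argument callable, e.g. an HTTP GET against
    the server that returns True once it responds with 200.
    Raises RuntimeError on timeout, mirroring Server.start().
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    raise RuntimeError(f"server not healthy within {timeout}s")
```

A bounded poll like this is why a slow-loading model needs a larger `timeout` value rather than a retry loop of your own.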

stop()

Stop the server process.

server.stop()

is_running()

Check if the server is running and healthy.

if server.is_running():
    print("Server is ready")

Context Manager

The recommended way to use the server is as a context manager: it auto-starts on entry and auto-stops on exit:

from tinfer import Server, chat

with Server("model.gguf", port=8080) as s:
    print(s.base_url)  # http://127.0.0.1:8080
    response = chat("Hello!")
    print(response)
# Server automatically stopped here
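This behavior follows Python's standard context-manager protocol: `__enter__` starts the process and `__exit__` stops it even if the block raises. A stripped-down sketch of the pattern (the class here is illustrative, not the SDK's source):

```python
class ManagedServer:
    """Minimal sketch of the context-manager pattern Server uses."""

    def __init__(self):
        self.running = False

    def start(self):
        self.running = True   # real code would spawn tinfer-server here

    def stop(self):
        self.running = False  # real code would terminate the process

    def __enter__(self):
        self.start()
        return self           # `as s` binds to the started server

    def __exit__(self, exc_type, exc, tb):
        self.stop()           # runs even when the block raised
        return False          # never swallow exceptions from the block
```

Because `__exit__` always runs, the server process is cleaned up even when your code inside the with-block throws.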

Extra Args Example

Pass any additional tinfer-server flags:

server = Server(
    "model.gguf",
    port=8080,
    extra_args=["--flash-attn", "on", "-c", "8192", "--api-key", "my-key"]
)
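Under the hood, `extra_args` is presumably appended to the launch command after the flags the constructor sets itself. A hypothetical sketch of that assembly (the `--host` flag name is an assumption; only `-m` and `--port` appear in this document):

```python
def build_command(model_path, port=8080, host="127.0.0.1", extra_args=None):
    """Hypothetical sketch of assembling the tinfer-server launch
    command; the real SDK may name or order flags differently."""
    cmd = ["tinfer-server", "-m", model_path,
           "--host", host, "--port", str(port)]
    cmd += list(extra_args or [])  # user flags go last, so they can win
    return cmd
```

Note that each flag and its value are separate list items, which is why the example above passes `["--flash-attn", "on", ...]` rather than `["--flash-attn on", ...]`.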

Client Functions

set_server(url)

Set the default server URL for all client functions.

from tinfer import set_server

set_server("http://192.168.1.100:8080")  # Remote server
set_server("http://localhost:8080")       # Local (default)

chat()

Send a chat completion request. Returns the assistant's response as a string.

from tinfer import chat

# Simple string prompt
response = chat("What is the capital of France?")
print(response)

# Full message format
response = chat([
    {"role": "system", "content": "You are a pirate."},
    {"role": "user", "content": "What's the weather?"}
])

# With parameters
response = chat(
    messages="Explain quantum computing",
    max_tokens=200,
    temperature=0.5,
    top_p=0.9
)
| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `messages` | str or list | Message string or list of {role, content} dicts | |
| `prompt` | str | Shorthand for a single user message | |
| `model` | str | Model name (optional) | |
| `max_tokens` | int | Max tokens to generate | 512 |
| `temperature` | float | Sampling temperature | 0.7 |
| `top_p` | float | Nucleus sampling | 0.95 |
| `stream` | bool | Stream the response | False |
| `base_url` | str | Override server URL | |

Tip

Either messages or prompt must be provided. If messages is a string, it's automatically wrapped as [{"role": "user", "content": "..."}].
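The normalization described in the tip can be sketched as a standalone function (the name and exact logic are illustrative, based only on the behavior documented here):

```python
def normalize_messages(messages=None, prompt=None):
    """Wrap a bare string (or `prompt`) into a chat message list,
    mirroring the wrapping behavior described above."""
    if messages is None and prompt is None:
        raise ValueError("either messages or prompt must be provided")
    if messages is None:
        messages = prompt  # `prompt` is shorthand for one user message
    if isinstance(messages, str):
        return [{"role": "user", "content": messages}]
    return list(messages)  # already a list of {role, content} dicts
```

This is also why the error table below lists ValueError for a chat() call that supplies neither argument.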


complete()

Send a text completion request (non-chat format).

from tinfer import complete

text = complete(
    prompt="The capital of France is",
    max_tokens=50,
    temperature=0.3
)
print(text)
| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `prompt` | str | Text to complete | required |
| `model` | str | Model name | |
| `max_tokens` | int | Max tokens to generate | 512 |
| `temperature` | float | Sampling temperature | 0.7 |
| `top_p` | float | Nucleus sampling | 0.95 |
| `base_url` | str | Override server URL | |

models()

List models loaded on the server.

from tinfer import models

for m in models():
    print(m["id"])

Returns a list of model info dictionaries from the /v1/models endpoint.
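Assuming the raw /v1/models endpoint follows the common OpenAI-style listing shape (a `data` array of model objects, with the SDK unwrapping it for you), extracting ids from such a payload looks like this. The sample payload is illustrative, not taken from a real server:

```python
def model_ids(payload):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in payload.get("data", [])]

# Illustrative response shape, assuming OpenAI-compatible output
sample = {
    "object": "list",
    "data": [{"id": "model.gguf", "object": "model"}],
}
```

With the SDK, you do not deal with this wrapper: `models()` already returns the inner list, which is why the loop above indexes `m["id"]` directly.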


Error Handling

All client functions raise clear errors:

from tinfer import chat

try:
    response = chat("Hello")
except ConnectionError:
    print("Server is not running! Start it with:")
    print("  tinfer-server -m model.gguf --port 8080")
except RuntimeError as e:
    print(f"Server error: {e}")
| Exception | When |
| --- | --- |
| ConnectionError | Server is not running or unreachable |
| RuntimeError | Server returned an HTTP error |
| FileNotFoundError | Model file not found (Server.start) |
| ValueError | Neither messages nor prompt provided (chat) |

Complete Example

from tinfer import Server, chat, complete, models

# Start server
with Server(
    r"C:\models\Llama-3.2-3B-Instruct-Q4_K_M.gguf",
    port=8080,
    n_gpu_layers=-1
) as s:
    # List models
    print("Loaded models:", [m["id"] for m in models()])

    # Chat
    print(chat("What is Python?"))

    # Chat with system prompt
    print(chat([
        {"role": "system", "content": "Answer in exactly 10 words."},
        {"role": "user", "content": "What is machine learning?"}
    ]))

    # Complete
    print(complete("def fibonacci(n):", max_tokens=100))