
API Reference

When tinfer-server is running, it exposes OpenAI-compatible HTTP endpoints. All requests go to http://localhost:8080 by default.


Health Check

GET /health

Check if the server is ready.

curl http://localhost:8080/health
| Status | Response | Meaning |
|--------|----------|---------|
| 200 | {"status": "ok"} | Server is ready |
| 503 | {"error": {"code": 503, "message": "Loading model"}} | Model still loading |

Note

This endpoint is public — no API key required. Also available at /v1/health.


Chat Completions (OpenAI-compatible)

POST /v1/chat/completions

The primary endpoint for conversational AI. Fully compatible with OpenAI's API format.

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is artificial intelligence?"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'

Request body:

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| messages | array | Array of {role, content} message objects | required |
| model | string | Model name (optional, uses loaded model) | |
| max_tokens | int | Maximum tokens to generate | -1 (unlimited) |
| temperature | float | Randomness (0.0 = deterministic) | 0.8 |
| top_p | float | Nucleus sampling | 0.95 |
| top_k | int | Top-K sampling | 40 |
| min_p | float | Min-P sampling | 0.05 |
| stream | bool | Stream tokens as they generate | false |
| stop | array | Stop strings | [] |
| seed | int | RNG seed (-1 = random) | -1 |
| frequency_penalty | float | Frequency penalty | 0.0 |
| presence_penalty | float | Presence penalty | 0.0 |
| repeat_penalty | float | Repeat penalty | 1.0 |
| logit_bias | object | Token probability adjustments | |
| n_probs | int | Return top-N token probabilities | 0 |
| grammar | string | BNF grammar constraint | |
| json_schema | object | JSON schema constraint | |

Response:

{
  "id": "chatcmpl-xxxx",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "tinfer-server",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Artificial intelligence is..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 100,
    "total_tokens": 125
  }
}
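The fields a client usually reads from this response are choices[0].message.content, finish_reason, and usage. A minimal sketch in Python, parsing the sample body above:

```python
import json

# Sample response body from the chat completions endpoint, as shown above.
raw = '''{
  "id": "chatcmpl-xxxx",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "tinfer-server",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Artificial intelligence is..."},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 25, "completion_tokens": 100, "total_tokens": 125}
}'''

resp = json.loads(raw)
answer = resp["choices"][0]["message"]["content"]
reason = resp["choices"][0]["finish_reason"]
total = resp["usage"]["total_tokens"]
print(answer)  # Artificial intelligence is...
print(reason)  # stop
print(total)   # 125
```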

Text Completions (OpenAI-compatible)

POST /v1/completions

Raw text completion (non-chat format).

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "The capital of France is",
    "max_tokens": 50,
    "temperature": 0.5
  }'

Same parameters as chat completions, but uses prompt instead of messages.
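The difference between the two endpoints is only the input field. A sketch of the two request bodies side by side (parameter values are just examples):

```python
# Chat completions: input is a list of role/content messages.
chat_body = {
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 50,
    "temperature": 0.5,
}

# Text completions: input is a raw prompt string instead.
completion_body = {
    "prompt": "The capital of France is",
    "max_tokens": 50,
    "temperature": 0.5,
}

# The sampling parameters are shared; only the input field differs.
shared = sorted(set(chat_body) & set(completion_body))
print(shared)  # ['max_tokens', 'temperature']
```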


List Models

GET /v1/models

List all loaded models.

curl http://localhost:8080/v1/models

Response:

{
  "object": "list",
  "data": [
    {
      "id": "model-name",
      "object": "model",
      "owned_by": "tinfer"
    }
  ]
}

Completions (Non-OAI)

POST /completion

Native completion endpoint with more options than the OAI-compatible version.

curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Building a website in 10 steps:", "n_predict": 128}'

Additional parameters (beyond OAI-compatible ones):

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| prompt | string/array | Text prompt or token array | required |
| n_predict | int | Max tokens to predict | -1 |
| cache_prompt | bool | Reuse KV cache from previous request | true |
| stream | bool | Stream tokens in real-time | false |
| id_slot | int | Assign to specific slot (-1 = auto) | -1 |
| return_tokens | bool | Return raw token IDs | false |
| samplers | array | Order of samplers to apply | ["dry","top_k",...] |
| t_max_predict_ms | int | Time limit for generation (0 = disabled) | 0 |
| timings_per_token | bool | Include speed info per response | false |
| n_probs | int | Top-N probabilities per token | 0 |

Response fields:

| Field | Description |
|-------|-------------|
| content | Generated text |
| stop | Whether generation stopped |
| stop_type | Why it stopped: none, eos, limit, or word |
| stopping_word | The stop word that triggered stop |
| model | Model alias |
| tokens_evaluated | Prompt tokens processed |
| tokens_cached | Tokens reused from cache |
| truncated | Whether context was exceeded |
| timings | Speed statistics |
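When stream is true, tokens typically arrive as server-sent events, one data: {...} line per chunk, the convention OpenAI-compatible servers generally follow (an assumption here; verify the exact framing against your server). A sketch of a line-by-line parser:

```python
import json

def parse_sse_line(line: str):
    """Parse one server-sent-event line into a JSON chunk, or None.

    Assumes the common 'data: {...}' framing used by OpenAI-compatible
    streaming endpoints; a '[DONE]' payload (if sent) ends the stream.
    """
    line = line.strip()
    if not line.startswith("data: "):
        return None  # blank keep-alive lines, comments, etc.
    payload = line[len("data: "):]
    if payload == "[DONE]":
        return None
    return json.loads(payload)

# Example: feeding it a few hypothetical stream lines.
lines = [
    'data: {"content": "Hel", "stop": false}',
    '',
    'data: {"content": "lo", "stop": true}',
]
text = "".join(c["content"] for l in lines if (c := parse_sse_line(l)))
print(text)  # Hello
```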

Tokenize

POST /tokenize

Convert text to tokens.

curl http://localhost:8080/tokenize \
  -H "Content-Type: application/json" \
  -d '{"content": "Hello, world!"}'
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| content | string | Text to tokenize | required |
| add_special | bool | Insert BOS token | false |
| with_pieces | bool | Return token text pieces | false |
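The tokenize response carries the token IDs, which can be fed straight back to /detokenize. A sketch of the round trip, using a hypothetical response body (real token IDs depend on the model's vocabulary):

```python
import json

# Hypothetical /tokenize response; actual IDs are model-specific.
tokenize_response = json.loads('{"tokens": [9906, 11, 1917, 0]}')

# The token list plugs directly into a /detokenize request body.
detokenize_request = {"tokens": tokenize_response["tokens"]}
print(json.dumps(detokenize_request))  # {"tokens": [9906, 11, 1917, 0]}
```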

Detokenize

POST /detokenize

Convert tokens back to text.

curl http://localhost:8080/detokenize \
  -H "Content-Type: application/json" \
  -d '{"tokens": [123, 456, 789]}'

Embeddings

POST /v1/embeddings

Generate text embeddings. Requires the --embedding flag.

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello world", "model": "model-name"}'

The non-OAI endpoint /embedding is also available; it takes a content parameter and an optional embd_normalize parameter.
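Embeddings are typically compared with cosine similarity. A sketch using short stand-in vectors (real embeddings from the endpoint have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Stand-in vectors; in practice these come from the embeddings endpoint.
v1 = [0.1, 0.2, 0.3]
v2 = [0.1, 0.2, 0.3]
v3 = [0.3, -0.2, 0.1]

print(round(cosine_similarity(v1, v2), 3))  # 1.0 (identical direction)
```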


Reranking

POST /v1/rerank

Rerank documents by query relevance. Requires the --rerank flag and a reranker model.

curl http://localhost:8080/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is a panda?",
    "documents": ["hi", "it is a bear", "The giant panda is a bear species."],
    "top_n": 3
  }'
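Assuming the response follows the common rerank shape of {"results": [{"index": ..., "relevance_score": ...}]} (an assumption; verify against your server), the best-matching document can be recovered like this, with made-up scores for illustration:

```python
import json

documents = ["hi", "it is a bear", "The giant panda is a bear species."]

# Hypothetical response; the scores below are fabricated for the example.
resp = json.loads(
    '{"results": [{"index": 0, "relevance_score": 0.01},'
    ' {"index": 1, "relevance_score": 0.62},'
    ' {"index": 2, "relevance_score": 0.97}]}'
)

# Sort results by score and map indices back to the original documents.
ranked = sorted(resp["results"], key=lambda r: r["relevance_score"], reverse=True)
best = documents[ranked[0]["index"]]
print(best)  # The giant panda is a bear species.
```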

Server Properties

GET /props

Get server configuration and default generation settings.

curl http://localhost:8080/props

Returns model path, chat template, default sampling parameters, modalities, and slot count.


Apply Template

POST /apply-template

Convert chat messages to a prompt string using the model's template, without running inference.

curl http://localhost:8080/apply-template \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'

Code Infill

POST /infill

Fill-in-the-middle code completion (FIM).

curl http://localhost:8080/infill \
  -H "Content-Type: application/json" \
  -d '{
    "input_prefix": "def hello():\n    ",
    "input_suffix": "\n    return result"
  }'
| Parameter | Description |
|-----------|-------------|
| input_prefix | Code before the cursor |
| input_suffix | Code after the cursor |
| input_extra | Additional context: [{"filename": "...", "text": "..."}] |
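An infill request with extra file context might be assembled like this (the filename and snippet are purely illustrative):

```python
import json

body = {
    "input_prefix": "def hello():\n    ",
    "input_suffix": "\n    return result",
    # Optional extra context drawn from other files in the project.
    "input_extra": [
        {"filename": "utils.py", "text": "def greet(name):\n    return 'Hi ' + name\n"}
    ],
}

# The body serializes to JSON for the POST request.
print(json.dumps(body)[:30])
```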

Authentication

If --api-key is set, include the key in your requests:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-api-key" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'