# Tinfer — Tiny Inference Engine

**Run LLMs locally with GPU acceleration**
Tinfer is a high-performance local inference engine built on top of llama.cpp. Run large language models on your own hardware — no cloud, no API keys, no C++ build tools.
- **⚡ CLI Chat**: Run any GGUF model from your terminal with a single command.
- **🖥️ Server + WebUI**: HTTP server with a built-in chat interface at localhost:8080.
- **🔌 OpenAI-Compatible API**: Drop-in replacement for OpenAI's /v1/chat/completions endpoint.
- **👁️ Vision & OCR**: Image understanding, visual QA, and OCR with multimodal models.
- **🔍 Embedding & Reranking**: Generate text embeddings and rerank documents for semantic search.
- **🎯 LoRA Fine-Tuning**: Run fine-tuned models with LoRA adapters — hot-swap at runtime.
- **📦 Layer Offloading**: Run models larger than VRAM — dynamic Disk → CPU → GPU layer swapping.
- **🧩 PagedAttention**: Zero-fragmentation KV cache with O(1) context shifting and Copy-on-Write.
- **♻️ KV Cache Eviction**: Infinite-length generation — smart eviction keeps critical tokens.
- **🔄 Model Conversion**: Convert HuggingFace models & LoRA adapters to GGUF format.
- **📊 Quantization**: 30+ quantization types — shrink models up to 10x with minimal quality loss.
- **⏱️ Benchmarking**: Measure tokens/sec for prompt processing and text generation.
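The "up to 10x" quantization figure follows from bits-per-weight arithmetic. A back-of-envelope sketch (the bit widths below are typical of GGUF quant families, not Tinfer-specific numbers):

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate on-disk weight size: parameters x bits / 8, in gigabytes."""
    return params_billion * bits_per_weight / 8

fp16 = model_size_gb(3, 16.0)  # unquantized 3B model: 6.0 GB
q4   = model_size_gb(3, 4.5)   # typical 4-bit K-quant: ~1.7 GB (~3.5x smaller)
q1   = model_size_gb(3, 1.6)   # extreme ~1.6-bit quant: 0.6 GB (10x smaller)
print(f"{fp16:.1f} GB -> {q4:.1f} GB -> {q1:.1f} GB")
```

Real file sizes differ slightly because mixed-precision quants keep some tensors (embeddings, output layer) at higher precision.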
## Quick Start
```bash
# 1. Install
pip install tinfer-ai

# 2. Download a model
pip install huggingface-hub
python -c "from huggingface_hub import hf_hub_download; import os; os.makedirs('models', exist_ok=True); hf_hub_download(repo_id='bartowski/Llama-3.2-3B-Instruct-GGUF', filename='Llama-3.2-3B-Instruct-Q4_K_M.gguf', local_dir='./models')"

# 3. Run
tinfer -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -p "Hello, what is AI?"
```
## Three Ways to Use Tinfer

### 1. CLI — Direct Chat

### 2. Server — WebUI + API

### 3. Python — Programmatic Access
```python
from tinfer import Server, chat

# Start the server, ask a question, and shut down automatically on exit
with Server("model.gguf", port=8080) as s:
    response = chat("What is artificial intelligence?")
    print(response)
```
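Because the server speaks the OpenAI wire format, any HTTP client can talk to it. A minimal sketch that builds (but does not send) a `/v1/chat/completions` request with the standard library — the payload shape follows the OpenAI spec, and the URL assumes the default port shown above:

```python
import json
import urllib.request

def build_chat_request(prompt: str, base_url: str = "http://localhost:8080"):
    """Construct an OpenAI-style chat completion request for a local server."""
    body = json.dumps({
        # Local servers typically ignore the "model" field, so it is omitted here
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("What is artificial intelligence?")
# With the server running: urllib.request.urlopen(req) sends the request
```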
## Documentation
| Page | Description |
|---|---|
| Installation | Install Tinfer via pip |
| Model Download | Download GGUF models from HuggingFace |
| Model Conversion | Convert HuggingFace models & LoRA to GGUF |
| Quantization | Reduce model size with 30+ quantization types |
| Inference Types | Text, Vision, Embedding, Reranking, LoRA |
| CLI Reference | All CLI flags and options |
| Server Reference | Server flags, WebUI, and configuration |
| API Reference | OpenAI-compatible HTTP endpoints |
| Python SDK | Python client and server management |
| Benchmarking | Measure inference speed with tinfer-bench |
| Layer Offloading | Run models larger than VRAM |
| PagedAttention | Zero-fragmentation KV cache |
| KV Cache Eviction | Infinite-length generation |
| Speculative Decoding | Speed up generation with draft models |
| Troubleshooting | Common issues and fixes |