# Tinfer — Tiny Inference Engine

**Run LLMs locally with GPU acceleration**
Tinfer is a high-performance local inference engine built on top of llama.cpp. Run large language models on your own hardware — no cloud, no API keys, no C++ build tools.
- **⚡ CLI Chat**: Run any GGUF model from your terminal with a single command.
- **🖥️ Server + WebUI**: HTTP server with a built-in chat interface at localhost:8080.
- **🔌 OpenAI-Compatible API**: Drop-in replacement for OpenAI's /v1/chat/completions endpoint.
- **👁️ Vision & OCR**: Image understanding, visual QA, and OCR with multimodal models.
- **🔍 Embedding & Reranking**: Generate text embeddings and rerank documents for semantic search.
- **🎯 LoRA Fine-Tuning**: Run fine-tuned models with LoRA adapters — hot-swap at runtime.
- **📦 Layer Offloading**: Run models larger than VRAM — dynamic Disk → CPU → GPU layer swapping.
- **🧩 PagedAttention**: Zero-fragmentation KV cache with O(1) context shifting and Copy-on-Write.
- **♻️ KV Cache Eviction**: Infinite-length generation — smart eviction keeps critical tokens.
- **🔄 Model Conversion**: Convert HuggingFace models & LoRA adapters to GGUF format.
- **📊 Quantization**: 30+ quantization types — shrink models up to 10x with minimal quality loss.
- **⏱️ Benchmarking**: Measure tokens/sec for prompt processing and text generation.
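The "up to 10x" quantization figure follows from bits-per-weight arithmetic. A back-of-envelope sketch (the bit widths below are typical of GGUF quant families, not Tinfer-specific numbers):

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate on-disk weight size: parameters x bits / 8, in gigabytes."""
    return params_billion * bits_per_weight / 8

fp16 = model_size_gb(3, 16.0)  # unquantized 3B model: 6.0 GB
q4   = model_size_gb(3, 4.5)   # typical 4-bit K-quant: ~1.7 GB (~3.5x smaller)
q1   = model_size_gb(3, 1.6)   # extreme ~1.6-bit quant: 0.6 GB (10x smaller)
print(f"{fp16:.1f} GB -> {q4:.1f} GB -> {q1:.1f} GB")
```

Real file sizes differ slightly because mixed-precision quants keep some tensors (embeddings, output layer) at higher precision.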
## Quick Start
```bash
# 1. Install
pip install tinfer-ai

# 2. Download a model
pip install huggingface-hub
python -c "from huggingface_hub import hf_hub_download; import os; os.makedirs('models', exist_ok=True); hf_hub_download(repo_id='bartowski/Llama-3.2-3B-Instruct-GGUF', filename='Llama-3.2-3B-Instruct-Q4_K_M.gguf', local_dir='./models')"

# 3. Run
tinfer -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -p "Hello, what is AI?"
```
## Three Ways to Use Tinfer

### 1. CLI — Direct Chat

### 2. Server — WebUI + API

### 3. Python — Programmatic Access
```python
from tinfer import Server, chat

# Start the server, ask a question, and shut down automatically on exit
with Server("model.gguf", port=8080) as s:
    response = chat("What is artificial intelligence?")
    print(response)
```
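Because the server speaks the OpenAI wire format, any HTTP client can talk to it. A minimal sketch that builds (but does not send) a `/v1/chat/completions` request with the standard library — the payload shape follows the OpenAI spec, and the URL assumes the default port shown above:

```python
import json
import urllib.request

def build_chat_request(prompt: str, base_url: str = "http://localhost:8080"):
    """Construct an OpenAI-style chat completion request for a local server."""
    body = json.dumps({
        # Local servers typically ignore the "model" field, so it is omitted here
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("What is artificial intelligence?")
# With the server running: urllib.request.urlopen(req) sends the request
```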
## Documentation
| Page | Description |
|---|---|
| Installation | Install Tinfer via pip |
| Model Download | Download GGUF models from HuggingFace |
| Model Conversion | Convert HuggingFace models & LoRA to GGUF |
| Quantization | Reduce model size with 30+ quantization types |
| Inference Types | Text, Vision, Embedding, Reranking, LoRA |
| CLI Reference | All CLI flags and options |
| Server Reference | Server flags, WebUI, and configuration |
| API Reference | OpenAI-compatible HTTP endpoints |
| Python SDK | Python client and server management |
| Benchmarking | Measure inference speed with tinfer-bench |
| Layer Offloading | Run models larger than VRAM |
| PagedAttention | Zero-fragmentation KV cache |
| KV Cache Eviction | Infinite-length generation |
| Speculative Decoding | Speed up generation with draft models |
| Troubleshooting | Common issues and fixes |