Tinfer — Tiny Inference Engine

Run LLMs locally with GPU acceleration

Tinfer is a high-performance local inference engine built on top of llama.cpp. Run large language models on your own hardware — no cloud, no API keys, no C++ build tools.

⚡ CLI Chat

Run any GGUF model from your terminal with a single command.

🖥️ Server + WebUI

HTTP server with a built-in chat interface at localhost:8080.

🔌 OpenAI-Compatible API

Drop-in replacement for OpenAI's /v1/chat/completions endpoint.

👁️ Vision & OCR

Image understanding, visual QA, and OCR with multimodal models.

🔍 Embedding & Reranking

Generate text embeddings and rerank documents for semantic search.

🎯 LoRA Fine-Tuning

Run fine-tuned models with LoRA adapters — hot-swap at runtime.

📦 Layer Offloading

Run models larger than VRAM — dynamic Disk → CPU → GPU layer swapping.

🧩 PagedAttention

Zero-fragmentation KV cache with O(1) context shifting and Copy-on-Write.
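To make the PagedAttention idea concrete, here is a minimal conceptual sketch of a block table with copy-on-write sharing. All names and structures are illustrative only, not Tinfer's internal API: sequences map to fixed-size physical blocks, forks share blocks in O(1), and a shared block is copied only when one sequence writes to it.

```python
# Conceptual sketch of a paged KV cache with copy-on-write block sharing.
# Illustrative only -- this is not Tinfer's actual implementation.

BLOCK_SIZE = 16  # tokens per physical block


class PagedKVCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # free physical block ids
        self.refcount = [0] * num_blocks      # sharers per physical block
        self.tables = {}                      # sequence id -> list of block ids

    def allocate(self, seq_id, num_tokens):
        """Map a sequence onto just enough blocks (no fragmentation)."""
        n = -(-num_tokens // BLOCK_SIZE)      # ceiling division
        blocks = [self.free.pop() for _ in range(n)]
        for b in blocks:
            self.refcount[b] = 1
        self.tables[seq_id] = blocks

    def fork(self, src, dst):
        """Fork a sequence by sharing its blocks (copy-on-write)."""
        self.tables[dst] = list(self.tables[src])
        for b in self.tables[dst]:
            self.refcount[b] += 1

    def write(self, seq_id, block_idx):
        """Before writing a shared block, give this sequence a private copy."""
        b = self.tables[seq_id][block_idx]
        if self.refcount[b] > 1:              # shared -> copy on write
            self.refcount[b] -= 1
            new_b = self.free.pop()
            self.refcount[new_b] = 1
            self.tables[seq_id][block_idx] = new_b
```

A fork shares every block of the parent, so branching a conversation costs no KV copies up front; only the block actually written to diverges.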

♻️ KV Cache Eviction

Infinite-length generation — smart eviction keeps critical tokens.

🔄 Model Conversion

Convert HuggingFace models & LoRA adapters to GGUF format.

📊 Quantization

30+ quantization types — shrink models up to 10x with minimal quality loss.

⏱️ Benchmarking

Measure tokens/sec for prompt processing and text generation.
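As a back-of-the-envelope illustration of the quantization savings mentioned above: a model's weight size is roughly its parameter count times the effective bits per weight. The figures below are rough estimates (the effective bit widths are approximate and ignore metadata and per-block scale overhead), not measured GGUF file sizes.

```python
# Rough size estimate: params * effective_bits_per_weight / 8.
# Bit widths are approximate averages, ignoring per-block overhead.

PARAMS = 3e9  # e.g. a 3B-parameter model


def approx_size_gb(bits_per_weight):
    return PARAMS * bits_per_weight / 8 / 1e9


for name, bits in [("F16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 2.6)]:
    print(f"{name:7s} ~{approx_size_gb(bits):4.1f} GB")
```

At 4-bit-class quantization a 3B model drops from roughly 6 GB to under 2 GB, which is what makes it practical on consumer GPUs.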


Quick Start

# 1. Install
pip install tinfer-ai

# 2. Download a model
pip install huggingface-hub
python -c "from huggingface_hub import hf_hub_download; import os; os.makedirs('models', exist_ok=True); hf_hub_download(repo_id='bartowski/Llama-3.2-3B-Instruct-GGUF', filename='Llama-3.2-3B-Instruct-Q4_K_M.gguf', local_dir='./models')"

# 3. Run
tinfer -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -p "Hello, what is AI?"

Try it instantly on Google Colab — no local setup needed!




Three Ways to Use Tinfer

1. CLI — Direct Chat

tinfer -m model.gguf -p "Explain quantum computing" -n 200

2. Server — WebUI + API

tinfer-server -m model.gguf --port 8080
# Open http://localhost:8080 for the chat UI
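With the server running, the OpenAI-compatible endpoint can be called from any HTTP client. A sketch using only the Python standard library (the payload follows the standard /v1/chat/completions schema; actually sending it requires the server started above):

```python
import json
from urllib import request

# A standard /v1/chat/completions request body.
payload = {
    "model": "model.gguf",
    "messages": [{"role": "user", "content": "What is artificial intelligence?"}],
    "max_tokens": 200,
}

req = request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running:
# with request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```

Because the endpoint is a drop-in replacement, existing OpenAI client libraries should also work by pointing their base URL at http://localhost:8080/v1.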

3. Python — Programmatic Access

from tinfer import Server, chat

with Server("model.gguf", port=8080) as s:
    response = chat("What is artificial intelligence?")
    print(response)

Documentation

Installation: Install Tinfer via pip
Model Download: Download GGUF models from HuggingFace
Model Conversion: Convert HuggingFace models & LoRA to GGUF
Quantization: Reduce model size with 30+ quantization types
Inference Types: Text, Vision, Embedding, Reranking, LoRA
CLI Reference: All CLI flags and options
Server Reference: Server flags, WebUI, and configuration
API Reference: OpenAI-compatible HTTP endpoints
Python SDK: Python client and server management
Benchmarking: Measure inference speed with tinfer-bench
Layer Offloading: Run models larger than VRAM
PagedAttention: Zero-fragmentation KV cache
KV Cache Eviction: Infinite-length generation
Speculative Decoding: Speed up generation with draft models
Troubleshooting: Common issues and fixes