Speculative Decoding¶
Speculative decoding dramatically speeds up text generation by predicting multiple tokens ahead and verifying them in a single batch. The output is identical to running the model normally; it is simply produced faster.
Why is it faster?
Normally, the model generates tokens one by one, and each step is bottlenecked by loading the model weights from memory rather than by arithmetic. Speculative decoding drafts multiple tokens and verifies them in a single batched forward pass, so the weights are loaded once for n tokens instead of once per token.
Two Approaches¶
Tinfer supports speculative decoding with or without a draft model, and the two approaches can be combined.
1. Draft Model (Most Common)¶
A smaller, faster model from the same family generates draft tokens, which are then verified by the main model:
```shell
tinfer \
  -m models/Llama-3.2-8B-Q4_K_M.gguf \
  -md models/Llama-3.2-1B-Q4_K_M.gguf \
  -p "Explain quantum computing" -n 200 -c 2048
```
Model compatibility
Both models must share the same tokenizer (same model family). Using models from different families will produce incorrect output.
Recommended model pairs:
| Main Model (Large) | Draft Model (Small) | Expected Speedup |
|---|---|---|
| Llama 3.2 8B | Llama 3.2 1B | ~2-3x |
| Llama 3.1 70B | Llama 3.1 8B | ~2-4x |
| Qwen 2.5 14B | Qwen 2.5 3B | ~2-3x |
| Mistral 7B | Mistral 0.5B | ~2-3x |
2. Draftless (No Extra Model Needed)¶
These methods use n-gram pattern matching from the generated text itself — no second model required. Particularly effective for code refactoring, summarization, and reasoning models.
ngram-simple¶
Searches the generation history for the most recent matching n-gram and proposes the tokens that followed it as a draft. This is the simplest approach, with minimal overhead.
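A minimal invocation might look like the following. The flag combination is illustrative, assembled from the `--spec-type` and `--draft-max` flags documented elsewhere on this page:

```shell
# Draftless speculation: no -md flag, drafts come from the generation history
tinfer -m models/Llama-3.2-8B-Q4_K_M.gguf \
  --spec-type ngram-simple \
  --draft-max 16 \
  -p "Refactor this function to use iterators" -n 200 -c 2048
```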
ngram-map-k¶
Maintains a hash map of n-grams in the current context. It is more accurate than ngram-simple but requires a minimum number of occurrences before it starts drafting.
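For illustration, a hypothetical invocation combining flags from the parameter tables on this page (the specific values are assumptions, not recommendations):

```shell
# Only draft from n-grams seen at least twice in the current context
tinfer -m models/Llama-3.2-8B-Q4_K_M.gguf \
  --spec-type ngram-map-k \
  --spec-ngram-min-hits 2 \
  -p "Summarize the following report" -n 200 -c 2048
```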
ngram-mod¶
Lightweight (~16 MB) hash-based approach with constant memory. The hash pool is shared across all server slots, so different requests benefit from each other:
```shell
tinfer-server -m model.gguf \
  --spec-type ngram-mod \
  --spec-ngram-size-n 24 \
  --draft-min 48 --draft-max 64
```
Best use cases for draftless methods
- Code refactoring — iterating over existing code blocks
- Reasoning models — when they repeat thinking in the final answer
- Summarization — text that overlaps heavily with the input
Combining Both Methods¶
You can use a draft model together with a draftless method. When both are configured, the draftless method takes precedence:
```shell
tinfer-server \
  -m models/large-model.gguf \
  -md models/small-model.gguf \
  --spec-type ngram-simple \
  --draft-max 64 --port 8080 -c 2048
```
Tuning Parameters¶
Draft Token Count¶
```shell
tinfer -m large.gguf -md small.gguf \
  --draft-max 16 \
  --draft-min 2 \
  --draft-p-min 0.75 \
  -p "Hello!" -n 200 -c 2048
```
| Flag | Default | Description |
|---|---|---|
| `--draft-max` | 16 | Max tokens to draft per iteration |
| `--draft-min` | 0 | Min tokens before verification |
| `--draft-p-min` | 0.75 | Min probability for draft acceptance (greedy) |
n-gram Parameters (Draftless)¶
| Flag | Default | Description |
|---|---|---|
| `--spec-ngram-size-n` | 12 | Size of the lookup n-gram (key length) |
| `--spec-ngram-size-m` | 48 | Size of the draft m-gram (draft length) |
| `--spec-ngram-min-hits` | 1 | Min occurrences before using an n-gram as draft |
GPU Layer Control¶
| Flag | Description |
|---|---|
| `-ngl` | GPU layers for the main model |
| `-ngld` | GPU layers for the draft model |
| `-cd` | Context size for draft model (0 = same as main) |
| `-devd` | Device for draft model offloading |
Low VRAM strategy
Keep the draft model on the CPU (`-ngld 0`) while the main model runs on the GPU (`-ngl 99`). The draft model is small enough to run quickly on the CPU.
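As a sketch, a low-VRAM invocation combining these flags might look like this (the model filenames are placeholders):

```shell
# Main model fully offloaded to GPU, draft model kept on CPU
tinfer \
  -m models/Llama-3.1-70B-Q4_K_M.gguf \
  -md models/Llama-3.1-8B-Q4_K_M.gguf \
  -ngl 99 -ngld 0 \
  -p "Explain quantum computing" -n 200 -c 2048
```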
Reading Statistics¶
Run with --verbose to see acceptance statistics:
```
draft acceptance rate = 0.57576 (171 accepted / 297 generated)
statistics draft: #calls = 10, #gen drafts = 10, #acc drafts = 10,
#gen tokens = 110, #acc tokens = 98
```
| Metric | Meaning |
|---|---|
| `acceptance rate` | Fraction of draft tokens accepted by the main model |
| `#gen tokens` | Total tokens generated by the draft (including rejected) |
| `#acc tokens` | Tokens accepted by the main model |
Target acceptance rate
Aim for > 50% acceptance. If it is lower, try a better-matched draft model or reduce `--draft-max`.
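For example, a tuning iteration for a low acceptance rate might look like this (the values are illustrative starting points, not recommendations):

```shell
# Shorter drafts and a stricter threshold waste fewer rejected tokens
tinfer -m large.gguf -md small.gguf \
  --draft-max 8 --draft-p-min 0.8 \
  --verbose -p "Hello!" -n 200 -c 2048
```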
Spec Type Reference¶
| Type | Draft Model? | Description |
|---|---|---|
| `none` | — | No speculative decoding (default) |
| `draft` | ✅ | Use a separate draft model |
| `ngram-cache` | ❌ | N-gram cache lookup |
| `ngram-simple` | ❌ | Simple n-gram pattern matching |
| `ngram-map-k` | ❌ | N-gram pattern matching with hash-map keys |
| `ngram-map-k4v` | ❌ | N-gram with up to 4 tracked values (experimental) |
| `ngram-mod` | ❌ | Hash-based n-gram with shared pool across slots |