Speculative Decoding¶
Speculative decoding dramatically speeds up text generation by predicting multiple tokens ahead and verifying them in a single batch. The output is identical to running the model normally; it is simply produced faster.
Why is it faster?
Normally, the model generates tokens one by one, and each step is bottlenecked by loading the model weights from memory rather than by arithmetic. Speculative decoding drafts multiple tokens and verifies them in a single batched forward pass, so the weights are loaded once for n tokens instead of once per token.
Two Approaches¶
Tinfer supports speculative decoding with or without a draft model, and the two approaches can be combined.
1. Draft Model (Most Common)¶
A smaller, faster model from the same family generates draft tokens, which are then verified by the main model:
```shell
tinfer \
  -m models/Llama-3.2-8B-Q4_K_M.gguf \
  -md models/Llama-3.2-1B-Q4_K_M.gguf \
  -p "Explain quantum computing" -n 200 -c 2048
```
Model compatibility
Both models must share the same tokenizer (same model family). Using models from different families will produce incorrect output.
Recommended model pairs:
| Main Model (Large) | Draft Model (Small) | Expected Speedup |
|---|---|---|
| Llama 3.2 8B | Llama 3.2 1B | ~2-3x |
| Llama 3.1 70B | Llama 3.1 8B | ~2-4x |
| Qwen 2.5 14B | Qwen 2.5 3B | ~2-3x |
| Mistral 7B | Mistral 0.5B | ~2-3x |
2. Draftless (No Extra Model Needed)¶
These methods use n-gram pattern matching from the generated text itself — no second model required. Particularly effective for code refactoring, summarization, and reasoning models.
ngram-simple¶
Searches the generation history for the most recent matching n-gram and proposes the tokens that followed it as a draft. This is the simplest approach, with minimal overhead.
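A minimal invocation might look like the following. The flag combination is illustrative, assembled from the `--spec-type` and `--draft-max` flags documented elsewhere on this page:

```shell
# Draftless speculation: no -md flag, drafts come from the generation history
tinfer -m models/Llama-3.2-8B-Q4_K_M.gguf \
  --spec-type ngram-simple \
  --draft-max 16 \
  -p "Refactor this function to use iterators" -n 200 -c 2048
```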
ngram-map-k¶
Maintains a hash map of n-grams in the current context. It is more accurate than ngram-simple but requires a minimum number of occurrences before it starts drafting.
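For illustration, a hypothetical invocation combining flags from the parameter tables on this page (the specific values are assumptions, not recommendations):

```shell
# Only draft from n-grams seen at least twice in the current context
tinfer -m models/Llama-3.2-8B-Q4_K_M.gguf \
  --spec-type ngram-map-k \
  --spec-ngram-min-hits 2 \
  -p "Summarize the following report" -n 200 -c 2048
```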
ngram-mod¶
Lightweight (~16 MB) hash-based approach with constant memory. The hash pool is shared across all server slots, so different requests benefit from each other:
```shell
tinfer-server -m model.gguf \
  --spec-type ngram-mod \
  --spec-ngram-size-n 24 \
  --draft-min 48 --draft-max 64
```
Best use cases for draftless methods
- Code refactoring — iterating over existing code blocks
- Reasoning models — when they repeat thinking in the final answer
- Summarization — text that overlaps heavily with the input
Combining Both Methods¶
You can use a draft model together with a draftless method. When both are configured, the draftless method takes precedence:
```shell
tinfer-server \
  -m models/large-model.gguf \
  -md models/small-model.gguf \
  --spec-type ngram-simple \
  --draft-max 64 --port 8080 -c 2048
```
Tuning Parameters¶
Draft Token Count¶
```shell
tinfer -m large.gguf -md small.gguf \
  --draft-max 16 \
  --draft-min 2 \
  --draft-p-min 0.75 \
  -p "Hello!" -n 200 -c 2048
```
| Flag | Default | Description |
|---|---|---|
| `--draft-max` | 16 | Max tokens to draft per iteration |
| `--draft-min` | 0 | Min tokens before verification |
| `--draft-p-min` | 0.75 | Min probability for draft acceptance (greedy) |
n-gram Parameters (Draftless)¶
| Flag | Default | Description |
|---|---|---|
| `--spec-ngram-size-n` | 12 | Size of the lookup n-gram (key length) |
| `--spec-ngram-size-m` | 48 | Size of the draft m-gram (draft length) |
| `--spec-ngram-min-hits` | 1 | Min occurrences before using an n-gram as draft |
GPU Layer Control¶
| Flag | Description |
|---|---|
| `-ngl` | GPU layers for the main model |
| `-ngld` | GPU layers for the draft model |
| `-cd` | Context size for draft model (0 = same as main) |
| `-devd` | Device for draft model offloading |
Low VRAM strategy
Keep the draft model on the CPU (`-ngld 0`) while the main model runs on the GPU (`-ngl 99`). The draft model is small enough to run quickly on the CPU.
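As a sketch, a low-VRAM invocation combining these flags might look like this (the model filenames are placeholders):

```shell
# Main model fully offloaded to GPU, draft model kept on CPU
tinfer \
  -m models/Llama-3.1-70B-Q4_K_M.gguf \
  -md models/Llama-3.1-8B-Q4_K_M.gguf \
  -ngl 99 -ngld 0 \
  -p "Explain quantum computing" -n 200 -c 2048
```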
Reading Statistics¶
Run with --verbose to see acceptance statistics:
```
draft acceptance rate = 0.57576 (171 accepted / 297 generated)
statistics draft: #calls = 10, #gen drafts = 10, #acc drafts = 10,
#gen tokens = 110, #acc tokens = 98
```
| Metric | Meaning |
|---|---|
| `acceptance rate` | Fraction of draft tokens accepted by the main model |
| `#gen tokens` | Total tokens generated by the draft (including rejected) |
| `#acc tokens` | Tokens accepted by the main model |
Target acceptance rate
Aim for > 50% acceptance. If it is lower, try a better-matched draft model or reduce `--draft-max`.
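For example, a tuning iteration for a low acceptance rate might look like this (the values are illustrative starting points, not recommendations):

```shell
# Shorter drafts and a stricter threshold waste fewer rejected tokens
tinfer -m large.gguf -md small.gguf \
  --draft-max 8 --draft-p-min 0.8 \
  --verbose -p "Hello!" -n 200 -c 2048
```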
Spec Type Reference¶
| Type | Draft Model? | Description |
|---|---|---|
| `none` | — | No speculative decoding (default) |
| `draft` | ✅ | Use a separate draft model |
| `ngram-cache` | ❌ | N-gram cache lookup |
| `ngram-simple` | ❌ | Simple n-gram pattern matching |
| `ngram-map-k` | ❌ | N-gram pattern matching with hash-map keys |
| `ngram-map-k4v` | ❌ | N-gram with up to 4 tracked values (experimental) |
| `ngram-mod` | ❌ | Hash-based n-gram with shared pool across slots |