Speculative Decoding

Speculative decoding dramatically speeds up text generation by predicting multiple tokens ahead and verifying them in a single batch. The output is mathematically identical to running the model normally — but significantly faster.

Why is it faster?

Normally, the model generates tokens one at a time, and each step is dominated by the cost of loading the model weights. Speculative decoding drafts several candidate tokens cheaply and verifies them all in a single forward pass — verifying n tokens together costs little more than generating one, so every accepted draft token is nearly free.
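The draft-and-verify loop can be sketched in a few lines. This is a toy model of the idea, not Tinfer's implementation; `draft_next` and `target_next` are hypothetical stand-ins for the two models, each returning the next token greedily:

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of draft-and-verify (greedy variant).

    In a real engine the target scores all k draft positions in a
    single batched pass; here it is called per position for clarity.
    """
    # 1. Draft k tokens autoregressively with the cheap model.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Verify: accept the longest prefix where the target agrees.
    accepted, ctx = [], list(prefix)
    for t in draft:
        if target_next(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)

    # 3. Always emit one token from the target itself, so the output
    #    is identical to running the target model alone.
    accepted.append(target_next(ctx))
    return accepted

# Toy models: the target cycles through a fixed sequence; the draft
# agrees on most tokens but diverges where the target would emit 3.
TARGET = [1, 2, 3, 4, 5, 6]
def target_next(ctx):
    return TARGET[len(ctx) % len(TARGET)]
def draft_next(ctx):
    t = TARGET[len(ctx) % len(TARGET)]
    return 99 if t == 3 else t

out = speculative_step([], draft_next, target_next, k=4)
print(out)  # tokens 1, 2 accepted; draft's 99 rejected; target emits 3
```

When the draft agrees often, most rounds emit several tokens for the price of roughly one target pass; when it never agrees, the loop degrades to ordinary one-token-at-a-time decoding.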


Two Approaches

Tinfer supports speculative decoding with or without a draft model. You can even combine the two.

1. Draft Model (Most Common)

A smaller, faster model from the same family generates draft tokens, which are then verified by the main model:

tinfer \
  -m models/Llama-3.2-8B-Q4_K_M.gguf \
  -md models/Llama-3.2-1B-Q4_K_M.gguf \
  -p "Explain quantum computing" -n 200 -c 2048

Model compatibility

Both models must share the same tokenizer (same model family). Using models from different families will produce incorrect output.

Recommended model pairs:

| Main Model (Large) | Draft Model (Small) | Expected Speedup |
|---|---|---|
| Llama 3.2 8B | Llama 3.2 1B | ~2-3x |
| Llama 3.1 70B | Llama 3.1 8B | ~2-4x |
| Qwen 2.5 14B | Qwen 2.5 3B | ~2-3x |
| Mistral 7B | Mistral 0.5B | ~2-3x |

2. Draftless (No Extra Model Needed)

These methods use n-gram pattern matching from the generated text itself — no second model required. Particularly effective for code refactoring, summarization, and reasoning models.

ngram-simple

Looks for the last matching n-gram in the generation history and uses the tokens that followed it as a draft. The simplest approach, with minimal overhead:

tinfer-server -m model.gguf --spec-type ngram-simple --draft-max 64
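The idea behind ngram-simple can be sketched as follows — a simplified model, not Tinfer's actual code: take the trailing n tokens, scan the history backwards for the most recent earlier occurrence of that n-gram, and propose the tokens that followed it as the draft:

```python
def ngram_simple_draft(tokens, n=3, draft_max=8):
    """Propose draft tokens by matching the trailing n-gram
    against earlier history (simplified sketch)."""
    if len(tokens) <= n:
        return []
    key = tokens[-n:]
    # Scan backwards for the most recent earlier occurrence.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == key:
            start = i + n
            return tokens[start:start + draft_max]
    return []  # no match: fall back to normal decoding

# Repetitive token streams (e.g. code being refactored) draft well:
history = [5, 6, 7, 8, 9, 1, 2, 5, 6, 7]
print(ngram_simple_draft(history, n=3, draft_max=4))  # -> [8, 9, 1, 2]
```

The drafted tokens still go through the normal verification pass, so a bad match costs only the wasted draft, never a wrong output token.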

ngram-map-k

Uses a hash map of n-grams in the current context. More accurate than ngram-simple; it requires a minimum number of occurrences before drafting:

tinfer-server -m model.gguf --spec-type ngram-map-k --draft-max 64
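A minimal sketch of the hash-map variant (again a simplification, not Tinfer's code): index every n-gram in the context, and only draft from a key once it has been seen at least `min_hits` times, which filters out one-off coincidental matches:

```python
from collections import defaultdict

def build_ngram_map(tokens, n=3):
    """Map each n-gram key to the positions where its continuation
    starts (simplified sketch of the hash-map draftless method)."""
    index = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        index[tuple(tokens[i:i + n])].append(i + n)
    return index

def ngram_map_draft(tokens, index, n=3, draft_max=4, min_hits=2):
    key = tuple(tokens[-n:])
    positions = index.get(key, [])
    # Require the key to occur min_hits times before drafting.
    if len(positions) < min_hits:
        return []
    start = positions[0]  # continuation after the earliest occurrence
    return tokens[start:start + draft_max]

history = [1, 2, 3, 9, 9, 1, 2, 3]
index = build_ngram_map(history, n=3)
print(ngram_map_draft(history, index, n=3))  # -> [9, 9, 1, 2]
```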

ngram-mod

Lightweight (~16 MB) hash-based approach with constant memory. The hash pool is shared across all server slots, so different requests benefit from each other:

tinfer-server -m model.gguf \
  --spec-type ngram-mod \
  --spec-ngram-size-n 24 \
  --draft-min 48 --draft-max 64
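The constant-memory property can be illustrated with a fixed-size table indexed by a hash of the n-gram — a hypothetical sketch of the approach, with a made-up hash function and pool size, not Tinfer's actual data structure. Because the pool never grows and is shared, any request's generations can seed drafts for the others:

```python
POOL_SIZE = 1 << 12        # fixed size -> constant memory, unlike a growing map
pool = [None] * POOL_SIZE  # shared across all "slots" (concurrent requests)

def slot_for(ngram):
    """Toy rolling hash mapping an n-gram to a pool slot."""
    h = 0
    for t in ngram:
        h = (h * 131 + t) % POOL_SIZE
    return h

def observe(tokens, n=3):
    """Record, for each n-gram seen, the token that followed it.
    Collisions simply overwrite older entries — that is the trade-off
    for constant memory."""
    for i in range(len(tokens) - n):
        pool[slot_for(tokens[i:i + n])] = tokens[i + n]

def draft(tokens, n=3, draft_max=4):
    """Chain lookups: each predicted token extends the lookup key."""
    ctx, out = list(tokens), []
    for _ in range(draft_max):
        nxt = pool[slot_for(ctx[-n:])]
        if nxt is None:
            break
        out.append(nxt)
        ctx.append(nxt)
    return out

observe([1, 2, 3, 4, 5, 6])          # one request seeds the pool...
print(draft([0, 1, 2, 3], n=3))      # ...another drafts from it: [4, 5, 6]
```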

Best use cases for draftless methods

  • Code refactoring — iterating over existing code blocks
  • Reasoning models — when they repeat thinking in the final answer
  • Summarization — text that overlaps heavily with the input

Combining Both Methods

You can use a draft model together with a draftless method. When combined, the draftless method takes precedence:

tinfer-server \
  -m models/large-model.gguf \
  -md models/small-model.gguf \
  --spec-type ngram-simple \
  --draft-max 64 --port 8080 -c 2048

Tuning Parameters

Draft Token Count

tinfer -m large.gguf -md small.gguf \
  --draft-max 16 \
  --draft-min 2 \
  --draft-p-min 0.75 \
  -p "Hello!" -n 200 -c 2048

| Flag | Default | Description |
|---|---|---|
| --draft-max | 16 | Max tokens to draft per iteration |
| --draft-min | 0 | Min tokens before verification |
| --draft-p-min | 0.75 | Min probability for draft acceptance (greedy) |
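The greedy acceptance rule that --draft-p-min controls can be sketched as: a drafted token is kept only while the main model assigns it at least that probability. The probabilities below are made up for illustration, and the function is a simplified stand-in, not Tinfer's verifier:

```python
def accept_drafts(draft_tokens, target_probs, p_min=0.75):
    """Accept the longest draft prefix whose tokens the main model
    assigns probability >= p_min (greedy acceptance sketch).

    target_probs[i] is the main model's probability for draft_tokens[i];
    in a real engine all of them come from one batched verification pass.
    """
    accepted = []
    for tok, p in zip(draft_tokens, target_probs):
        if p < p_min:
            break  # first low-confidence token ends acceptance
        accepted.append(tok)
    return accepted

# Hypothetical probabilities from a verification pass:
print(accept_drafts([10, 11, 12, 13], [0.98, 0.91, 0.40, 0.88]))
# -> [10, 11]: token 12 falls below p_min, so 13 is never reached
```

Raising the threshold trades fewer accepted tokens for fewer wasted verification slots; lowering it does the reverse.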

n-gram Parameters (Draftless)

| Flag | Default | Description |
|---|---|---|
| --spec-ngram-size-n | 12 | Size of the lookup n-gram (key length) |
| --spec-ngram-size-m | 48 | Size of the draft m-gram (draft length) |
| --spec-ngram-min-hits | 1 | Min occurrences before using an n-gram as a draft |

GPU Layer Control

tinfer -m large.gguf -md small.gguf \
  -ngl 99 -ngld 99 \
  -p "Hello!" -n 200 -c 2048

| Flag | Description |
|---|---|
| -ngl | GPU layers for the main model |
| -ngld | GPU layers for the draft model |
| -cd | Context size for the draft model (0 = same as main) |
| -devd | Device for draft-model offloading |

Low VRAM strategy

Keep the draft model on the CPU (-ngld 0) while the main model uses the GPU (-ngl 99). The draft model is small enough to run quickly on the CPU.


Reading Statistics

Run with --verbose to see acceptance statistics:

draft acceptance rate = 0.57576 (171 accepted / 297 generated)
statistics draft: #calls = 10, #gen drafts = 10, #acc drafts = 10,
  #gen tokens = 110, #acc tokens = 98

| Metric | Meaning |
|---|---|
| acceptance rate | Fraction of draft tokens accepted by the main model |
| #gen tokens | Total tokens generated by the draft (including rejected) |
| #acc tokens | Tokens accepted by the main model |
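As a quick sanity check, the acceptance rate in the log above is simply accepted tokens divided by generated tokens:

```python
accepted, generated = 171, 297
rate = accepted / generated
print(f"draft acceptance rate = {rate:.5f}")  # matches the logged 0.57576
```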

Target acceptance rate

Aim for an acceptance rate above 50%. If it is lower, try a better-matched draft model or reduce --draft-max.


Spec Type Reference

| Type | Draft Model? | Description |
|---|---|---|
| none | — | No speculative decoding (default) |
| draft | Yes | Use a separate draft model |
| ngram-cache | No | N-gram cache lookup |
| ngram-simple | No | Simple n-gram pattern matching |
| ngram-map-k | No | N-gram pattern matching with hash-map keys |
| ngram-map-k4v | No | N-gram with up to 4 tracked values (experimental) |
| ngram-mod | No | Hash-based n-gram with shared pool across slots |