KV Cache Eviction¶
Smart KV eviction is an experimental feature that selectively removes KV cache entries when the cache is full, enabling arbitrarily long text generation without the quality degradation of context shifting.
The Problem¶
When the KV cache fills up during long generation, the default behavior is context shifting — discarding the oldest half of the cache and shifting positions. This causes quality degradation because the model loses important context from the middle of the conversation.
The Solution¶
Smart KV eviction selectively removes individual cache entries based on their importance, preserving:
- Sink tokens — first N positions (attention sinks essential for model stability)
- Protected tokens — system prompt or other critical prefix tokens
- Recent window — the last 25% of each sequence (at least 32 positions; the part actively being generated)
Only the "middle" tokens — the least important for ongoing generation — are evicted.
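As a sketch, the kept-vs-evictable partition might look like the following (illustrative Python; the sink/protected/recent categories and the `max(32, seq_len/4)` recent window come from this page, but the function itself is not tinfer's actual code):

```python
def partition_positions(seq_len, sink_tokens=4, protected_tokens=0):
    """Illustrative split of positions [0, seq_len) into kept vs. evictable.

    Kept: attention sinks, the protected prefix, and the recent window
    (the max(32, seq_len // 4) most recent positions).
    """
    recent_start = max(seq_len - max(32, seq_len // 4), 0)
    # Sinks and protected tokens are both prefixes, so keep the longer one.
    kept = set(range(min(max(sink_tokens, protected_tokens), seq_len)))
    kept |= set(range(recent_start, seq_len))
    evictable = [p for p in range(seq_len) if p not in kept]
    return kept, evictable
```

With a full 2048-position cache and the defaults (4 sinks, no protected prefix), only the 1532 "middle" positions between the sinks and the 512-position recent window are candidates for eviction.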
Two Eviction Modes¶
Mode 1: Streaming (--kv-eviction 1)¶
Evicts the oldest middle tokens first (position-based). Simple, fast, and reliable.
Based on the StreamingLLM paper — which discovered that the first few tokens ("attention sinks") receive disproportionate attention from the model. Keeping sinks + a recent window preserves generation quality even over very long contexts.
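In pseudocode terms, Mode 1's candidate selection is just position order (a hypothetical sketch, assuming `evictable` is the set of middle positions; not tinfer's internals):

```python
def streaming_candidates(evictable, n_needed):
    """Mode 1: pick the oldest (lowest-position) middle tokens first."""
    return sorted(evictable)[:n_needed]

# e.g. streaming_candidates([900, 4, 500, 10], 2) → [4, 10]
```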
Mode 2: Scored (--kv-eviction 2)¶
Evicts the least-recently-accessed tokens first. Better quality for long conversations because it keeps tokens the model is actively referencing, regardless of their position.
Inspired by H2O (Heavy-Hitter Oracle) — which tracks cumulative attention scores to identify important tokens. This mode uses a lightweight access-recency approximation instead of full attention tracking.
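A minimal LRU-style approximation of Mode 2 might look like this (illustrative; `last_access` mapping positions to the generation step at which each was last read is an assumption, and tinfer's real bookkeeping may differ):

```python
def scored_candidates(evictable, last_access, n_needed):
    """Mode 2: pick positions whose last access is oldest (LRU-style)."""
    return sorted(evictable, key=lambda p: last_access.get(p, 0))[:n_needed]

# e.g. with last_access = {4: 100, 5: 7, 6: 50},
# scored_candidates([4, 5, 6], last_access, 2) → [5, 6]
```

Unlike Mode 1, a middle token that the model keeps re-reading stays in the cache no matter how old its position is.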
Comparison¶
| Aspect | Streaming (Mode 1) | Scored (Mode 2) |
|---|---|---|
| Strategy | Evict oldest middle tokens | Evict least-recently-accessed |
| Speed | Faster (no scoring overhead) | Slightly slower |
| Quality | Good | Better for long conversations |
| Best for | Simple generation, speed-critical | Multi-turn chat, document processing |
Quick Start¶
```sh
# Streaming mode
tinfer -m model.gguf --kv-eviction 1 --ctx-size 2048 -p "Hello" -n 5000

# Scored mode (better quality)
tinfer -m model.gguf --kv-eviction 2 --ctx-size 2048 -p "Hello" -n 5000

# Custom sink + protected tokens
tinfer -m model.gguf --kv-eviction 1 --kv-sink-tokens 8 --kv-protected-tokens 128

# Works with the server
tinfer-server -m model.gguf --kv-eviction 1 --port 8080

# Disabled (default, falls back to context shift)
tinfer -m model.gguf --kv-eviction 0 -p "Hello" -n 100
```
Flags¶
| Flag | Default | Description |
|---|---|---|
| `--kv-eviction MODE` | `0` (disabled) | `0` = none (context shift), `1` = streaming, `2` = scored |
| `--kv-sink-tokens N` | `4` | Number of initial positions to always keep (range: 0–256). These are "attention sinks" — tokens the model always references. |
| `--kv-protected-tokens N` | `0` | Number of positions to protect from eviction. Set this to your system prompt length to ensure it is never evicted. |
How Eviction Triggers¶
Eviction happens automatically — no manual intervention needed:
```
prepare() fails (no free slots)
  → evict_cells(32+)    ← removes the least important tokens
  → retry prepare()
  → success!            ← generation continues seamlessly
```
If eviction can't free enough cells (everything is protected), it falls back to the standard failure path.
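The retry loop above can be modeled with a toy cache (the `ToyCache` class is purely illustrative; only the `prepare`/`evict_cells`/retry flow mirrors this page, not tinfer's real API):

```python
class ToyCache:
    """Toy fixed-capacity cache with a protected region (illustrative only)."""
    def __init__(self, capacity, protected):
        self.capacity = capacity
        self.used = 0
        self.protected = protected  # cells that can never be evicted

    def prepare(self, n):
        """Reserve n cells; fail if there is no room."""
        if self.used + n > self.capacity:
            return False
        self.used += n
        return True

    def evict_cells(self, n):
        """Free up to n unprotected cells; return how many were freed."""
        freed = min(n, max(self.used - self.protected, 0))
        self.used -= freed
        return freed

def allocate_with_eviction(cache, n, min_evict=32):
    """Try to allocate; on failure, evict at least min_evict cells and retry."""
    if cache.prepare(n):
        return True
    if cache.evict_cells(max(min_evict, n)) == 0:
        return False  # everything protected: fall back to the standard failure path
    return cache.prepare(n)
```

In this model, a full 64-cell cache with 16 protected cells frees 32 cells and the retry succeeds, while a cache that is entirely protected frees nothing and reports failure, matching the fallback behavior described above.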
What Are Sink Tokens?¶
Attention sinks are the first few tokens in a sequence that consistently receive high attention scores from the model, regardless of their actual content. This was discovered by the StreamingLLM paper.
Why they matter:
- The model's attention mechanism relies on these positions as "anchors"
- Removing them destabilizes generation and causes quality collapse
- Even with a 2K context, keeping just 4 sink tokens prevents degradation
The default of --kv-sink-tokens 4 works well for most models.
Protecting System Prompts¶
If you use a system prompt (e.g., 128 tokens of instructions), protect it from eviction by setting `--kv-protected-tokens 128`, as in the quick-start example above.
This ensures your system prompt stays in the cache even during very long conversations.
Safety Features¶
- Shared cells preserved — cells used by multiple sequences are never evicted
- Per-sequence recent window — protects the `max(32, seq_len/4)` most recent positions per sequence
- Input validation — range checks on all parameters
- ISWA compatibility — only the base cache is evicted; SWA cache uses native sliding window
- Zero impact when disabled — the default `--kv-eviction 0` adds no runtime overhead
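The per-sequence recent-window size in the list above works out as follows (a trivial sketch of the stated formula):

```python
def recent_window(seq_len):
    """Protected recent positions per sequence: max(32, seq_len // 4)."""
    return max(32, seq_len // 4)

# recent_window(64)   → 32   (the 32-position floor dominates)
# recent_window(2048) → 512  (25% of the sequence)
```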
Limitations¶
- No per-layer scoring (all layers share the same eviction decision)
- Scored mode uses access recency, not true attention weights (lightweight approximation)
- Not yet tested with speculative decoding
- Eviction granularity is per-cell (not per-block for PagedAttention mode)