# PagedAttention
PagedAttention is an experimental memory management feature for the KV cache that reduces memory fragmentation and enables efficient context shifting for multi-sequence workloads.
## The Problem
The default KV cache uses a contiguous ring buffer. When multiple sequences are active (e.g., parallel requests on a server), memory gets fragmented — gaps appear between sequences, wasting VRAM.
## The Solution
PagedAttention divides the KV cache into fixed-size blocks (default: 32 tokens). Sequences map logical token positions to physical blocks via a block table — similar to how an operating system manages virtual memory with paging.
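As an illustrative sketch (not the project's actual code), the logical-to-physical translation with 32-token blocks can be expressed as:

```python
BLOCK_SIZE = 32  # tokens per block (the documented default)

def translate(block_table, logical_pos):
    """Map a logical token position to (physical_block, offset),
    analogous to virtual-memory page translation."""
    return block_table[logical_pos // BLOCK_SIZE], logical_pos % BLOCK_SIZE

# A sequence whose logical blocks 0..2 landed in physical blocks 7, 3, 9:
table = [7, 3, 9]
print(translate(table, 70))  # token 70 lives at offset 6 of physical block 9: (9, 6)
```

Because only the table entries change, blocks belonging to one sequence need not be contiguous in VRAM.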
## Key Benefits
| Benefit | Description |
|---|---|
| Zero fragmentation | Blocks are allocated on demand — no wasted gaps between sequences |
| O(1) context shift | Remove or remap blocks in the table instead of moving actual data |
| Copy-on-Write (CoW) | Shared sequences (e.g., beam search) share blocks until one writes, then a copy is made |
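Copy-on-write, the last row above, can be sketched with a hypothetical reference-counting pool (names and structure are illustrative, not tinfer's implementation):

```python
from collections import defaultdict

class CowPool:
    """Minimal copy-on-write block sharing sketch (illustrative only)."""
    def __init__(self):
        self.refcount = defaultdict(int)
        self.next_block = 0

    def alloc(self):
        blk = self.next_block
        self.next_block += 1
        self.refcount[blk] = 1
        return blk

    def fork(self, blk):
        # A forked sequence (e.g. one beam) shares the block: just bump the refcount.
        self.refcount[blk] += 1
        return blk

    def write(self, blk):
        # Writing to a shared block triggers a copy; exclusive blocks write in place.
        if self.refcount[blk] > 1:
            self.refcount[blk] -= 1
            return self.alloc()  # copy-on-write: caller gets a private block
        return blk

pool = CowPool()
a = pool.alloc()      # sequence A owns block 0
b = pool.fork(a)      # sequence B shares it
print(pool.write(b))  # B writes -> private copy, block 1
print(pool.write(a))  # A is now exclusive -> writes in place, block 0
```

Beams that share a long prompt therefore pay for one copy of its KV blocks until they diverge.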
## Quick Start
```shell
# Enable PagedAttention
tinfer -m model.gguf --kv-cache-paged -p "Hello" -n 100

# With the server (great for multi-user)
tinfer-server -m model.gguf --kv-cache-paged --port 8080

# Disable (default behavior)
tinfer -m model.gguf --no-kv-cache-paged -p "Hello" -n 100
```
## Flags
| Flag | Default | Description |
|---|---|---|
| `--kv-cache-paged` | disabled | Enable paged KV cache |
| `--no-kv-cache-paged` | — | Explicitly disable paged KV cache |
## Block Size
The default block size is 32 tokens, which aligns with:
- F16 KV cache
- Q8_0 quantized KV cache (quant block = 32)
- Q4_0 quantized KV cache (quant block = 32)
> **Note:** Q4_K quantization uses a quant block of 256, which may need a larger block size for optimal alignment. This is a future optimization.
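With 32-token blocks, per-sequence block counts follow simple ceiling division. A rough sketch of the arithmetic:

```python
BLOCK_SIZE = 32  # the documented default

def blocks_needed(n_tokens):
    # Ceiling division: a partially filled block still occupies a whole block.
    return -(-n_tokens // BLOCK_SIZE)

print(blocks_needed(100))  # 4 blocks: 3 full + 1 partial

# A Q4_K quant block of 256 values spans 256 // BLOCK_SIZE = 8 cache blocks,
# which is why a larger block size could improve alignment for Q4_K.
print(256 // BLOCK_SIZE)   # 8
```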
## How It Works Internally
```
CLI (--kv-cache-paged)
  → llama_context_params.kv_cache_paged
  → llama_kv_cache constructor (paged=true)
  → BlockAllocator (manages free block pool)
  → BlockTable (maps seq_id → block list)
```
BlockAllocator — manages a pool of free blocks. When a sequence needs more room, it allocates a block. When a sequence is removed, its blocks return to the free pool.
BlockTable — maps each sequence ID to a list of blocks. Lookup is O(1). Context shifting just remaps the table entries instead of copying data.
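A Python sketch of the roles `BlockAllocator` and `BlockTable` play (illustrative only, not the actual implementation):

```python
class BlockAllocator:
    """Pool of free blocks; sequences draw from and return to it."""
    def __init__(self, n_blocks):
        self.free = list(range(n_blocks))

    def alloc(self):
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)  # removed sequences recycle their blocks

class BlockTable:
    """Maps seq_id -> ordered list of physical blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.table = {}

    def append_block(self, seq_id):
        self.table.setdefault(seq_id, []).append(self.allocator.alloc())

    def shift(self, seq_id, n_blocks):
        # Context shift: drop the oldest table entries and recycle them.
        # Only the mapping changes -- no KV data is copied or moved.
        old = self.table[seq_id][:n_blocks]
        self.table[seq_id] = self.table[seq_id][n_blocks:]
        self.allocator.release(old)

allocator = BlockAllocator(8)
bt = BlockTable(allocator)
for _ in range(3):
    bt.append_block(0)  # sequence 0 grows to 3 blocks
bt.shift(0, 1)          # shift out the oldest block
print(len(bt.table[0]), len(allocator.free))  # 2 blocks live, 6 free
```

The same recycling path is what keeps fragmentation at zero: a freed block is immediately reusable by any sequence.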
## Performance
Tested with SmolLM2-135M-Instruct-Q8_0 on CPU:
| Mode | Prompt (tokens/sec) | Generation (tokens/sec) |
|---|---|---|
| Baseline (ring buffer) | 374.8 | 24.3 |
| PagedAttention | 512.5 | 27.6 |
No throughput regression — PagedAttention matches or exceeds baseline performance.
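From the table above, the relative speedups work out to:

```python
baseline = {"prompt": 374.8, "generation": 24.3}  # tokens/sec, ring buffer
paged    = {"prompt": 512.5, "generation": 27.6}  # tokens/sec, PagedAttention

for stage in baseline:
    speedup = paged[stage] / baseline[stage]
    print(f"{stage}: {speedup:.2f}x")  # prompt: 1.37x, generation: 1.14x
```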
## When to Use PagedAttention
| Scenario | Recommendation |
|---|---|
| Single-user CLI chat | Optional — minimal fragmentation anyway |
| Multi-user server | Recommended — significant memory savings |
| Beam search / parallel decoding | Recommended — CoW reduces memory by sharing |
| Long context generation | Recommended — O(1) context shifting |
## Limitations
- Block size is fixed at 32 (not yet configurable)
- Q4_K quantization may not align optimally with block_size=32
- No custom paged flash attention kernel yet (future optimization)