
PagedAttention

PagedAttention is an experimental memory management feature for the KV cache that reduces memory fragmentation and enables efficient context shifting for multi-sequence workloads.


The Problem

The default KV cache uses a contiguous ring buffer. When multiple sequences are active (e.g., parallel requests on a server), memory gets fragmented — gaps appear between sequences, wasting VRAM.

The Solution

PagedAttention divides the KV cache into fixed-size blocks (default: 32 tokens). Sequences map logical token positions to physical blocks via a block table — similar to how an operating system manages virtual memory with paging.
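The translation step can be sketched as a page-table lookup. This is a minimal illustration under the default block size of 32, not tinfer's actual implementation; names are hypothetical:

```python
BLOCK_SIZE = 32  # default block size in tokens

def translate(block_table, logical_pos):
    """Map a logical token position to (physical_block, offset),
    analogous to virtual-to-physical page translation."""
    block_idx = logical_pos // BLOCK_SIZE   # which entry in the block table
    offset = logical_pos % BLOCK_SIZE       # position inside that block
    return block_table[block_idx], offset

# A sequence whose tokens live in physical blocks 7, 2, 9 (in logical order):
table = [7, 2, 9]
print(translate(table, 40))  # (2, 8): logical position 40 is in block 2, offset 8
```

Because the table holds only block IDs, the physical blocks need not be contiguous or ordered in memory.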


Key Benefits

| Benefit | Description |
|---|---|
| Zero fragmentation | Blocks are allocated on demand, so no wasted gaps between sequences |
| O(1) context shift | Remove or remap blocks in the table instead of moving actual data |
| Copy-on-Write (CoW) | Shared sequences (e.g., beam search) share blocks until one writes, then a copy is made |
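The copy-on-write behavior can be illustrated with reference-counted blocks. This is a hedged sketch of the idea only; `CowBlock`, `fork`, and `write` are invented names, not tinfer's API:

```python
class CowBlock:
    """Illustrative refcounted KV block (not the real data structure)."""
    def __init__(self, data):
        self.data = list(data)
        self.refcount = 1

def fork(table):
    """Share all blocks between two sequences (e.g., a beam search split)."""
    for b in table:
        b.refcount += 1
    return list(table)

def write(table, idx, offset, value):
    """Copy-on-write: duplicate a shared block before mutating it."""
    b = table[idx]
    if b.refcount > 1:          # block is shared -> make a private copy first
        b.refcount -= 1
        b = CowBlock(b.data)
        table[idx] = b
    b.data[offset] = value

# Beam search split: the child shares the parent's blocks until it writes.
parent = [CowBlock([0] * 4)]
child = fork(parent)
write(child, 0, 1, 99)   # triggers a private copy; the parent is untouched
```

Until the first write, both beams pay for one copy of the shared prefix.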

Quick Start

```bash
# Enable PagedAttention
tinfer -m model.gguf --kv-cache-paged -p "Hello" -n 100

# With the server (great for multi-user workloads)
tinfer-server -m model.gguf --kv-cache-paged --port 8080

# Disable (default behavior)
tinfer -m model.gguf --no-kv-cache-paged -p "Hello" -n 100
```

Flags

| Flag | Default | Description |
|---|---|---|
| `--kv-cache-paged` | disabled | Enable the paged KV cache |
| `--no-kv-cache-paged` | enabled | Explicitly disable the paged KV cache (the default behavior) |

Block Size

The default block size is 32 tokens, which aligns with:

  • F16 KV cache
  • Q8_0 quantized KV cache (quant block = 32)
  • Q4_0 quantized KV cache (quant block = 32)

Note

Q4_K quantization uses a quant block of 256, which may need a larger block size for optimal alignment. This is a future optimization.
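The alignment argument is simple arithmetic: a sequence occupies a ceiling division of its token count by the block size, and a quant format lines up cleanly when the block size is a multiple of its quant block. A small illustration of that reasoning:

```python
import math

BLOCK_SIZE = 32  # default KV cache block size in tokens

def blocks_needed(n_tokens):
    """Number of blocks a sequence occupies (last block may be partial)."""
    return math.ceil(n_tokens / BLOCK_SIZE)

def aligns(quant_block):
    """A quant format aligns if blocks hold a whole number of quant blocks."""
    return BLOCK_SIZE % quant_block == 0

print(blocks_needed(100))        # 4: 100 tokens span four 32-token blocks
print(aligns(32), aligns(256))   # True False: Q8_0/Q4_0 align, Q4_K does not
```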


How It Works Internally

CLI (--kv-cache-paged)
  → llama_context_params.kv_cache_paged
  → llama_kv_cache constructor (paged=true)
      → BlockAllocator (manages free block pool)
      → BlockTable (maps seq_id → block list)

BlockAllocator — manages a pool of free blocks. When a sequence needs more room, it allocates a block. When a sequence is removed, its blocks return to the free pool.

BlockTable — maps each sequence ID to a list of blocks. Lookup is O(1). Context shifting just remaps the table entries instead of copying data.
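The interaction between the two structures can be sketched as follows. This is an illustrative model of the behavior described above, assuming invented Python names, not tinfer's C++ internals:

```python
class BlockAllocator:
    """Manages the pool of free physical block IDs."""
    def __init__(self, n_blocks):
        self.free = list(range(n_blocks))   # all blocks start free

    def alloc(self):
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)            # blocks return to the free pool

class BlockTable:
    """Maps seq_id -> ordered list of physical blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.table = {}

    def grow(self, seq_id):
        """Give a sequence one more block when it needs room."""
        self.table.setdefault(seq_id, []).append(self.allocator.alloc())

    def shift(self, seq_id, n_blocks):
        """Context shift: drop the oldest blocks by remapping, no data copy."""
        seq = self.table[seq_id]
        self.allocator.release(seq[:n_blocks])
        self.table[seq_id] = seq[n_blocks:]

    def remove(self, seq_id):
        """Removing a sequence returns all of its blocks to the pool."""
        self.allocator.release(self.table.pop(seq_id))
```

In this model a context shift touches only the table entry for the sequence, which is why it costs O(1) per block rather than a copy of the cache contents.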


Performance

Tested with SmolLM2-135M-Instruct-Q8_0 on CPU:

| Mode | Prompt (tokens/sec) | Generation (tokens/sec) |
|---|---|---|
| Baseline (ring buffer) | 374.8 | 24.3 |
| PagedAttention | 512.5 | 27.6 |

No throughput regression: in this test PagedAttention beat the baseline on both prompt processing (+37%) and generation (+14%).
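The relative gains work out as a simple ratio check on the table above:

```python
prompt_base, prompt_paged = 374.8, 512.5
gen_base, gen_paged = 24.3, 27.6

prompt_gain = (prompt_paged / prompt_base - 1) * 100
gen_gain = (gen_paged / gen_base - 1) * 100
print(f"prompt: +{prompt_gain:.0f}%, generation: +{gen_gain:.0f}%")
# prompt: +37%, generation: +14%
```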


When to Use PagedAttention

| Scenario | Recommendation |
|---|---|
| Single-user CLI chat | Optional; fragmentation is minimal anyway |
| Multi-user server | Recommended; significant memory savings |
| Beam search / parallel decoding | Recommended; CoW reduces memory by sharing blocks |
| Long context generation | Recommended; O(1) context shifting |

Limitations

  • Block size is fixed at 32 (not yet configurable)
  • Q4_K quantization may not align optimally with block_size=32
  • No custom paged flash attention kernel yet (future optimization)