
Layer Offloading

Layer offloading is an experimental feature that enables running models larger than your GPU VRAM by dynamically moving layer weights across the Disk, CPU, and GPU tiers, using a sliding window with async prefetching.


The Problem

When -ngl is less than the total layer count, the remaining layers compute on CPU, which is significantly slower. Users with limited VRAM are stuck with slow CPU inference for those layers.

The Solution

Layer offloading keeps only a sliding window of N layers in GPU staging buffers at a time. Each CPU-tier layer goes through the following cycle:

  1. Swap in — CPU layer weights are temporarily redirected to GPU staging via pointer swaps
  2. Transfer — Weight data is copied from CPU → GPU staging buffer
  3. Compute — GPU computes using the staged weights (fast!)
  4. Swap back — Pointers are restored to original CPU locations

This means even if you can only fit 5 layers permanently in VRAM, all layers still get GPU-speed compute through the sliding window.
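The four-step cycle above can be sketched in a few lines. The `tensor` struct and `run_layer` helper here are hypothetical (not the real API), and plain host memory stands in for the GPU staging buffer so the sketch stays self-contained:

```cpp
#include <cstring>
#include <vector>

// Hypothetical tensor struct: only the fields relevant to the swap are shown.
struct tensor {
    float* data;    // points at CPU weights, or at GPU staging while swapped in
    size_t nbytes;
};

// Sketch of one layer's swap-in / transfer / compute / swap-back cycle.
// `staging` stands in for a real device buffer. Returns a toy compute result.
float run_layer(tensor& w, std::vector<float>& staging) {
    float* cpu_data = w.data;                   // remember the original CPU location
    w.data = staging.data();                    // 1. swap in: redirect the pointer
    std::memcpy(w.data, cpu_data, w.nbytes);    // 2. transfer CPU -> staging
    float acc = 0.0f;                           // 3. compute from the staged weights
    for (size_t i = 0; i < w.nbytes / sizeof(float); ++i) acc += w.data[i];
    w.data = cpu_data;                          // 4. swap back: restore the pointer
    return acc;
}
```

Because only the `data` pointer moves, the graph nodes themselves never change, which is what preserves graph reuse.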


Three Tiers

Tier   Location           Behavior
GPU    VRAM (permanent)   Controlled by -ngl, always on GPU
CPU    System RAM         Windowed into GPU staging as needed
Disk   GGUF file          Loaded into CPU cache on demand (LRU eviction)

Example: model with 48 layers total, -ngl 5

Layer 0-42:  CPU tier  ─── windowed through GPU staging ───→ GPU compute
Layer 43-47: GPU tier  ─── permanently on GPU ──────────────→ GPU compute
  • -ngl controls how many layers are permanently on GPU
  • --layer-window controls how CPU-tier layers are temporarily windowed into GPU
  • If -ngl covers all layers, --layer-window has no effect
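The tier assignment in the example follows from -ngl alone: the last ngl layers stay on GPU, everything before them is CPU-tier. A hypothetical helper (name and signature are illustrative, not the real API):

```cpp
#include <string>

// Sketch: which tier a layer lands on, given the total layer count and -ngl.
// With n_layers = 48 and ngl = 5, layers 43..47 are permanently on GPU
// and layers 0..42 are CPU-tier, windowed through staging.
std::string layer_tier(int layer, int n_layers, int ngl) {
    return layer >= n_layers - ngl ? "gpu" : "cpu";
}
```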

Quick Start

# Auto-detect window size based on free VRAM
tinfer -m model.gguf -ngl 5 --layer-window auto -p "Hello" -n 100

# Manual window size (4 CPU layers windowed at a time)
tinfer -m model.gguf -ngl 5 --layer-window 4 -p "Hello" -n 100

# Disable async prefetching (sync transfer only)
tinfer -m model.gguf -ngl 5 --layer-window auto --no-layer-prefetch -p "Hello" -n 100

# Works with the server too
tinfer-server -m model.gguf -ngl 5 --layer-window auto --port 8080

Flags

Flag                  Default            Description
--layer-window N      0 (disabled)       auto = detect from free VRAM, or an exact number of layers to window (env: LLAMA_ARG_LAYER_WINDOW)
--no-layer-prefetch   prefetch enabled   Disable async prefetching of the next window

How Auto-Detection Works

When you use --layer-window auto, the system:

  1. Measures free VRAM after loading the model
  2. Calculates the largest layer size
  3. Determines how many layers can fit in a double-buffered staging area
  4. Sets the window size accordingly
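Assuming the staging area is split evenly between the two buffers, the arithmetic might look like the sketch below. The helper name, the exact split, and the rounding are assumptions, not the real implementation:

```cpp
#include <cstddef>

// Hypothetical auto-detection arithmetic: half of free VRAM per staging
// buffer (double buffering), divided by the largest layer's size.
size_t auto_window(size_t free_vram, size_t largest_layer) {
    if (largest_layer == 0) return 0;
    size_t per_buffer = free_vram / 2;      // two staging buffers share free VRAM
    return per_buffer / largest_layer;      // 0 would mean no layer fits
}
```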

Architecture

CLI (--layer-window N)
  → llama_model_params.layer_window / .layer_prefetch
  → load_tensors(): init window, assign tiers, allocate staging
  → process_ubatch(): swap_layer_to_gpu → graph_compute → swap_layer_to_cpu

Phase A (scheduler-level):
  Prefetch next split's inputs during current split compute

Phase B (layer-level):
  Sliding window of N layers on double-buffered GPU staging
  Pointer swap preserves graph topology

Phase C (disk tier):
  Read layers from GGUF file via direct I/O
  LRU CPU cache with eviction

Double-Buffered Staging

Two staging buffer slots allow computing from one while loading data into the other, enabling overlap of data transfer and computation.
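The overlap can be illustrated with two host-memory slots and `std::async` standing in for the real asynchronous device transfer (the `pipeline` helper is a sketch, not the actual scheduler):

```cpp
#include <future>
#include <vector>

// Double-buffered pipeline sketch: while layer k computes from one slot,
// layer k+1 is loaded into the other slot on a background thread.
float pipeline(const std::vector<std::vector<float>>& layers) {
    std::vector<float> slot[2];
    float acc = 0.0f;
    slot[0] = layers[0];                          // synchronous load of the first layer
    for (size_t k = 0; k < layers.size(); ++k) {
        std::future<std::vector<float>> next;
        if (k + 1 < layers.size())                // prefetch into the *other* slot
            next = std::async(std::launch::async,
                              [&layers, k] { return layers[k + 1]; });
        for (float w : slot[k % 2]) acc += w;     // compute from the current slot
        if (next.valid()) slot[(k + 1) % 2] = next.get();
    }
    return acc;
}
```

In the real system the "load" is a CPU→GPU copy and the "compute" is a graph split, but the alternation between the two slots is the same.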


Key Design Decisions

  • Graph reuse preserved — only tensor->data and tensor->buffer pointers are swapped, graph nodes never change
  • Double-buffered staging — overlap data transfer and computation
  • Automatic tensor handling — adapts to upstream model struct changes

Limitations

  • Pinned memory (cudaMallocHost) not yet used (falls back to malloc)
  • Disk tier I/O is synchronous reads (not yet producer-consumer queued)
  • Not yet tested with speculative decoding or multi-GPU split modes