Layer Offloading¶
Layer offloading is an experimental feature that enables running models larger than your GPU VRAM by dynamically swapping layer weights between Disk, CPU, and GPU using a sliding window with async prefetching.
The Problem¶
When -ngl is less than the total layer count, the remaining layers compute on CPU, which is significantly slower. Users with limited VRAM are stuck with slow CPU inference for those layers.
The Solution¶
Layer offloading keeps only a window of N layers in GPU staging buffers at a time. For each CPU-tier layer, the following cycle runs:
- Swap in — CPU layer weights are temporarily redirected to GPU staging via pointer swaps
- Transfer — Weight data is copied from CPU → GPU staging buffer
- Compute — GPU computes using the staged weights (fast!)
- Swap back — Pointers are restored to original CPU locations
This means even if you can only fit 5 layers permanently in VRAM, all layers still get GPU-speed compute through the sliding window.
Three Tiers¶
| Tier | Location | Behavior |
|---|---|---|
| GPU | VRAM (permanent) | Controlled by -ngl, always on GPU |
| CPU | System RAM | Windowed into GPU staging as needed |
| Disk | GGUF file | Loaded into CPU cache on demand (LRU eviction) |
Model: 48 layers total, -ngl 5
Layers 0-42: CPU tier ─── windowed through GPU staging ───→ GPU compute
Layers 43-47: GPU tier ─── permanently on GPU ─────────────→ GPU compute
- `-ngl` controls how many layers are permanently on GPU
- `--layer-window` controls how CPU-tier layers are temporarily windowed into GPU
- If `-ngl` covers all layers, `--layer-window` has no effect
Quick Start¶
# Auto-detect window size based on free VRAM
tinfer -m model.gguf -ngl 5 --layer-window auto -p "Hello" -n 100
# Manual window size (4 CPU layers windowed at a time)
tinfer -m model.gguf -ngl 5 --layer-window 4 -p "Hello" -n 100
# Disable async prefetching (sync transfer only)
tinfer -m model.gguf -ngl 5 --layer-window auto --no-layer-prefetch -p "Hello" -n 100
# Works with the server too
tinfer-server -m model.gguf -ngl 5 --layer-window auto --port 8080
Flags¶
| Flag | Default | Description |
|---|---|---|
| `--layer-window N` | `0` (disabled) | `auto` = detect from free VRAM, or exact number of layers to window. (env: `LLAMA_ARG_LAYER_WINDOW`) |
| `--no-layer-prefetch` | prefetch enabled | Disable async prefetching of the next window |
How Auto-Detection Works¶
When you use --layer-window auto, the system:
- Measures free VRAM after loading the model
- Calculates the largest layer size
- Determines how many layers can fit in a double-buffered staging area
- Sets the window size accordingly
Architecture¶
CLI (--layer-window N)
→ llama_model_params.layer_window / .layer_prefetch
→ load_tensors(): init window, assign tiers, allocate staging
→ process_ubatch(): swap_layer_to_gpu → graph_compute → swap_layer_to_cpu
Phase A (scheduler-level):
Prefetch next split's inputs during current split compute
Phase B (layer-level):
Sliding window of N layers on double-buffered GPU staging
Pointer swap preserves graph topology
Phase C (disk tier):
Read layers from GGUF file via direct I/O
LRU CPU cache with eviction
Double-Buffered Staging¶
Two staging buffer slots allow computing from one while loading data into the other, enabling overlap of data transfer and computation.
Key Design Decisions¶
- Graph reuse preserved — only `tensor->data` and `tensor->buffer` pointers are swapped; graph nodes never change
- Double-buffered staging — overlap data transfer and computation
- Automatic tensor handling — adapts to upstream model struct changes automatically
Limitations¶
- Pinned memory (`cudaMallocHost`) not yet used (falls back to `malloc`)
- Disk tier I/O uses synchronous reads (not yet producer-consumer queued)
- Not yet tested with speculative decoding or multi-GPU split modes