# CLI Reference

The `tinfer` command provides direct text generation and interactive chat from your terminal.
## Basic Usage

```bash
# Simple prompt
tinfer -m model.gguf -p "What is AI?" -n 100

# Interactive conversation mode
tinfer -m model.gguf -cnv

# With GPU acceleration (offload all layers)
tinfer -m model.gguf -p "Hello" -ngl 99
```
## Model Options

| Flag | Description | Default |
|------|-------------|---------|
| `-m, --model FNAME` | Path to the GGUF model file | — |
| `-mu, --model-url URL` | Download model from URL | — |
| `-hf, --hf-repo <user>/<model>[:quant]` | HuggingFace model repository; the quant suffix is optional and defaults to `Q4_K_M` | — |
| `-hff, --hf-file FILE` | Specific file from the HuggingFace repo | — |
| `-hft, --hf-token TOKEN` | HuggingFace access token (env: `HF_TOKEN`) | — |
## Generation Options

| Flag | Description | Default |
|------|-------------|---------|
| `-p, --prompt TEXT` | Text prompt for generation | — |
| `-f, --file FNAME` | Read the prompt from a file | — |
| `-n, --predict N` | Number of tokens to predict (-1 = infinite) | -1 |
| `-c, --ctx-size N` | Context window size (0 = loaded from the model) | 0 |
| `-b, --batch-size N` | Logical maximum batch size | 2048 |
| `-ub, --ubatch-size N` | Physical maximum batch size | 512 |
| `-cnv, --conversation` | Enable conversation mode | off |
| `-e, --escape` | Process escape sequences (`\n`, `\t`, etc.) | on |
| `--keep N` | Tokens to keep from the initial prompt (0 = none, -1 = all) | 0 |
## CPU / Thread Options

| Flag | Description | Default |
|------|-------------|---------|
| `-t, --threads N` | CPU threads for generation (env: `LLAMA_ARG_THREADS`) | auto |
| `-tb, --threads-batch N` | Threads for batch/prompt processing | same as `-t` |
| `-C, --cpu-mask M` | CPU affinity mask (hex) | — |
| `-Cr, --cpu-range lo-hi` | CPU range for affinity | — |
| `--cpu-strict <0\|1>` | Strict CPU placement | 0 |
| `--prio N` | Process priority: -1 = low, 0 = normal, 1 = medium, 2 = high, 3 = realtime | 0 |
## GPU Options

| Flag | Description | Default |
|------|-------------|---------|
| `-ngl, --n-gpu-layers N` | Layers to offload to the GPU (`auto`, a number, or `all`) | auto |
| `-sm, --split-mode {none,layer,row}` | How to split the model across GPUs | layer |
| `-ts, --tensor-split N0,N1,...` | Fraction of the model per GPU | — |
| `-mg, --main-gpu INDEX` | Main GPU index | 0 |
| `-dev, --device <dev1,dev2,...>` | Devices to use for offloading | — |
| `--list-devices` | Print available devices and exit | — |
| `-fit, --fit [on\|off]` | Auto-adjust settings to fit in VRAM | on |
| `-fitt, --fit-target MiB` | Target margin per device for `--fit` | 1024 |
## Memory Options

| Flag | Description | Default |
|------|-------------|---------|
| `-ctk, --cache-type-k TYPE` | KV cache type for K (f32, f16, bf16, q8_0, q4_0, etc.) | f16 |
| `-ctv, --cache-type-v TYPE` | KV cache type for V | f16 |
| `--mlock` | Force the model to stay in RAM | off |
| `--mmap, --no-mmap` | Memory-map the model file | on |
| `-kvo, --kv-offload` | Enable KV cache offloading | on |
| `--no-host` | Bypass the host buffer | off |
| `-cmoe, --cpu-moe` | Keep all MoE weights in CPU memory | off |
## Layer Offloading

Run models larger than your GPU's VRAM by dynamically swapping layers between disk, CPU, and GPU using a sliding window with async prefetching.

```bash
# Auto-detect window size based on free VRAM
tinfer -m model.gguf -ngl 5 --layer-window auto -p "Hello" -n 100

# Manual window size (window 4 CPU layers at a time)
tinfer -m model.gguf -ngl 5 --layer-window 4 -p "Hello" -n 100
```

**How it works:** When `-ngl` is less than the total layer count, the remaining layers normally compute on the (slow) CPU. Layer offloading instead keeps a sliding window of N layers in GPU staging buffers, temporarily swapping each layer in for fast GPU compute and then swapping it back out.
| Tier | Location | Behavior |
|------|----------|----------|
| GPU | VRAM (permanent) | Controlled by `-ngl`; always on the GPU |
| CPU | System RAM | Windowed into GPU staging as needed |
| Disk | GGUF file | Loaded into the CPU cache on demand (LRU eviction) |
| Flag | Description | Default |
|------|-------------|---------|
| `--layer-window N` | `auto` = detect from free VRAM, or an exact number of layers to window (env: `LLAMA_ARG_LAYER_WINDOW`) | 0 (disabled) |
| `--no-layer-prefetch` | Disable async prefetching of the next window | prefetch enabled |
> **Tip:** Use `--layer-window auto` with a small `-ngl` to run models that don't fit in VRAM. The system auto-detects how many layers it can window through GPU staging.
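The sliding-window mechanism can be sketched in Python. This is a simplified model of the scheduling logic only, not tinfer's actual implementation; the function and variable names are illustrative:

```python
def forward_pass(n_layers, n_gpu_layers, window, uploads):
    """Simulate one forward pass. Layers below n_gpu_layers are
    permanently GPU-resident; the rest rotate through `window`
    GPU staging slots, evicting the oldest staged layer when full."""
    staged = []  # CPU-layer indices currently in GPU staging (FIFO)
    for i in range(n_layers):
        if i < n_gpu_layers:
            continue  # resident layer: compute directly, no transfer
        if i not in staged:
            if len(staged) == window:
                staged.pop(0)  # evict the oldest windowed layer
            staged.append(i)
            uploads.append(i)  # record a host->device upload
        # (compute layer i in its staging buffer; with prefetching,
        #  the upload of layer i+1 overlaps this compute)

uploads = []
forward_pass(n_layers=10, n_gpu_layers=5, window=4, uploads=uploads)
print(uploads)  # each CPU-resident layer uploaded once: [5, 6, 7, 8, 9]
```

With async prefetching enabled (the default), the upload of the next window overlaps the current layer's compute, hiding most of the transfer latency.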
## PagedAttention

Reduces KV cache memory fragmentation and enables efficient context shifting for multi-sequence workloads.

```bash
# Enable PagedAttention
tinfer -m model.gguf --kv-cache-paged -p "Hello" -n 100
```

**How it works:** Instead of a contiguous ring buffer, the KV cache is divided into fixed-size blocks of 32 tokens. Each sequence maps logical positions to physical blocks via a block table, much like virtual-memory paging in an OS.
**Benefits:**

- **Zero fragmentation** — blocks are allocated on demand, with no wasted gaps
- **O(1) context shift** — blocks are remapped instead of data being moved
- **Copy-on-write** — shared sequences (e.g. beam search) share blocks until written
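The block-table indirection can be illustrated with a short Python sketch. This is a hypothetical model for exposition, not the actual data structure; only the 32-token block size comes from the description above:

```python
BLOCK_SIZE = 32  # tokens per KV block

class BlockTable:
    """Maps a sequence's logical token positions to physical KV blocks,
    allocating blocks on demand so no space is wasted on gaps."""
    def __init__(self, allocate):
        self.blocks = []        # logical block index -> physical block id
        self.allocate = allocate

    def locate(self, pos):
        logical = pos // BLOCK_SIZE
        while len(self.blocks) <= logical:
            self.blocks.append(self.allocate())  # allocate on demand
        return self.blocks[logical], pos % BLOCK_SIZE  # (block, offset)

free_blocks = list(range(1024))  # pool of physical block ids
table = BlockTable(lambda: free_blocks.pop(0))
print(table.locate(0))    # (0, 0)  -> first block, offset 0
print(table.locate(33))   # (1, 1)  -> second block, offset 1
```

Because positions reach physical storage only through the table, a context shift can simply rewrite `blocks` entries instead of moving KV data, which is what makes the shift O(1).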
| Flag | Description | Default |
|------|-------------|---------|
| `--kv-cache-paged` | Enable the paged KV cache | disabled |
| `--no-kv-cache-paged` | Disable the paged KV cache | — |
## KV Cache Eviction

Intelligently removes KV cache entries when the cache is full, enabling unbounded-length generation while avoiding the quality loss caused by context shifting.

```bash
# StreamingLLM mode: keep sinks + recent tokens, evict the oldest middle
tinfer -m model.gguf --kv-eviction 1 --ctx-size 2048 -p "Hello" -n 5000

# Scored mode: evict least-recently-accessed entries (better quality)
tinfer -m model.gguf --kv-eviction 2 --ctx-size 2048 -p "Hello" -n 5000

# Protect a 128-token system prompt from eviction
tinfer -m model.gguf --kv-eviction 1 --kv-sink-tokens 8 --kv-protected-tokens 128
```

**How it works:** When the cache fills, instead of discarding the oldest half (a context shift), smart eviction selectively removes individual entries:
- **Sink tokens** — the first N positions, always kept (attention sinks)
- **Protected tokens** — system prompt or other critical prefix, preserved
- **Recent window** — the last 25% of each sequence, preserved
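The selection logic for the streaming strategy (mode 1) can be sketched as follows. This is an illustrative model with hypothetical names; only the sink/protected/recent-25% rules come from the description above:

```python
def pick_evictions(cache_positions, n_sink, n_protected, n_needed):
    """StreamingLLM-style selection: never touch sink tokens, the
    protected prefix, or the most recent 25% of the sequence; evict
    the oldest entries from the remaining middle region."""
    n = len(cache_positions)
    recent_start = n - max(1, n // 4)       # last 25% is preserved
    keep_prefix = max(n_sink, n_protected)  # sinks + protected prefix
    middle = [p for i, p in enumerate(cache_positions)
              if p >= keep_prefix and i < recent_start]
    return sorted(middle)[:n_needed]        # evict oldest-first

cache = list(range(100))  # a full cache holding positions 0..99
print(pick_evictions(cache, n_sink=4, n_protected=0, n_needed=8))
# -> [4, 5, 6, 7, 8, 9, 10, 11]
```

The scored mode (mode 2) would rank the same middle region by last-access time rather than by position, trading a little bookkeeping for better retention of frequently attended entries.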
| Mode | Flag | Strategy | Best For |
|------|------|----------|----------|
| Disabled | `--kv-eviction 0` | Falls back to context shift | Default behavior |
| Streaming | `--kv-eviction 1` | Evict the oldest middle tokens | Simple, fast |
| Scored | `--kv-eviction 2` | Evict least-recently-accessed entries | Better quality |
| Flag | Description | Default |
|------|-------------|---------|
| `--kv-eviction MODE` | Eviction mode: 0 = none, 1 = streaming, 2 = scored | 0 |
| `--kv-sink-tokens N` | Initial positions to always keep (0-256) | 4 |
| `--kv-protected-tokens N` | Positions to protect (e.g. the system prompt length) | 0 |
## RoPE / Context Extension

| Flag | Description | Default |
|------|-------------|---------|
| `--rope-scaling {none,linear,yarn}` | RoPE frequency scaling method | model default |
| `--rope-scale N` | RoPE context scaling factor | — |
| `--rope-freq-base N` | RoPE base frequency (NTK-aware) | model default |
| `--rope-freq-scale N` | RoPE frequency scaling factor | — |
| `--yarn-orig-ctx N` | YaRN original context size | 0 |
| `--yarn-ext-factor N` | YaRN extrapolation mix factor | -1.0 |
## Sampling Options

| Flag | Description | Default |
|------|-------------|---------|
| `--temp N` | Temperature | 0.8 |
| `--top-k N` | Top-K sampling (0 = disabled) | 40 |
| `--top-p N` | Top-P / nucleus sampling (1.0 = disabled) | 0.95 |
| `--min-p N` | Min-P sampling (0.0 = disabled) | 0.05 |
| `-s, --seed N` | RNG seed (-1 = random) | -1 |
| `--repeat-penalty N` | Repetition penalty (1.0 = disabled) | 1.0 |
| `--repeat-last-n N` | Tokens to consider for the penalty (0 = disabled) | 64 |
| `--presence-penalty N` | Presence penalty (0.0 = disabled) | 0.0 |
| `--frequency-penalty N` | Frequency penalty (0.0 = disabled) | 0.0 |
| `--mirostat N` | Mirostat sampling (0 = off, 1 = v1, 2 = v2) | 0 |
| `--mirostat-lr N` | Mirostat learning rate (eta) | 0.1 |
| `--mirostat-ent N` | Mirostat target entropy (tau) | 5.0 |
| `--typical N` | Locally typical sampling (1.0 = disabled) | 1.0 |
| `--dynatemp-range N` | Dynamic temperature range (0.0 = disabled) | 0.0 |
| `--samplers SAMPLERS` | Sampler order, separated by `;` | penalties;dry;top_k;... |
### DRY Sampling

| Flag | Description | Default |
|------|-------------|---------|
| `--dry-multiplier N` | DRY penalty multiplier (0.0 = disabled) | 0.0 |
| `--dry-base N` | DRY base value | 1.75 |
| `--dry-allowed-length N` | Allowed repeat length before the penalty applies | 2 |
| `--dry-penalty-last-n N` | Tokens to scan for repeats (-1 = ctx size) | -1 |
| `--dry-sequence-breaker STR` | Sequence-breaker strings | `\n`, `:`, `"`, `*` |
## Grammar / Structured Output

| Flag | Description | Default |
|------|-------------|---------|
| `--grammar GRAMMAR` | BNF-like grammar to constrain output | — |
| `--grammar-file FNAME` | Read the grammar from a file | — |
| `-j, --json-schema SCHEMA` | JSON schema constraint | — |
| `-jf, --json-schema-file FILE` | Read the JSON schema from a file | — |
## Advanced Options

| Flag | Description | Default |
|------|-------------|---------|
| `--lora FNAME` | Path to a LoRA adapter (comma-separated for multiple) | — |
| `--lora-scaled FNAME:SCALE,...` | LoRA adapter with custom scaling | — |
| `--control-vector FNAME` | Control vector file | — |
| `--override-kv KEY=TYPE:VALUE` | Override model metadata | — |
| `--check-tensors` | Validate model tensor data | off |
| `--numa {distribute,isolate,numactl}` | NUMA optimizations | — |
| `-fa, --flash-attn [on\|off\|auto]` | Flash Attention | auto |
| `--verbose-prompt` | Print the prompt before generation | off |
## Logging Options

| Flag | Description | Default |
|------|-------------|---------|
| `-v, --verbose` | Maximum verbosity (debug everything) | off |
| `-lv, --verbosity N` | Verbosity level: 0 = output, 1 = error, 2 = warn, 3 = info, 4 = debug | 3 |
| `--log-file FNAME` | Log to a file (env: `LLAMA_LOG_FILE`) | — |
| `--log-disable` | Disable logging | off |
| `--log-colors [on\|off\|auto]` | Colored log output | auto |
## Misc

| Flag | Description |
|------|-------------|
| `-h, --help` | Print usage and exit |
| `--version` | Show version and build info |
| `--license` | Show license info |
| `--completion-bash` | Print a bash completion script |
| `--offline` | Offline mode (no network access) |