# CLI Reference

The `tinfer` command provides direct text generation and interactive chat from your terminal.
## Basic Usage

```bash
# Simple prompt
tinfer -m model.gguf -p "What is AI?" -n 100

# Interactive conversation mode
tinfer -m model.gguf -cnv

# With GPU acceleration (offload all layers)
tinfer -m model.gguf -p "Hello" -ngl 99
```
## Model Options

| Flag | Description | Default |
|------|-------------|---------|
| `-m, --model FNAME` | Path to the GGUF model file | — |
| `-mu, --model-url URL` | Download model from URL | — |
| `-hf, --hf-repo <user>/<model>[:quant]` | HuggingFace model repository; the quant suffix is optional and defaults to `Q4_K_M` | — |
| `-hff, --hf-file FILE` | Specific file from the HuggingFace repo | — |
| `-hft, --hf-token TOKEN` | HuggingFace access token (env: `HF_TOKEN`) | — |
## Generation Options

| Flag | Description | Default |
|------|-------------|---------|
| `-p, --prompt TEXT` | Text prompt for generation | — |
| `-f, --file FNAME` | Read the prompt from a file | — |
| `-n, --predict N` | Number of tokens to predict (-1 = infinite) | -1 |
| `-c, --ctx-size N` | Context window size (0 = loaded from the model) | 0 |
| `-b, --batch-size N` | Logical maximum batch size | 2048 |
| `-ub, --ubatch-size N` | Physical maximum batch size | 512 |
| `-cnv, --conversation` | Enable conversation mode | off |
| `-e, --escape` | Process escape sequences (`\n`, `\t`, etc.) | on |
| `--keep N` | Tokens to keep from the initial prompt (0 = none, -1 = all) | 0 |
## CPU / Thread Options

| Flag | Description | Default |
|------|-------------|---------|
| `-t, --threads N` | CPU threads for generation (env: `LLAMA_ARG_THREADS`) | auto |
| `-tb, --threads-batch N` | Threads for batch/prompt processing | same as `-t` |
| `-C, --cpu-mask M` | CPU affinity mask (hex) | — |
| `-Cr, --cpu-range lo-hi` | CPU range for affinity | — |
| `--cpu-strict <0\|1>` | Strict CPU placement | 0 |
| `--prio N` | Process priority: -1 = low, 0 = normal, 1 = medium, 2 = high, 3 = realtime | 0 |
## GPU Options

| Flag | Description | Default |
|------|-------------|---------|
| `-ngl, --n-gpu-layers N` | Layers to offload to the GPU (`auto`, a number, or `all`) | auto |
| `-sm, --split-mode {none,layer,row}` | How to split the model across GPUs | layer |
| `-ts, --tensor-split N0,N1,...` | Fraction of the model per GPU | — |
| `-mg, --main-gpu INDEX` | Main GPU index | 0 |
| `-dev, --device <dev1,dev2,...>` | Devices to use for offloading | — |
| `--list-devices` | Print available devices and exit | — |
| `-fit, --fit [on\|off]` | Auto-adjust settings to fit in VRAM | on |
| `-fitt, --fit-target MiB` | Target margin per device for `--fit` | 1024 |
## Memory Options

| Flag | Description | Default |
|------|-------------|---------|
| `-ctk, --cache-type-k TYPE` | KV cache type for K (f32, f16, bf16, q8_0, q4_0, etc.) | f16 |
| `-ctv, --cache-type-v TYPE` | KV cache type for V | f16 |
| `--mlock` | Force the model to stay in RAM | off |
| `--mmap, --no-mmap` | Memory-map the model file | on |
| `-kvo, --kv-offload` | Enable KV cache offloading | on |
| `--no-host` | Bypass the host buffer | off |
| `-cmoe, --cpu-moe` | Keep all MoE weights in CPU memory | off |
## Layer Offloading

Run models larger than your GPU's VRAM by dynamically swapping layers between disk, CPU, and GPU using a sliding window with async prefetching.

```bash
# Auto-detect window size based on free VRAM
tinfer -m model.gguf -ngl 5 --layer-window auto -p "Hello" -n 100

# Manual window size (window 4 CPU layers at a time)
tinfer -m model.gguf -ngl 5 --layer-window 4 -p "Hello" -n 100
```

**How it works:** When `-ngl` is less than the total layer count, the remaining layers normally compute on the (slow) CPU. Layer offloading instead keeps a sliding window of N layers in GPU staging buffers, temporarily swapping each layer in for fast GPU compute and then swapping it back out.
| Tier | Location | Behavior |
|------|----------|----------|
| GPU | VRAM (permanent) | Controlled by `-ngl`; always on the GPU |
| CPU | System RAM | Windowed into GPU staging as needed |
| Disk | GGUF file | Loaded into the CPU cache on demand (LRU eviction) |
| Flag | Description | Default |
|------|-------------|---------|
| `--layer-window N` | `auto` = detect from free VRAM, or an exact number of layers to window (env: `LLAMA_ARG_LAYER_WINDOW`) | 0 (disabled) |
| `--no-layer-prefetch` | Disable async prefetching of the next window | prefetch enabled |
> **Tip:** Use `--layer-window auto` with a small `-ngl` to run models that don't fit in VRAM. The system auto-detects how many layers it can window through GPU staging.
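The sliding-window mechanism can be sketched in Python. This is a simplified model of the scheduling logic only, not tinfer's actual implementation; the function and variable names are illustrative:

```python
def forward_pass(n_layers, n_gpu_layers, window, uploads):
    """Simulate one forward pass. Layers below n_gpu_layers are
    permanently GPU-resident; the rest rotate through `window`
    GPU staging slots, evicting the oldest staged layer when full."""
    staged = []  # CPU-layer indices currently in GPU staging (FIFO)
    for i in range(n_layers):
        if i < n_gpu_layers:
            continue  # resident layer: compute directly, no transfer
        if i not in staged:
            if len(staged) == window:
                staged.pop(0)  # evict the oldest windowed layer
            staged.append(i)
            uploads.append(i)  # record a host->device upload
        # (compute layer i in its staging buffer; with prefetching,
        #  the upload of layer i+1 overlaps this compute)

uploads = []
forward_pass(n_layers=10, n_gpu_layers=5, window=4, uploads=uploads)
print(uploads)  # each CPU-resident layer uploaded once: [5, 6, 7, 8, 9]
```

With async prefetching enabled (the default), the upload of the next window overlaps the current layer's compute, hiding most of the transfer latency.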
## PagedAttention

Reduces KV cache memory fragmentation and enables efficient context shifting for multi-sequence workloads.

```bash
# Enable PagedAttention
tinfer -m model.gguf --kv-cache-paged -p "Hello" -n 100
```

**How it works:** Instead of a contiguous ring buffer, the KV cache is divided into fixed-size blocks of 32 tokens. Each sequence maps logical positions to physical blocks via a block table, much like virtual-memory paging in an OS.
**Benefits:**

- **Zero fragmentation** — blocks are allocated on demand, with no wasted gaps
- **O(1) context shift** — blocks are remapped instead of data being moved
- **Copy-on-write** — shared sequences (e.g. beam search) share blocks until written
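The block-table indirection can be illustrated with a short Python sketch. This is a hypothetical model for exposition, not the actual data structure; only the 32-token block size comes from the description above:

```python
BLOCK_SIZE = 32  # tokens per KV block

class BlockTable:
    """Maps a sequence's logical token positions to physical KV blocks,
    allocating blocks on demand so no space is wasted on gaps."""
    def __init__(self, allocate):
        self.blocks = []        # logical block index -> physical block id
        self.allocate = allocate

    def locate(self, pos):
        logical = pos // BLOCK_SIZE
        while len(self.blocks) <= logical:
            self.blocks.append(self.allocate())  # allocate on demand
        return self.blocks[logical], pos % BLOCK_SIZE  # (block, offset)

free_blocks = list(range(1024))  # pool of physical block ids
table = BlockTable(lambda: free_blocks.pop(0))
print(table.locate(0))    # (0, 0)  -> first block, offset 0
print(table.locate(33))   # (1, 1)  -> second block, offset 1
```

Because positions reach physical storage only through the table, a context shift can simply rewrite `blocks` entries instead of moving KV data, which is what makes the shift O(1).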
| Flag | Description | Default |
|------|-------------|---------|
| `--kv-cache-paged` | Enable the paged KV cache | disabled |
| `--no-kv-cache-paged` | Disable the paged KV cache | — |
## KV Cache Eviction

Intelligently removes KV cache entries when the cache is full, enabling unbounded-length generation while avoiding the quality loss caused by context shifting.

```bash
# StreamingLLM mode: keep sinks + recent tokens, evict the oldest middle
tinfer -m model.gguf --kv-eviction 1 --ctx-size 2048 -p "Hello" -n 5000

# Scored mode: evict least-recently-accessed entries (better quality)
tinfer -m model.gguf --kv-eviction 2 --ctx-size 2048 -p "Hello" -n 5000

# Protect a 128-token system prompt from eviction
tinfer -m model.gguf --kv-eviction 1 --kv-sink-tokens 8 --kv-protected-tokens 128
```

**How it works:** When the cache fills, instead of discarding the oldest half (a context shift), smart eviction selectively removes individual entries:
- **Sink tokens** — the first N positions, always kept (attention sinks)
- **Protected tokens** — system prompt or other critical prefix, preserved
- **Recent window** — the last 25% of each sequence, preserved
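The selection logic for the streaming strategy (mode 1) can be sketched as follows. This is an illustrative model with hypothetical names; only the sink/protected/recent-25% rules come from the description above:

```python
def pick_evictions(cache_positions, n_sink, n_protected, n_needed):
    """StreamingLLM-style selection: never touch sink tokens, the
    protected prefix, or the most recent 25% of the sequence; evict
    the oldest entries from the remaining middle region."""
    n = len(cache_positions)
    recent_start = n - max(1, n // 4)       # last 25% is preserved
    keep_prefix = max(n_sink, n_protected)  # sinks + protected prefix
    middle = [p for i, p in enumerate(cache_positions)
              if p >= keep_prefix and i < recent_start]
    return sorted(middle)[:n_needed]        # evict oldest-first

cache = list(range(100))  # a full cache holding positions 0..99
print(pick_evictions(cache, n_sink=4, n_protected=0, n_needed=8))
# -> [4, 5, 6, 7, 8, 9, 10, 11]
```

The scored mode (mode 2) would rank the same middle region by last-access time rather than by position, trading a little bookkeeping for better retention of frequently attended entries.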
| Mode | Flag | Strategy | Best For |
|------|------|----------|----------|
| Disabled | `--kv-eviction 0` | Falls back to context shift | Default behavior |
| Streaming | `--kv-eviction 1` | Evict the oldest middle tokens | Simple, fast |
| Scored | `--kv-eviction 2` | Evict least-recently-accessed entries | Better quality |
| Flag | Description | Default |
|------|-------------|---------|
| `--kv-eviction MODE` | Eviction mode: 0 = none, 1 = streaming, 2 = scored | 0 |
| `--kv-sink-tokens N` | Initial positions to always keep (0-256) | 4 |
| `--kv-protected-tokens N` | Positions to protect (e.g. the system prompt length) | 0 |
## RoPE / Context Extension

| Flag | Description | Default |
|------|-------------|---------|
| `--rope-scaling {none,linear,yarn}` | RoPE frequency scaling method | model default |
| `--rope-scale N` | RoPE context scaling factor | — |
| `--rope-freq-base N` | RoPE base frequency (NTK-aware) | model default |
| `--rope-freq-scale N` | RoPE frequency scaling factor | — |
| `--yarn-orig-ctx N` | YaRN original context size | 0 |
| `--yarn-ext-factor N` | YaRN extrapolation mix factor | -1.0 |
## Sampling Options

| Flag | Description | Default |
|------|-------------|---------|
| `--temp N` | Temperature | 0.8 |
| `--top-k N` | Top-K sampling (0 = disabled) | 40 |
| `--top-p N` | Top-P / nucleus sampling (1.0 = disabled) | 0.95 |
| `--min-p N` | Min-P sampling (0.0 = disabled) | 0.05 |
| `-s, --seed N` | RNG seed (-1 = random) | -1 |
| `--repeat-penalty N` | Repetition penalty (1.0 = disabled) | 1.0 |
| `--repeat-last-n N` | Tokens to consider for the penalty (0 = disabled) | 64 |
| `--presence-penalty N` | Presence penalty (0.0 = disabled) | 0.0 |
| `--frequency-penalty N` | Frequency penalty (0.0 = disabled) | 0.0 |
| `--mirostat N` | Mirostat sampling (0 = off, 1 = v1, 2 = v2) | 0 |
| `--mirostat-lr N` | Mirostat learning rate (eta) | 0.1 |
| `--mirostat-ent N` | Mirostat target entropy (tau) | 5.0 |
| `--typical N` | Locally typical sampling (1.0 = disabled) | 1.0 |
| `--dynatemp-range N` | Dynamic temperature range (0.0 = disabled) | 0.0 |
| `--samplers SAMPLERS` | Sampler order, separated by `;` | penalties;dry;top_k;... |
### DRY Sampling

| Flag | Description | Default |
|------|-------------|---------|
| `--dry-multiplier N` | DRY penalty multiplier (0.0 = disabled) | 0.0 |
| `--dry-base N` | DRY base value | 1.75 |
| `--dry-allowed-length N` | Allowed repeat length before the penalty applies | 2 |
| `--dry-penalty-last-n N` | Tokens to scan for repeats (-1 = ctx size) | -1 |
| `--dry-sequence-breaker STR` | Sequence-breaker strings | `\n`, `:`, `"`, `*` |
## Grammar / Structured Output

| Flag | Description | Default |
|------|-------------|---------|
| `--grammar GRAMMAR` | BNF-like grammar to constrain output | — |
| `--grammar-file FNAME` | Read the grammar from a file | — |
| `-j, --json-schema SCHEMA` | JSON schema constraint | — |
| `-jf, --json-schema-file FILE` | Read the JSON schema from a file | — |
## Advanced Options

| Flag | Description | Default |
|------|-------------|---------|
| `--lora FNAME` | Path to a LoRA adapter (comma-separated for multiple) | — |
| `--lora-scaled FNAME:SCALE,...` | LoRA adapter with custom scaling | — |
| `--control-vector FNAME` | Control vector file | — |
| `--override-kv KEY=TYPE:VALUE` | Override model metadata | — |
| `--check-tensors` | Validate model tensor data | off |
| `--numa {distribute,isolate,numactl}` | NUMA optimizations | — |
| `-fa, --flash-attn [on\|off\|auto]` | Flash Attention | auto |
| `--verbose-prompt` | Print the prompt before generation | off |
## Logging Options

| Flag | Description | Default |
|------|-------------|---------|
| `-v, --verbose` | Maximum verbosity (debug everything) | off |
| `-lv, --verbosity N` | Verbosity level: 0 = output, 1 = error, 2 = warn, 3 = info, 4 = debug | 3 |
| `--log-file FNAME` | Log to a file (env: `LLAMA_LOG_FILE`) | — |
| `--log-disable` | Disable logging | off |
| `--log-colors [on\|off\|auto]` | Colored log output | auto |
## Misc

| Flag | Description |
|------|-------------|
| `-h, --help` | Print usage and exit |
| `--version` | Show version and build info |
| `--license` | Show license info |
| `--completion-bash` | Print a bash completion script |
| `--offline` | Offline mode (no network access) |