Benchmarking

Measure inference speed (tokens/second) for prompt processing and text generation using tinfer-bench.


Quick Start

# Basic benchmark (uses GPU by default)
tinfer-bench -m model.gguf

# Benchmark with all layers on GPU
tinfer-bench -m model.gguf -ngl 99

# Custom prompt/generation lengths
tinfer-bench -m model.gguf -p 512 -n 128 -ngl 99

What It Measures

| Metric | Description |
| ------ | ----------- |
| pp (prompt processing) | Speed of processing the input prompt (tokens/sec) |
| tg (text generation) | Speed of generating output tokens (tokens/sec) |

Higher values mean faster inference. The output reports the model name, size, parameter count, backend, and throughput with its standard deviation across repetitions.
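To make the reported numbers concrete, here is a minimal sketch (not tinfer-bench's actual implementation) of how a tokens/sec figure with a standard deviation can be derived from repeated timed runs. The run times below are made-up illustrative values.

```python
# Sketch: deriving a "t/s mean ± stddev" figure from repeated runs.
# The run times are illustrative, not real measurements.
from statistics import mean, stdev

n_tokens = 128                               # tokens generated per run (tg128)
run_times = [1.70, 1.82, 1.75, 1.79, 1.73]   # seconds, one per repetition

rates = [n_tokens / t for t in run_times]    # tokens/sec for each run
print(f"tg128: {mean(rates):.2f} ± {stdev(rates):.2f} t/s")
```

The number of repetitions corresponds to the -r flag (default 5).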

Example Output

| model           |    size |   params | backend | ngl | test  |           t/s |
| --------------- | ------: | -------: | ------- | --: | ----- | ------------: |
| llama 256M Q8_0 | 136 MiB | 134.52 M | CUDA    |  99 | tg128 | 72.41 ± 11.67 |

All Flags

Model & Input

| Flag | Description | Default |
| ---- | ----------- | ------- |
| -m, --model <file> | Model file path (.gguf) | models/7B/ggml-model-q4_0.gguf |
| -p, --n-prompt <n> | Number of prompt tokens to process | 512 |
| -n, --n-gen <n> | Number of tokens to generate | 128 |
| -pg <pp,tg> | Combined prompt + generation test (e.g. -pg 512,128) | |
| -d, --n-depth <n> | Context depth for the test | 0 |

Batch Processing

| Flag | Description | Default |
| ---- | ----------- | ------- |
| -b, --batch-size <n> | Batch size for prompt processing | 2048 |
| -ub, --ubatch-size <n> | Micro-batch size | 512 |

GPU & Device

| Flag | Description | Default |
| ---- | ----------- | ------- |
| -ngl, --n-gpu-layers <n> | Number of layers to offload to GPU | 99 |
| -mg, --main-gpu <i> | Main GPU index | 0 |
| -sm, --split-mode <mode> | Multi-GPU split: none, layer, row | layer |
| -dev, --device <dev0/dev1> | Specific devices to use | auto |
| -ts, --tensor-split <t0/t1> | Tensor split ratios across GPUs | 0 |
| --list-devices | List available devices and exit | |
| -nkvo, --no-kv-offload <0\|1> | Disable KV cache offloading to GPU | 0 |
| -fa, --flash-attn <0\|1> | Enable flash attention | 0 |

CPU & Threading

| Flag | Description | Default |
| ---- | ----------- | ------- |
| -t, --threads <n> | Number of CPU threads | auto-detect |
| -C, --cpu-mask <hex> | CPU affinity mask | 0x0 |
| --cpu-strict <0\|1> | Strict CPU affinity | 0 |
| --poll <0...100> | Poll percentage | 50 |
| -ncmoe, --n-cpu-moe <n> | CPU MoE expert count | 0 |
| --numa <mode> | NUMA mode: distribute, isolate, numactl | disabled |

KV Cache

| Flag | Description | Default |
| ---- | ----------- | ------- |
| -ctk, --cache-type-k <type> | KV cache key type: f16, bf16, q8_0, q4_0, etc. | f16 |
| -ctv, --cache-type-v <type> | KV cache value type | f16 |
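Quantized cache types trade a little accuracy for memory. As a rough illustration, the standard transformer KV cache size estimate (two tensors, K and V, per layer; this formula is general background, not something tinfer-bench documents) shows why dropping from f16 to an ~1-byte type roughly halves cache memory. The model dimensions below are a hypothetical 7B-class configuration.

```python
# Rough KV cache size estimate: 2 tensors (K and V) x layers x context
# length x KV dimension x bytes per element. Dimensions are hypothetical.
def kv_cache_bytes(n_layers, n_ctx, n_kv_dim, bytes_per_elt):
    return 2 * n_layers * n_ctx * n_kv_dim * bytes_per_elt

# Hypothetical 7B-class model: 32 layers, 4096-token context, 4096 KV dim.
f16_size = kv_cache_bytes(32, 4096, 4096, 2)   # -ctk f16 -ctv f16
q8_size  = kv_cache_bytes(32, 4096, 4096, 1)   # ~1 byte/element for q8_0

print(f"f16: {f16_size / 2**30:.1f} GiB, q8_0: ~{q8_size / 2**30:.1f} GiB")
```

Note this ignores quantization block overhead, so the q8_0 figure is a lower bound.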

Output & Control

| Flag | Description | Default |
| ---- | ----------- | ------- |
| -r, --repetitions <n> | Number of times to repeat each test | 5 |
| -o, --output <format> | Output format: csv, json, jsonl, md, sql | md |
| -oe, --output-err <format> | Stderr output format | none |
| -v, --verbose | Verbose output | off |
| --progress | Show progress indicators | off |
| --no-warmup | Skip warmup runs | off |
| --prio <-1\|0\|1\|2\|3> | Process priority | 0 |
| --delay <seconds> | Delay between tests | 0 |

Memory

| Flag | Description | Default |
| ---- | ----------- | ------- |
| -mmp, --mmap <0\|1> | Use memory-mapped I/O | 0 |
| -dio, --direct-io <0\|1> | Use direct I/O | 0 |
| -embd, --embeddings <0\|1> | Benchmark embedding mode | 0 |

Multi-Value Testing

You can test multiple configurations in a single run by using comma-separated values:

# Compare GPU vs CPU
tinfer-bench -m model.gguf -ngl 0,99

# Test multiple prompt lengths
tinfer-bench -m model.gguf -p 128,256,512,1024 -ngl 99

# Compare multiple models
tinfer-bench -m model-q4.gguf,model-q8.gguf -ngl 99

# Range syntax: first-last+step
tinfer-bench -m model.gguf -p 128-1024+128 -ngl 99
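The range form above should expand to the same values as the comma-separated form. A small sketch of the expansion (assuming the endpoint is inclusive, which matches the comma-separated equivalent 128,256,...,1024):

```python
# Sketch of expanding the first-last+step range syntax into explicit
# values; assumes the endpoint is inclusive.
def expand_range(spec):
    body, step = spec.split("+")
    first, last = body.split("-")
    return list(range(int(first), int(last) + 1, int(step)))

print(expand_range("128-1024+128"))
# 128, 256, 384, 512, 640, 768, 896, 1024
```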

Export Results

# Export as CSV
tinfer-bench -m model.gguf -ngl 99 -o csv > results.csv

# Export as JSON
tinfer-bench -m model.gguf -ngl 99 -o json > results.json

# Export as SQL
tinfer-bench -m model.gguf -ngl 99 -o sql > results.sql
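The JSON export is convenient for downstream analysis. A minimal sketch of reading it back — note the field names here ("model", "test", "avg_ts") are assumptions about the schema, not documented above; inspect your own results.json to confirm them:

```python
# Sketch: loading a JSON export for analysis. Field names are assumed,
# not taken from the documentation above.
import json

# Stand-in for open("results.json").read(); values are illustrative.
sample = """[
  {"model": "model.gguf", "test": "pp512", "avg_ts": 1843.2},
  {"model": "model.gguf", "test": "tg128", "avg_ts": 72.4}
]"""

for row in json.loads(sample):
    print(f'{row["model"]:12s} {row["test"]:6s} {row["avg_ts"]:8.1f} t/s')
```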