# Benchmarking

Measure inference speed (tokens/second) for prompt processing and text generation using `tinfer-bench`.
## Quick Start

```sh
# Basic benchmark (uses the GPU by default)
tinfer-bench -m model.gguf

# Benchmark with all layers on the GPU
tinfer-bench -m model.gguf -ngl 99

# Custom prompt/generation lengths
tinfer-bench -m model.gguf -p 512 -n 128 -ngl 99
```
## What It Measures

| Metric | Description |
|--------|-------------|
| `pp` (prompt processing) | Speed of processing the input prompt (tokens/sec) |
| `tg` (text generation) | Speed of generating output tokens (tokens/sec) |

Higher values are faster. The output shows the model name, size, backend, and throughput with its standard deviation.
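To make the "throughput with standard deviation" figure concrete, here is a minimal sketch of how a mean ± stddev line can be reproduced from per-repetition throughput samples. The sample numbers are made up for illustration; they are not real benchmark output.

```python
import statistics

# Hypothetical per-repetition throughput samples (tokens/sec) for one tg128
# test; each test is repeated several times (see -r, default 5) and the
# report shows the mean and the sample standard deviation across repetitions.
samples = [68.2, 75.9, 71.4, 83.1, 63.5]

mean = statistics.mean(samples)
stdev = statistics.stdev(samples)  # sample standard deviation (n - 1)
print(f"tg128: {mean:.2f} ± {stdev:.2f} t/s")
```

A large stddev relative to the mean usually indicates interference (thermal throttling, background load), which is worth ruling out before comparing configurations.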
## Example Output

| model | size | params | backend | ngl | test | t/s |
|-------|------|--------|---------|-----|------|-----|
| llama 256M Q8_0 | 136 MiB | 134.52 M | CUDA | 99 | tg128 | 72.41 ± 11.67 |
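Since the default output format is markdown (see `-o` below), result rows like the one above are easy to post-process. A small sketch of splitting such a row into fields, assuming the seven-column layout shown in the example:

```python
# One data row from the default markdown output, as shown in the example above.
line = "| llama 256M Q8_0 | 136 MiB | 134.52 M | CUDA | 99 | tg128 | 72.41 ± 11.67 |"

# Drop the outer pipes, split on the inner ones, and trim whitespace.
cells = [c.strip() for c in line.strip("|").split("|")]
model, size, params, backend, ngl, test, ts = cells

# The t/s cell is "mean ± stddev"; keep the mean for plotting/comparison.
mean_ts = float(ts.split("±")[0])
print(model, test, mean_ts)  # llama 256M Q8_0 tg128 72.41
```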
## All Flags

| Flag | Description | Default |
|------|-------------|---------|
| `-m, --model <file>` | Model file path (`.gguf`) | `models/7B/ggml-model-q4_0.gguf` |
| `-p, --n-prompt <n>` | Number of prompt tokens to process | 512 |
| `-n, --n-gen <n>` | Number of tokens to generate | 128 |
| `-pg <pp,tg>` | Combined prompt + generation test (e.g. `-pg 512,128`) | — |
| `-d, --n-depth <n>` | Context depth for the test | 0 |
### Batch Processing

| Flag | Description | Default |
|------|-------------|---------|
| `-b, --batch-size <n>` | Batch size for prompt processing | 2048 |
| `-ub, --ubatch-size <n>` | Micro-batch size | 512 |
### GPU & Device

| Flag | Description | Default |
|------|-------------|---------|
| `-ngl, --n-gpu-layers <n>` | Number of layers to offload to the GPU | 99 |
| `-mg, --main-gpu <i>` | Main GPU index | 0 |
| `-sm, --split-mode <mode>` | Multi-GPU split mode: `none`, `layer`, `row` | `layer` |
| `-dev, --device <dev0/dev1>` | Specific devices to use | `auto` |
| `-ts, --tensor-split <t0/t1>` | Tensor split ratios across GPUs | 0 |
| `--list-devices` | List available devices and exit | — |
| `-nkvo, --no-kv-offload <0\|1>` | Disable KV cache offloading to the GPU | 0 |
| `-fa, --flash-attn <0\|1>` | Enable flash attention | 0 |
### CPU & Threading

| Flag | Description | Default |
|------|-------------|---------|
| `-t, --threads <n>` | Number of CPU threads | auto-detect |
| `-C, --cpu-mask <hex>` | CPU affinity mask | `0x0` |
| `--cpu-strict <0\|1>` | Strict CPU affinity | 0 |
| `--poll <0...100>` | Poll percentage | 50 |
| `-ncmoe, --n-cpu-moe <n>` | CPU MoE expert count | 0 |
| `--numa <mode>` | NUMA mode: `distribute`, `isolate`, `numactl` | disabled |
### KV Cache

| Flag | Description | Default |
|------|-------------|---------|
| `-ctk, --cache-type-k <type>` | KV cache key type: `f16`, `bf16`, `q8_0`, `q4_0`, etc. | `f16` |
| `-ctv, --cache-type-v <type>` | KV cache value type | `f16` |
### Output & Control

| Flag | Description | Default |
|------|-------------|---------|
| `-r, --repetitions <n>` | Number of times to repeat each test | 5 |
| `-o, --output <format>` | Output format: `csv`, `json`, `jsonl`, `md`, `sql` | `md` |
| `-oe, --output-err <format>` | Stderr output format | none |
| `-v, --verbose` | Verbose output | off |
| `--progress` | Show progress indicators | off |
| `--no-warmup` | Skip warmup runs | off |
| `--prio <-1\|0\|1\|2\|3>` | Process priority | 0 |
| `--delay <seconds>` | Delay between tests (seconds) | 0 |
### Memory

| Flag | Description | Default |
|------|-------------|---------|
| `-mmp, --mmap <0\|1>` | Use memory-mapped I/O | 0 |
| `-dio, --direct-io <0\|1>` | Use direct I/O | 0 |
| `-embd, --embeddings <0\|1>` | Benchmark embedding mode | 0 |
## Multi-Value Testing

You can test multiple configurations in a single run by passing comma-separated values:

```sh
# Compare GPU vs CPU
tinfer-bench -m model.gguf -ngl 0,99

# Test multiple prompt lengths
tinfer-bench -m model.gguf -p 128,256,512,1024 -ngl 99

# Compare multiple models
tinfer-bench -m model-q4.gguf,model-q8.gguf -ngl 99

# Range syntax: first-last+step
tinfer-bench -m model.gguf -p 128-1024+128 -ngl 99
```
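To see what the range syntax expands to, here is a small sketch that interprets a `first-last+step` spec the way the documentation describes it. This is an illustration of the documented syntax, not `tinfer-bench`'s actual parser (in particular, the endpoint-inclusion behavior shown here is an assumption):

```python
def expand_range(spec: str) -> list[int]:
    """Expand a 'first-last+step' spec, e.g. '128-1024+128'.

    Illustrative only: mirrors the documented syntax, assuming the
    last value is included when the step lands on it exactly.
    """
    bounds, step = spec.split("+")
    first, last = bounds.split("-")
    return list(range(int(first), int(last) + 1, int(step)))

# -p 128-1024+128 would test prompt lengths 128, 256, ..., 1024:
print(expand_range("128-1024+128"))
```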
## Export Results

```sh
# Export as CSV
tinfer-bench -m model.gguf -ngl 99 -o csv > results.csv

# Export as JSON
tinfer-bench -m model.gguf -ngl 99 -o json > results.json

# Export as SQL
tinfer-bench -m model.gguf -ngl 99 -o sql > results.sql
```
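Exported CSV files can be post-processed with standard tooling. Below is a minimal sketch of reading such a file with Python's `csv` module; the column names (`model`, `n_prompt`, `n_gen`, `avg_ts`) and row shape are assumptions for illustration, so check the header of your own `results.csv` before adapting it:

```python
import csv
import io

# Hypothetical excerpt of a CSV export; real column names may differ by
# tinfer-bench version, so inspect your results.csv header first.
sample = """\
model,n_prompt,n_gen,avg_ts
model-q4.gguf,512,0,412.70
model-q4.gguf,0,128,72.41
"""

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    # Label each row pp<n> or tg<n>, mirroring the test names in the output.
    test = f"pp{row['n_prompt']}" if int(row["n_gen"]) == 0 else f"tg{row['n_gen']}"
    print(f"{row['model']}: {test} = {row['avg_ts']} t/s")
```

The same approach works for the JSON and JSONL formats via the `json` module, which is usually the more convenient choice for scripted comparisons across runs.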