# Model Conversion
Convert HuggingFace models and LoRA adapters to GGUF format for use with Tinfer.
Source code: the conversion scripts live in the `conversion/` folder of this repository.
## Setup
Dependencies: PyTorch (CPU-only), transformers, numpy, gguf, sentencepiece, safetensors, protobuf.
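Assuming a pip-based environment, the dependencies above can be installed as follows (the dedicated index URL pulls the CPU-only PyTorch wheels; adjust to your environment):

```bash
# CPU-only PyTorch from the official CPU wheel index
pip install torch --index-url https://download.pytorch.org/whl/cpu

# Remaining conversion dependencies
pip install transformers numpy gguf sentencepiece safetensors protobuf
```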
## 1. Convert HuggingFace Model → GGUF

Use `convert_hf_to_gguf.py` to convert any HuggingFace model (safetensors or PyTorch checkpoints) to GGUF format.
### Download a Model First

```bash
pip install huggingface-hub

python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='HuggingFaceTB/SmolLM-135M-Instruct', local_dir='./models/SmolLM-135M-Instruct')
"
```
### Convert

```bash
# Convert to Q8_0 (ready to run, good quality)
python convert_hf_to_gguf.py ./models/SmolLM-135M-Instruct --outfile ./models/converted/SmolLM-135M-q8.gguf --outtype q8_0

# Convert to F16 (for further quantization)
python convert_hf_to_gguf.py ./models/SmolLM-135M-Instruct --outfile ./models/converted/SmolLM-135M-f16.gguf --outtype f16

# Convert to BFloat16
python convert_hf_to_gguf.py ./models/SmolLM-135M-Instruct --outfile ./models/converted/SmolLM-135M-bf16.gguf --outtype bf16
```
### Convert Directly from HuggingFace (No Download)

```bash
python convert_hf_to_gguf.py --remote HuggingFaceTB/SmolLM-135M-Instruct --outfile SmolLM-135M-q8.gguf --outtype q8_0
```
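To sanity-check a converted file without loading it, you can read the fixed GGUF header with stdlib Python. This is a minimal sketch based on the published GGUF layout (magic bytes `GGUF`, then little-endian version, tensor count, and metadata key/value count), not part of the conversion scripts:

```python
import struct

def read_gguf_header(path):
    """Parse the fixed-size GGUF header: magic, version, tensor count, metadata KV count."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic bytes: {magic!r})")
        # <IQQ: uint32 version, uint64 tensor count, uint64 metadata KV count
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}
```

A header that fails this check usually means a truncated download or an interrupted conversion.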
### Output Types

| Type | Description | When to Use |
|---|---|---|
| `f32` | Float 32-bit | Maximum quality, largest file |
| `f16` | Float 16-bit | Good for further quantization with `tinfer-quantize` |
| `bf16` | BFloat 16-bit | Training checkpoints |
| `q8_0` | 8-bit quantized | Ready to use, good quality |
| `auto` | Auto-detect | Uses the model's native precision |
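The size trade-off between these types is easy to estimate from bytes per weight. A rough sketch (tensor data only; real files also carry metadata and vocabulary, and the Q8_0 figure assumes the common 32-weight block of 32 int8 values plus one f16 scale):

```python
# Approximate bytes per weight for each --outtype (tensor data only).
# q8_0 packs 32 weights into a 34-byte block: 32 int8 values + one f16 scale.
BYTES_PER_WEIGHT = {"f32": 4.0, "f16": 2.0, "bf16": 2.0, "q8_0": 34 / 32}

def estimate_tensor_data_mb(n_params: int, outtype: str) -> float:
    """Rough size of the tensor data in MB; real files are slightly larger."""
    return n_params * BYTES_PER_WEIGHT[outtype] / 1e6
```

For a 135M-parameter model this gives roughly 540 MB at `f32`, 270 MB at `f16`/`bf16`, and about 143 MB at `q8_0`.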
### Key Flags

| Flag | Description |
|---|---|
| `--outfile <path>` | Output GGUF file path |
| `--outtype <type>` | Output type: `f32`, `f16`, `bf16`, `q8_0`, `auto` |
| `--remote <model-id>` | Convert directly from HuggingFace without downloading |
| `--bigendian` | Use big-endian format |
| `--vocab-only` | Export only the vocabulary |
| `--model-name <name>` | Override the model name in metadata |
| `--metadata <file>` | Override metadata from a JSON file |
| `--split-max-tensors <n>` | Split the output into shards by tensor count |
| `--split-max-size <n>` | Split the output into shards by size |
| `--dry-run` | Show what would be done without writing |
| `--no-lazy` | Load all tensors into RAM (uses more memory) |
## 2. Convert LoRA Adapter → GGUF

Use `convert_lora_to_gguf.py` to convert HuggingFace PEFT LoRA adapters.
### Requirements

The LoRA adapter folder must contain:

- `adapter_config.json` — adapter configuration
- `adapter_model.safetensors` (or `adapter_model.bin`) — adapter weights
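A quick pre-flight check for these requirements can be done with stdlib Python. This is an illustrative helper, not part of the conversion scripts; it assumes the standard PEFT layout, where `adapter_config.json` carries a `base_model_name_or_path` field used for base-model auto-detection:

```python
import json
import os

def check_lora_dir(path):
    """Verify a PEFT adapter folder has the files the converter expects."""
    cfg_path = os.path.join(path, "adapter_config.json")
    if not os.path.isfile(cfg_path):
        raise FileNotFoundError("adapter_config.json is missing")
    weights = [name for name in ("adapter_model.safetensors", "adapter_model.bin")
               if os.path.isfile(os.path.join(path, name))]
    if not weights:
        raise FileNotFoundError("adapter_model.safetensors / adapter_model.bin is missing")
    with open(cfg_path) as f:
        cfg = json.load(f)
    # The base model recorded by PEFT; used when no --base/--base-model-id is given
    return cfg.get("base_model_name_or_path"), weights[0]
```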
### Convert

```bash
# Auto-detect base model (reads from adapter_config.json)
python convert_lora_to_gguf.py ./my-lora-adapter --outfile lora-adapter.gguf

# Specify base model directory
python convert_lora_to_gguf.py ./my-lora-adapter --base ./base-model-folder --outfile lora.gguf

# Specify base model by HuggingFace ID
python convert_lora_to_gguf.py ./my-lora-adapter --base-model-id meta-llama/Llama-3.2-3B-Instruct --outfile lora.gguf
```
### Key Flags

| Flag | Description |
|---|---|
| `--outfile <path>` | Output GGUF file path |
| `--outtype <type>` | Output type: `f32`, `f16`, `bf16`, `q8_0`, `auto` (default: `f32`) |
| `--base <dir>` | Directory with base model config files |
| `--base-model-id <id>` | HuggingFace model ID for base model config |
| `--bigendian` | Use big-endian format |
| `--dry-run` | Show what would be done without writing |
### Use the Converted LoRA

```bash
# CLI with LoRA
tinfer -m base-model.gguf --lora lora-adapter.gguf -p "Hello!"

# Server with LoRA
tinfer-server -m base-model.gguf --lora lora-adapter.gguf --port 8080
```
## Full Pipeline Example

```bash
# 1. Download model from HuggingFace
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='HuggingFaceTB/SmolLM-135M-Instruct', local_dir='./models/SmolLM-135M')
"

# 2. Convert to GGUF (F16 as a high-precision intermediate)
python convert_hf_to_gguf.py ./models/SmolLM-135M --outfile ./models/SmolLM-135M-f16.gguf --outtype f16

# 3. Quantize to Q4_K_M (best size/quality ratio)
tinfer-quantize ./models/SmolLM-135M-f16.gguf ./models/SmolLM-135M-Q4_K_M.gguf Q4_K_M

# 4. Benchmark
tinfer-bench -m ./models/SmolLM-135M-Q4_K_M.gguf -ngl 99

# 5. Run inference
tinfer -m ./models/SmolLM-135M-Q4_K_M.gguf -p "Hello, how are you?" -n 100
```