Model Conversion

Convert HuggingFace models and LoRA adapters to GGUF format for use with Tinfer.

Source code: The conversion scripts are available in the conversion/ folder of this repository.


Setup

cd conversion
pip install -r requirements.txt

Dependencies: PyTorch (CPU-only), transformers, numpy, gguf, sentencepiece, safetensors, protobuf.
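A quick way to confirm the environment is ready is to probe for each dependency's Python module (an illustrative helper, not part of the conversion scripts; note that some pip package names differ from their import names, e.g. protobuf imports as google.protobuf):

```python
import importlib.util

# Import names for the dependencies listed above (pip names may differ,
# e.g. the "protobuf" package is imported as "google.protobuf").
REQUIRED = ["torch", "transformers", "numpy", "gguf",
            "sentencepiece", "safetensors", "google.protobuf"]

def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    out = []
    for name in names:
        try:
            if importlib.util.find_spec(name) is None:
                out.append(name)
        except ModuleNotFoundError:
            # Dotted names raise when the parent package is absent.
            out.append(name)
    return out

if __name__ == "__main__":
    gone = missing_modules(REQUIRED)
    if gone:
        print("Missing:", ", ".join(gone))
    else:
        print("All conversion dependencies found.")
```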


1. Convert HuggingFace Model → GGUF

Use convert_hf_to_gguf.py to convert a supported HuggingFace model (safetensors or PyTorch checkpoints) to GGUF format.

Download a Model First

pip install huggingface-hub
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='HuggingFaceTB/SmolLM-135M-Instruct', local_dir='./models/SmolLM-135M-Instruct')
"

Convert

# Convert to Q8_0 (quantized, ready to use)
python convert_hf_to_gguf.py ./models/SmolLM-135M-Instruct --outfile ./models/converted/SmolLM-135M-q8.gguf --outtype q8_0

# Convert to F16 (for further quantization)
python convert_hf_to_gguf.py ./models/SmolLM-135M-Instruct --outfile ./models/converted/SmolLM-135M-f16.gguf --outtype f16

# Convert to BFloat16
python convert_hf_to_gguf.py ./models/SmolLM-135M-Instruct --outfile ./models/converted/SmolLM-135M-bf16.gguf --outtype bf16

Convert Directly from HuggingFace (No Download)

python convert_hf_to_gguf.py --remote HuggingFaceTB/SmolLM-135M-Instruct --outfile SmolLM-135M-q8.gguf --outtype q8_0

Output Types

  • f32 — Float 32-bit. Maximum quality, largest file.
  • f16 — Float 16-bit. Good for further quantization with tinfer-quantize.
  • bf16 — BFloat 16-bit. Matches bf16 training checkpoints.
  • q8_0 — 8-bit quantized. Ready to use, good quality.
  • auto — Auto-detect. Uses the model's native precision.
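To estimate how large each output type will be, you can work from bytes per weight; this is a rough sketch of ours (it ignores metadata, tokenizer data, and any tensors kept at higher precision; the q8_0 figure follows the GGUF block layout of 32 int8 weights plus one f16 scale per block):

```python
# Approximate bytes per weight for each --outtype (illustrative only,
# not part of convert_hf_to_gguf.py).
BYTES_PER_WEIGHT = {
    "f32": 4.0,
    "f16": 2.0,
    "bf16": 2.0,
    # q8_0 packs 32 weights into 34 bytes (32 int8 values + one f16 scale).
    "q8_0": 34 / 32,
}

def estimated_size_mb(n_params: int, outtype: str) -> float:
    """Rough GGUF weight-data size in MiB for a model with n_params weights."""
    return n_params * BYTES_PER_WEIGHT[outtype] / (1024 ** 2)

# SmolLM-135M at q8_0: roughly 137 MiB of weight data.
print(round(estimated_size_mb(135_000_000, "q8_0")))
```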

Key Flags

  • --outfile <path> — Output GGUF file path.
  • --outtype <type> — Output type: f32, f16, bf16, q8_0, auto.
  • --remote <model-id> — Convert directly from HuggingFace without downloading.
  • --bigendian — Use big-endian format.
  • --vocab-only — Export only the vocabulary.
  • --model-name <name> — Override the model name in the metadata.
  • --metadata <file> — Override metadata from a JSON file.
  • --split-max-tensors <n> — Split the output into shards by tensor count.
  • --split-max-size <n> — Split the output into shards by size.
  • --dry-run — Show what would be done without writing.
  • --no-lazy — Load all tensors into RAM (uses more memory).
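The --metadata override file is plain JSON. A hypothetical example is below; the key names follow the GGUF general.* metadata conventions, but the exact set of accepted keys is defined by the conversion script's metadata loader, so verify them against the script before relying on this:

```json
{
  "general.name": "SmolLM-135M-Instruct-custom",
  "general.author": "Example Org",
  "general.version": "v1.0",
  "general.license": "apache-2.0"
}
```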

2. Convert LoRA Adapter → GGUF

Use convert_lora_to_gguf.py to convert HuggingFace PEFT LoRA adapters.

Requirements

The LoRA adapter folder must contain:

  • adapter_config.json — adapter configuration
  • adapter_model.safetensors (or adapter_model.bin) — adapter weights
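
Before converting, you can sanity-check an adapter folder with a few lines of Python. This is an illustrative check of ours, not part of convert_lora_to_gguf.py; the base_model_name_or_path key is the standard PEFT field in adapter_config.json that the converter falls back on when no base model is given:

```python
import json
from pathlib import Path

def check_adapter(folder: str) -> str:
    """Verify a PEFT LoRA folder has config + weights; return the base model id."""
    root = Path(folder)
    cfg_path = root / "adapter_config.json"
    if not cfg_path.is_file():
        raise FileNotFoundError("adapter_config.json not found")
    # Either safetensors or the legacy .bin weights file must be present.
    weights = [name for name in ("adapter_model.safetensors", "adapter_model.bin")
               if (root / name).is_file()]
    if not weights:
        raise FileNotFoundError("no adapter_model.safetensors or adapter_model.bin")
    cfg = json.loads(cfg_path.read_text())
    # PEFT records the base model here; the converter reads it when
    # --base / --base-model-id are not supplied.
    return cfg.get("base_model_name_or_path", "<unknown>")
```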

Convert

# Auto-detect base model (reads from adapter_config.json)
python convert_lora_to_gguf.py ./my-lora-adapter --outfile lora-adapter.gguf

# Specify base model directory
python convert_lora_to_gguf.py ./my-lora-adapter --base ./base-model-folder --outfile lora.gguf

# Specify base model by HuggingFace ID
python convert_lora_to_gguf.py ./my-lora-adapter --base-model-id meta-llama/Llama-3.2-3B-Instruct --outfile lora.gguf

Key Flags

  • --outfile <path> — Output GGUF file path.
  • --outtype <type> — Output type: f32, f16, bf16, q8_0, auto (default: f32).
  • --base <dir> — Directory with the base model's config files.
  • --base-model-id <id> — HuggingFace model ID for the base model config.
  • --bigendian — Use big-endian format.
  • --dry-run — Show what would be done without writing.

Use the Converted LoRA

# CLI with LoRA
tinfer -m base-model.gguf --lora lora-adapter.gguf -p "Hello!"

# Server with LoRA
tinfer-server -m base-model.gguf --lora lora-adapter.gguf --port 8080

Full Pipeline Example

# 1. Download model from HuggingFace
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='HuggingFaceTB/SmolLM-135M-Instruct', local_dir='./models/SmolLM-135M')
"

# 2. Convert to GGUF (F16 preserves full precision ahead of quantization)
python convert_hf_to_gguf.py ./models/SmolLM-135M --outfile ./models/SmolLM-135M-f16.gguf --outtype f16

# 3. Quantize to Q4_K_M (best size/quality ratio)
tinfer-quantize ./models/SmolLM-135M-f16.gguf ./models/SmolLM-135M-Q4_K_M.gguf Q4_K_M

# 4. Benchmark
tinfer-bench -m ./models/SmolLM-135M-Q4_K_M.gguf -ngl 99

# 5. Run inference
tinfer -m ./models/SmolLM-135M-Q4_K_M.gguf -p "Hello, how are you?" -n 100