GGUF (GPT-Generated Unified Format) is a binary file format for storing and distributing large language models, designed specifically for efficient local inference.

Background

GGUF was introduced by the llama.cpp project in 2023 as a replacement for the older GGML format. The name reflects its origins, but the format is now used across many model families beyond GPT.

Key Characteristics

Self-contained: A single .gguf file bundles everything needed to run a model (weights, tokenizer vocabulary, metadata, and architecture config), so no separate config files are needed.

Quantization-friendly: GGUF is the go-to format for quantized models. Common quantization levels include Q4_K_M, Q5_K_M, and Q8_0. Quantization reduces model size and memory requirements by lowering numerical precision (e.g., from 16-bit or 32-bit floats to roughly 4-bit integers), at a cost in output quality that grows as precision drops.
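
To make the size reduction concrete, here is a back-of-the-envelope sketch in Python. It assumes that storage is simply parameters times bits per weight; the helper name estimated_size_gb is made up for this example, and the 8-billion-parameter figure is only an illustration.

```python
def estimated_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough weight-storage size in gigabytes (10**9 bytes).

    Assumption: size = parameters * bits / 8. Real quantized files also
    store per-block scale factors and metadata, so they come out larger.
    """
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 8e9  # illustrative 8B-parameter model
    for label, bits in [("32-bit", 32), ("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{label:>6}: about {estimated_size_gb(n_params, bits):.1f} GB")
```

Actual files are somewhat larger than this nominal estimate because block-wise schemes store extra scale data. Q8_0, for instance, keeps one 16-bit scale per block of 32 weights, which works out to about 8.5 bits per weight.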

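CPU + GPU inference: GGUF models can run efficiently on CPU, with optional GPU offloading for as many layers as fit in VRAM. By contrast, safetensors checkpoints are typically loaded by GPU-oriented frameworks, although safetensors itself is a general tensor storage format rather than a GPU-specific one.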
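
The idea of offloading "layers that fit in VRAM" can be sketched as simple arithmetic. The function below is a hypothetical helper, not part of any runtime, and it assumes all layers are the same size while ignoring the KV cache and other buffers that also consume VRAM. In llama.cpp the resulting count would be passed through the --n-gpu-layers option.

```python
def layers_that_fit(n_layers: int, layer_bytes: int, vram_budget_bytes: int) -> int:
    """Number of equally sized layers that fit in a VRAM budget.

    Hypothetical planning helper: real runtimes must also reserve VRAM
    for the KV cache and scratch buffers, which this sketch ignores.
    """
    if layer_bytes <= 0:
        raise ValueError("layer_bytes must be positive")
    fit = vram_budget_bytes // layer_bytes
    # Clamp to the valid range [0, n_layers].
    return max(0, min(n_layers, fit))


if __name__ == "__main__":
    # Illustrative numbers only: 32 layers of 130 MB each, 2 GB of free VRAM.
    n = layers_that_fit(n_layers=32, layer_bytes=130_000_000,
                        vram_budget_bytes=2_000_000_000)
    print(f"offload {n} of 32 layers to the GPU, run the rest on CPU")
```

Any layers that do not fit stay on the CPU, so a model larger than the available VRAM can still run, only more slowly.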

Metadata-rich: The format includes a structured key-value metadata section describing the architecture, context length, RoPE scaling, and more, so runtimes can load models correctly without external config.
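
As a rough illustration of this layout, the Python sketch below reads the fixed-size header at the start of a GGUF file. It assumes a little-endian file of version 2 or later, where the 4-byte magic GGUF is followed by a 32-bit version, a 64-bit tensor count, and a 64-bit count of metadata key-value pairs. The function name read_gguf_header is invented for this example, and the sketch stops before parsing the key-value pairs themselves.

```python
import io
import struct
from typing import BinaryIO


def read_gguf_header(f: BinaryIO) -> dict:
    """Read the fixed GGUF header fields from a binary stream.

    Assumes little-endian byte order and format version >= 2
    (version 1 used narrower count fields and is rejected here).
    """
    magic = f.read(4)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    (version,) = struct.unpack("<I", f.read(4))
    if version < 2:
        raise ValueError(f"unsupported GGUF version {version}")
    tensor_count, metadata_kv_count = struct.unpack("<QQ", f.read(16))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


if __name__ == "__main__":
    # Synthetic header for demonstration; the counts are made-up values.
    fake = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
    print(read_gguf_header(io.BytesIO(fake)))
```

For a real model, the same function can be called on a file opened with open(path, "rb"). The metadata key-value pairs follow immediately after these header fields, ahead of the tensor data.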

Why It Matters

It’s the dominant format for running open-weight models locally (Llama, Mistral, Phi, Gemma, etc.) using tools like:

  • llama.cpp — the reference runtime
  • Ollama — wraps llama.cpp for a Docker-like local model experience
  • LM Studio — GUI for running GGUF models
  • Jan — another local inference UI

Typical Filename Anatomy

A filename like Meta-Llama-3-8B-Instruct.Q4_K_M.gguf tells you:

  • Model family and size: Llama 3, 8B parameters
  • Variant: Instruct-tuned
  • Quantization: Q4_K_M (4-bit K-quant; the M marks the medium variant among the small, medium, and large mixes, which trade file size against quality)
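
The breakdown above can be sketched as a small Python parser. The naming scheme is a community convention rather than something the format enforces, so the regular expression and the function name split_gguf_filename are assumptions made for this example and will not cover every file in the wild.

```python
import re

# Assumed convention: <model name>, a "." or "-" separator, a quantization
# tag such as Q4_K_M, Q8_0, IQ4_XS, F16, BF16, or F32, then ".gguf".
_PATTERN = re.compile(
    r"^(?P<model>.+?)[.-](?P<quant>I?Q\d\w*|BF16|F16|F32)\.gguf$",
    re.IGNORECASE,
)


def split_gguf_filename(filename: str) -> dict:
    """Split a GGUF filename into model name and quantization tag.

    Returns quant=None when no recognizable tag is present.
    """
    match = _PATTERN.match(filename)
    if match is None:
        stem = filename[:-5] if filename.lower().endswith(".gguf") else filename
        return {"model": stem, "quant": None}
    return {"model": match.group("model"), "quant": match.group("quant").upper()}


if __name__ == "__main__":
    print(split_gguf_filename("Meta-Llama-3-8B-Instruct.Q4_K_M.gguf"))
```

Because filenames are only a hint, a tool that needs the real architecture or quantization type should read it from the metadata stored in the file instead.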

It’s essentially the standard packaging format for the local LLM ecosystem.