LoRA stands for Low-Rank Adaptation. It’s a technique for efficiently fine-tuning large AI models (like language models or image generators) without retraining all of their parameters.
The core idea: instead of updating all of the millions or billions of weights in a pretrained model, LoRA freezes the original weights and injects small, trainable “adapter” matrices alongside them. The approach rests on the observation that the change needed to adapt a model to a new task tends to be “low-rank”, meaning it can be approximated by the product of two much smaller matrices rather than one full-size matrix. Only those small matrices are trained.
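To make the idea concrete, here is a minimal sketch of a single LoRA-adapted linear layer in plain Python. The helper names (`matmul`, `rand_matrix`, `lora_forward`) and the tiny dimensions are illustrative assumptions, not part of any library. The `alpha / r` scaling and the choice to start one adapter matrix at zero follow the convention described in the LoRA paper, and real implementations may differ in detail.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix (illustrative initialisation)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

def lora_forward(x, W, A, B, alpha, r):
    """Return x @ W + (alpha / r) * (x @ A) @ B for a batch of row vectors x."""
    base = matmul(x, W)                  # frozen pretrained path
    update = matmul(matmul(x, A), B)     # trainable low-rank path
    scale = alpha / r
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

rng = random.Random(0)
d_in, d_out, r, alpha = 8, 6, 2, 4       # hypothetical sizes; r is the adapter rank

W = rand_matrix(d_in, d_out, rng)        # pretrained weight: frozen, never updated
A = rand_matrix(d_in, r, rng)            # trainable down-projection, random init
B = zeros(r, d_out)                      # trainable up-projection, zero init
x = rand_matrix(3, d_in, rng)            # a batch of 3 inputs

y = lora_forward(x, W, A, B, alpha, r)
```

Because `B` starts at zero, the adapter initially contributes nothing and the layer behaves exactly like the pretrained one. Training would then update only `A` and `B`.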
Why it matters in practice
- Efficiency: you train and store far fewer parameters (often less than 1% of the model), which sharply cuts the memory needed for gradients and optimizer state. For many models, fine-tuning becomes feasible on a single consumer GPU.
- Portability: the resulting adapter is a small file (typically megabytes rather than gigabytes). You can keep the base model fixed and swap different LoRA adapters in and out for different tasks or styles.
- No inference penalty: the adapter can be merged back into the original weights, so the merged model runs at the same speed as the base model.
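Two of these points can be checked with a little arithmetic. The sketch below is illustrative only: the 4096 x 4096 layer size and rank 8 are hypothetical values, and `lora_param_counts` and `merge_lora` are made-up helper names. It counts trainable parameters for one weight matrix, then confirms that folding the adapter into the base weights gives the same output as running the two paths separately.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def lora_param_counts(d_in, d_out, r):
    """Parameters in the full matrix vs. in the two adapter matrices."""
    full = d_in * d_out
    lora = r * (d_in + d_out)
    return full, lora

# Hypothetical layer: one 4096 x 4096 weight matrix adapted at rank 8.
full, lora = lora_param_counts(4096, 4096, 8)
fraction = lora / full   # 65,536 / 16,777,216, i.e. about 0.39% of the matrix

def merge_lora(W, A, B, alpha, r):
    """Fold the adapter into the base weights: W + (alpha / r) * (A @ B)."""
    scale = alpha / r
    delta = matmul(A, B)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

rng = random.Random(0)
d_in, d_out, r, alpha = 6, 5, 2, 4
W = rand_matrix(d_in, d_out, rng)
A = rand_matrix(d_in, r, rng)
B = rand_matrix(r, d_out, rng)           # non-zero, standing in for a trained adapter
x = rand_matrix(3, d_in, rng)

# Unmerged: base path plus scaled low-rank path (two extra matmuls per call).
scale = alpha / r
base = matmul(x, W)
low_rank = matmul(matmul(x, A), B)
unmerged = [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, low_rank)]

# Merged: a single matmul against the combined weights.
merged = matmul(x, merge_lora(W, A, B, alpha, r))

max_diff = max(abs(u - m)
               for u_row, m_row in zip(unmerged, merged)
               for u, m in zip(u_row, m_row))
```

The two outputs agree up to floating-point rounding, which is why a merged adapter adds no work at inference time. The trade-off is that a merged model can no longer swap adapters without keeping a copy of the original weights.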
Common contexts
You’ll most often hear about LoRA in two contexts: fine-tuning large language models on custom data or tasks, and image generation (e.g. Stable Diffusion), where people share LoRAs that teach a model a specific character, art style, or subject.
QLoRA variant
There’s also a popular variant called QLoRA, which combines LoRA with quantization (storing the frozen base model at lower precision, typically 4-bit) to make fine-tuning even more memory-efficient.
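As a rough illustration of what "lower precision" means, the sketch below quantizes one small block of weights to 4-bit integer codes plus a single scale factor, then restores approximate values. This is a simplified stand-in: it uses plain symmetric absmax rounding, whereas QLoRA itself uses a specialised 4-bit data type (NF4). The function names and the sample weights are hypothetical.

```python
def quantize_block(values, bits=4):
    """Symmetric absmax quantization of one block of floats to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4 bits
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0  # one float scale per block
    codes = [round(v / scale) for v in values]    # integers in [-qmax, qmax]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate float weights from codes and the block scale."""
    return [c * scale for c in codes]

# Hypothetical block of frozen base-model weights.
weights = [0.12, -0.40, 0.03, 0.25, -0.07, 0.31, -0.22, 0.00]

codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

In this scheme only the frozen base weights are stored in compressed form. The LoRA adapter matrices stay in higher precision and are the only values that receive gradient updates.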
The technique comes from a 2021 paper by Microsoft researchers (“LoRA: Low-Rank Adaptation of Large Language Models”).