LoRA videos

QLoRA & Lora tweets

Calculate weight size of LLM

(source: [answer.ai](http://nswer.aihttps://www.answer.ai/posts/2024-03-06-fsdp-qlora.html))

QLoRA is a simple but brilliant combination of two critically important advances in modern neural networks: quantization, and LoRA. Quantization is a technique where, instead of using 16 or even 32 bits to store the weights of a neural network, 4 (or even fewer) bits are used. There are only 16 possible values of a 4 bit number, but Dettmers and Zettlemoyer showed that this can be enough in the large language models that are popular today. Tim Dettmers made these 4-bit “quantized” models easy to create, thanks to his bitsandbytes library, and recently Hugging Face has stepped in to help maintain and document this library, particularly thanks to the initiative of Titus von Koeller.

Unfortunately, once a model is quantized, it can not be trained any further with regular approaches – with just 16 possible values, the gradient descent method used for model training will observe zero gradients nearly everywhere, so it can’t make any updates to the quantized weights. This is a major problem, because it means that quantization can only be used for inference, not for continued pre-training or fine-tuning. Whilst inference is useful and important, it’s really just consuming models. But we want everybody to be able to contribute to creating models!

The trick to avoiding this limitation is to use LoRA – “Low-Rank Adaptation of Large Language Models”. LoRA doesn’t train the whole large language model at all, but instead adds “adaptors”, which are very small matrices (generally smaller than 1% of the full model) that are trained, whilst keeping the rest of the model constant. If you’ve played with models like Stable Diffusion, you will have probably seen these adapters many times; it’s how those models are generally shared, and why they are so small and fast to download.

Tim realized that LoRA can be combined with quantization: use a quantized base model, which is not changed at all by the training, and add trainable LoRA adaptors that are not quantized. This combination is QLoRA. Tim’s team was able to use this to, for the first time, train a model that (unquantized) is larger than the GPU: they trained a 65b model (which is 130GB unquantized) on a 48GB card.