Recent low-bit quantization methods for LLMs, such as AQLM and AutoRound, now achieve acceptable degradation on downstream tasks, especially for large models. That said, 2-bit quantization still introduces noticeable accuracy loss in most cases.
One promising algorithm for low-bit quantization is VPTQ (Vector Post-Training Quantization, MIT license), proposed by Microsoft in October 2024. It has since shown excellent performance and efficiency in quantizing large models.
In this article, we will:
- Review the VPTQ quantization algorithm.
- Demonstrate how to use VPTQ models, many of which are already available. For instance, we can easily find low-bit variants of Llama 3.3 70B, Llama 3.1 405B, and Qwen2.5 72B.
- Evaluate these models and discuss the results to understand when VPTQ models can be a good choice for LLMs in production.
Remarkably, 2-bit quantization with VPTQ achieves performance nearly comparable to the original 16-bit model on tasks such as MMLU. Moreover, it makes it possible to run Llama 3.1 405B on a single GPU, using less memory than a 16-bit 70B model!
