Skip to content

GPTQ

The GPT-QModel project (Python package gptqmodel) implements the GPTQ algorithm, a post-training quantization technique where each row of the weight matrix is quantized independently to find a version of the weights that minimizes the error. These weights are quantized to int4, but they’re restored to fp16 on the fly during inference. This can save memory usage by 4x because the int4 weights are dequantized in a fused kernel rather than a GPU’s global memory. Inference is also faster because a lower bitwidth takes less time to communicate.

AutoGPTQ is no longer supported in Transformers. Install GPT-QModel] instead.

Install Accelerate, Transformers and Optimum first.

Terminal window
pip install --upgrade accelerate optimum transformers

Then run the command below to install GPT-QModel].

Terminal window
pip install gptqmodel --no-build-isolation

Create a GPTQConfig class and set the number of bits to quantize to, a dataset to calbrate the weights for quantization, and a tokenizer to prepare the dataset.

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

You can pass your own dataset as a list of strings, but it is highly recommended to use the same dataset from the GPTQ paper.

dataset = ["gptqmodel is an easy-to-use model quantization library with user-friendly apis, based on the GPTQ algorithm."]
gptq_config = GPTQConfig(bits=4, dataset=dataset, tokenizer=tokenizer)

Load a model to quantize and pass GPTQConfig to from_pretrained. Set device_map="auto" to automatically offload the model to a CPU to help fit the model in memory, and allow the model modules to be moved between the CPU and GPU for quantization.

quantized_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", device_map="auto", quantization_config=gptq_config)

If you’re running out of memory because a dataset is too large (disk offloading is not supported), try passing the max_memory parameter to allocate the amount of memory to use on your device (GPU and CPU).

quantized_model = AutoModelForCausalLM.from_pretrained(
"facebook/opt-125m",
device_map="auto",
max_memory={0: "30GiB", 1: "46GiB", "cpu": "30GiB"},
quantization_config=gptq_config
)

Once a model is quantized, you can use push_to_hub to push the model and tokenizer to the Hub where it can be easily shared and accessed. This saves the GPTQConfig.

quantized_model.push_to_hub("opt-125m-gptq")
tokenizer.push_to_hub("opt-125m-gptq")

save_pretrained saves a quantized model locally. If the model was quantized with the device_map parameter, make sure to move the entire model to a GPU or CPU before saving it. The example below saves the model on a CPU.

quantized_model.save_pretrained("opt-125m-gptq")
tokenizer.save_pretrained("opt-125m-gptq")
# if quantized with device_map set
quantized_model.to("cpu")
quantized_model.save_pretrained("opt-125m-gptq")

Reload a quantized model with from_pretrained, and set device_map="auto" to automatically distribute the model on all available GPUs to load the model faster without using more memory than needed.

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto")

Marlin is a 4-bit only CUDA GPTQ kernel, highly optimized for the NVIDIA A100 GPU (Ampere) architecture. Loading, dequantization, and execution of post-dequantized weights are highly parallelized, offering a substantial inference improvement versus the original CUDA GPTQ kernel. Marlin is only available for quantized inference and does not support model quantization.

Marlin inference can be activated with the backend parameter in GPTQConfig.

from transformers import AutoModelForCausalLM, GPTQConfig
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config=GPTQConfig(bits=4, backend="marlin"))

GPT-QModel] is the actively maintained backend for GPTQ in Transformers. It was originally forked from AutoGPTQ, but has since diverged with significant improvements such as faster quantization, lower memory usage, and more accurate defaults.

GPT-QModel] provides asymmetric quantization which can potentially lower quantization errors compared to symmetric quantization. It is not backward compatible with legacy AutoGPTQ checkpoints, and not all kernels (Marlin) support asymmetric quantization.

GPT-QModel] also has broader support for the latest LLM models, multimodal models (Qwen2-VL and Ovis1.6-VL), platforms (Linux, macOS, Windows 11), and hardware (AMD ROCm, Apple Silicon, Intel/AMD CPUs, and Intel Datacenter Max/Arc GPUs, etc.).

The Marlin kernels are also updated for A100 GPUs and other kernels are updated to include auto-padding for legacy models and models with non-uniform in/out-features.

Run the GPTQ quantization with PEFT notebook for a hands-on experience.