
A detailed comparison between GPTQ, AWQ, EXL2, q4_K_M, q4_K_S, and load_in_4bit: perplexity, VRAM, speed, model size, and loading time.

Update 1: added a mention of GPTQ speed through ExLlamav2, which I had not originally measured.

Update 2: Gerganov has created a PR for llama.cpp that optimizes its evaluation/processing speeds and should make the values here obsolete. See the numbers and discussion here.


Many repositories and quantization methods are currently available for running large language models on consumer hardware. I wanted to get a better grasp of the strengths and weaknesses of each, so I collected the data and performed the in-depth analysis below.

Setup

My setup is the following:

  • CUDA: 12.1
  • OS: Linux
  • GPU: RTX 3090

These are the relevant package versions:

  • AutoAWQ: 0.1.4
  • bitsandbytes: 0.41.1
  • ExLlama: 0.0.18 (unofficial wheel by jllllll)
  • ExLlamav2: 0.0.6
  • flash-attention: 2.3.2 (used by ExLlamav2 only)
  • llama-cpp-python: 0.2.11
  • transformers: 4.34

Quantizations

I analyzed the following quantized models:

| Model | Description |
|---|---|
| llama-2-13b (load_in_4bit) | llama-2-13b in HF format loaded with load_in_4bit through the transformers library. |
| llama-2-13b-AWQ-4bit-128g | Created with AutoAWQ, q_group_size=128, w_bit=4, zero_point=True. |
| llama-2-13b-AWQ-4bit-32g | Same as above but with q_group_size=32. |
| llama-2-13b-EXL2-4.000b | Created with ExLlamav2, bits=4, head_bits=6 (default value), wikitext-2-raw-v1 as the calibration file. |
| llama-2-13b-EXL2-4.125b | Same as above but with bits=4.125. |
| llama-2-13b-EXL2-4.250b | Same as above but with bits=4.250. |
| llama-2-13b-EXL2-4.400b | Same as above but with bits=4.400. |
| llama-2-13b-EXL2-4.650b | Same as above but with bits=4.650. |
| llama-2-13b-EXL2-4.900b | Same as above but with bits=4.900. |
| llama-2-13b-GPTQ-4bit-128g-actorder | Created with AutoGPTQ, bits=4, group_size=128, desc_act=True, wikitext-2-raw-v1 as the calibration file. Loaded through ExLlama v1. |
| llama-2-13b-GPTQ-4bit-32g-actorder | Same as above but with group_size=32. |
| llama-2-13b-Q4_K_M.gguf | q4_K_M quant for llama.cpp downloaded from TheBloke. |
| llama-2-13b-Q4_K_S.gguf | q4_K_S quant for llama.cpp downloaded from TheBloke. |

I also tried creating AWQ models with zero_point=False, and while that does generate an output model, it cannot be loaded in AutoAWQ (a warning appears telling you that only zero_point=True is supported).
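
For reference, producing one of these AWQ quants with AutoAWQ boils down to a handful of calls. The sketch below follows the standard AutoAWQ recipe with the settings listed in the table; the paths and the "version" field are placeholders/assumptions on my part, not something measured here:

```python
# Rough sketch of how a 4-bit, group-size-128 AWQ quant is created with AutoAWQ.
# Paths are placeholders; quant_config mirrors the settings from the table above.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-13b-hf"   # placeholder source model
quant_path = "llama-2-13b-AWQ-4bit-128g"   # placeholder output directory

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)  # runs the AWQ calibration pass
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```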

Measurements

For perplexity tests, I used text-generation-webui with the predefined "wikitext" dataset option selected, a stride value of 512, and a context length of 4096.
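
For context, the stride-based perplexity evaluation works roughly as in the sketch below. This is a generic transformers-style sketch of the method, not text-generation-webui's actual code; the model path and the 4-bit loading are placeholders:

```python
# Minimal sketch of strided perplexity on wikitext (stride 512, context 4096).
# Not text-generation-webui's exact code; model id and 4-bit loading are placeholders.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-hf"  # placeholder; substitute your quantized model/loader
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_4bit=True)

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
encodings = tokenizer(text, return_tensors="pt")

max_length, stride = 4096, 512
seq_len = encodings.input_ids.size(1)

nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end                      # only score tokens not seen in earlier windows
    input_ids = encodings.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100               # mask tokens already scored
    with torch.no_grad():
        nlls.append(model(input_ids, labels=target_ids).loss * trg_len)
    prev_end = end
    if end == seq_len:
        break

print("perplexity:", torch.exp(torch.stack(nlls).sum() / prev_end).item())
```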

For VRAM tests, I loaded ExLlama and llama.cpp models with a context length of 1. This makes the models directly comparable to the AWQ and transformers models, for which the cache is not preallocated at load time.
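
As an illustration, loading a GGUF model with a minimal context through llama-cpp-python looks roughly like the sketch below (the path is a placeholder; ExLlama has an analogous max_seq_len setting):

```python
# Rough sketch: load a GGUF model with a context length of 1 so that essentially no
# KV cache is preallocated, making the VRAM reading comparable to AWQ/transformers.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-13b-Q4_K_S.gguf",  # placeholder path
    n_ctx=1,            # minimal context length
    n_gpu_layers=-1,    # offload all layers to the GPU
)
```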

For the speed tests, I generated 800 tokens starting from a 3200-token prompt. The speed measurements are broken down into two parts:

  • Prompt processing time (in seconds): time to process the 3200 tokens before starting the generation.
  • Evaluation time (in seconds): time to generate 800 new tokens after finishing the initial processing.

Additionally, I added a tokens/second column, defined as 800 / (evaluation time). That is, it does not take into consideration the prompt processing time.
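
For example, in the results below, llama-2-13b-EXL2-4.250b took 14.06 seconds to generate the 800 tokens, which works out to 800 / 14.06 ≈ 56.90 tokens/second.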

For GPTQ models, I used ExLlama (v1) as the backend for all measurements. I had previously determined that it is exactly as accurate as AutoGPTQ, and it is a lot faster.

The results

These are the results sorted in ascending perplexity order (lower is better):

| Model | Perplexity (wikitext) | VRAM (MB) | Model size (MB) | Prompt processing time (3200 tokens, s) | Evaluation time (800 tokens, s) | Loading time (s) | Tokens/second |
|---|---|---|---|---|---|---|---|
| llama-2-13b-EXL2-4.900b | 4.30752 | 9305 | 7860 | 1.76 | 15.37 | 8.12 | 52.05 |
| llama-2-13b-EXL2-4.650b | 4.32136 | 9025 | 7481 | 1.74 | 14.17 | 8.20 | 56.46 |
| llama-2-13b-AWQ-4bit-32g | 4.32522 | 10567 | 7624 | 3.60 | 20.27 | 11.45 | 39.47 |
| llama-2-13b-Q4_K_M.gguf | 4.33326 | 8985 | 7502 | 3.73 | 25.95 | 9.90 | 30.83 |
| llama-2-13b-GPTQ-4bit-32g-actorder | 4.33805 | 8701 | 7633 | 1.86 | 18.85 | 7.58 | 42.44 |
| llama-2-13b-EXL2-4.400b | 4.33843 | 8591 | 7104 | 1.75 | 14.12 | 7.53 | 56.66 |
| llama-2-13b-Q4_K_S.gguf | 4.34246 | 8553 | 7071 | 3.68 | 22.66 | 9.31 | 35.30 |
| llama-2-13b-AWQ-4bit-128g | 4.34761 | 9623 | 6915 | 3.59 | 19.70 | 11.03 | 40.61 |
| llama-2-13b-EXL2-4.250b | 4.34897 | 8339 | 6876 | 1.68 | 14.06 | 7.55 | 56.90 |
| llama-2-13b-GPTQ-4bit-128g-actorder | 4.35793 | 7935 | 6924 | 1.85 | 15.41 | 6.84 | 51.91 |
| llama-2-13b (load_in_4bit) | 4.36427 | 8193 | 24829 | 3.01 | 34.70 | 20.81 | 23.05 |
| llama-2-13b-EXL2-4.125b | 4.36984 | 8107 | 6687 | 1.69 | 14.13 | 7.10 | 56.62 |
| llama-2-13b-EXL2-4.000b | 4.37648 | 7883 | 6498 | 1.71 | 14.09 | 7.57 | 56.78 |

Here is the same data in image format (I find it easier to read):

[Image: the same data as the table above]

Pareto frontiers

The goal of every quantization method is to simultaneously minimize the size and the perplexity of the model. In this context, the concept of a Pareto frontier becomes relevant: a model is on the Pareto frontier if no other model has both a smaller size and a lower perplexity.

We can make some plots and look for Pareto frontiers to see what models are optimal.
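
As a quick illustration, the sketch below extracts the model-size-on-disk vs perplexity frontier from a subset of the measurements above; the pareto_frontier helper is just for illustration, not part of any library used here:

```python
# Small sketch: keep a model only if no other model is both smaller and lower-perplexity.
# The numbers are (model size on disk in MB, perplexity) pairs from the table above.
def pareto_frontier(points):
    frontier = []
    for name, size, ppl in points:
        dominated = any(
            other_size <= size and other_ppl <= ppl and (other_size, other_ppl) != (size, ppl)
            for _, other_size, other_ppl in points
        )
        if not dominated:
            frontier.append(name)
    return frontier

models = [
    ("EXL2-4.650b",    7481, 4.32136),
    ("AWQ-4bit-32g",   7624, 4.32522),
    ("Q4_K_M",         7502, 4.33326),
    ("EXL2-4.400b",    7104, 4.33843),
    ("Q4_K_S",         7071, 4.34246),
    ("AWQ-4bit-128g",  6915, 4.34761),
    ("EXL2-4.250b",    6876, 4.34897),
    ("GPTQ-4bit-128g", 6924, 4.35793),
    ("EXL2-4.000b",    6498, 4.37648),
]

print(pareto_frontier(models))
# ['EXL2-4.650b', 'EXL2-4.400b', 'Q4_K_S', 'AWQ-4bit-128g', 'EXL2-4.250b', 'EXL2-4.000b']
```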

Perplexity vs model size

Two plots tell two complementary stories. The first one is perplexity as a function of VRAM:

[Plot: perplexity vs VRAM]

The second one is perplexity as a function of model size on disk:

[Plot: perplexity vs model size on disk]

AWQ

The basic question is: is AWQ better than GPTQ? The AWQ models have lower perplexity and smaller sizes on disk than their GPTQ counterparts (with the same group size), but their VRAM usage is a lot higher. So, "sort of".

If we ignore VRAM and look at the model size alone, llama-2-13b-EXL2-4.650b dominates llama-2-13b-AWQ-4bit-32g in both size and perplexity, while llama-2-13b-AWQ-4bit-128g and llama-2-13b-EXL2-4.250b are very close to each other and appear simultaneously in the model size vs perplexity Pareto frontier.

GPTQ

The next question is "Is EXL2 better than GPTQ?"

  • llama-2-13b-EXL2-4.250b has lower perplexity than llama-2-13b-GPTQ-4bit-128g-actorder and is smaller (on disk), but it uses more VRAM.
  • llama-2-13b-EXL2-4.650b has lower perplexity than llama-2-13b-GPTQ-4bit-32g-actorder and is smaller (on disk), but it uses more VRAM.

As a consequence, the 4 models above all appear in the VRAM vs perplexity Pareto frontier.

llama.cpp

  • llama-2-13b-Q4_K_S.gguf appears in both Pareto frontiers, so it holds its ground. Its perplexity is between llama-2-13b-EXL2-4.250b and llama-2-13b-EXL2-4.400b.
  • llama-2-13b-Q4_K_M.gguf is dominated by llama-2-13b-EXL2-4.650b in perplexity and model size on disk, but it is not dominated in VRAM due to a 40 MB difference. As a consequence, it is in the VRAM vs perplexity Pareto frontier, but in a way that I would classify as borderline, as the difference in perplexity is more significant than the difference in VRAM.

Overall, I am impressed with the accuracy of the llama.cpp quants. They take only a few minutes to create, versus more than ten times as long for GPTQ, AWQ, or EXL2, so I did not expect them to appear in any Pareto frontier.

Prompt processing speed

Moving on to speeds:

[Plot: prompt processing time (3200 tokens)]

EXL2 is the fastest, followed by GPTQ through ExLlama v1. llama.cpp is the slowest, taking 2.22x longer than ExLlamav2 to process a 3200-token prompt.
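
(From the table above: 3.73 seconds for llama-2-13b-Q4_K_M.gguf versus 1.68 seconds for llama-2-13b-EXL2-4.250b, and 3.73 / 1.68 ≈ 2.22.)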

The prompt processing speeds of load_in_4bit and AutoAWQ are not impressive.

Evaluation speed

The following two plots tell the same story:

[Plot: evaluation time (800 tokens)]

[Plot: tokens/second]

When it comes to evaluation speed (the speed of generating tokens after having already processed the prompt), EXL2 is the fastest. load_in_4bit is the slowest, followed by llama.cpp. EXL2 generates 147% more tokens/second than load_in_4bit and 85% more tokens/second than llama.cpp.
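
(From the table above: 56.90 / 23.05 ≈ 2.47 and 56.90 / 30.83 ≈ 1.85, hence the 147% and 85% figures.)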

ExLlama v1 vs ExLlama v2 GPTQ speed (update)

I had originally measured the GPTQ speeds through ExLlama v1 only, but turboderp pointed out that GPTQ is faster on ExLlama v2, so I collected the following additional data for the model llama-2-13b-hf-GPTQ-4bit-128g-actorder to verify:

| Backend | Prompt processing (3200 tokens, s) | Evaluation (800 tokens, s) | Tokens/second |
|---|---|---|---|
| ExLlama (v1) | 1.85 | 15.34 | 52.15 |
| ExLlama (v2) | 1.68 | 12.48 | 64.10 |

The prompt processing time of 1.68 seconds is identical to the previous record holder, which was llama-2-13b-EXL2-4.250b through ExLlamav2.

Meanwhile, the evaluation time sets a new record: the previous best was llama-2-13b-EXL2-4.250b with 14.06 seconds. So GPTQ through ExLlamav2 is actually the model with the fastest evaluation speed of all, with about 13% more tokens/second than the previous record holder and about 23% more than the same model on ExLlama v1.

Loading time

Finally, let's look at the time to load the model:

[Plot: loading time]

load_in_4bit takes a lot longer because it has to read and convert the 16-bit model on the fly. It is useful to look at the plot without it:

[Plot: loading time, excluding load_in_4bit]

In this case, ExLlama v1 (loading the GPTQ model) is the fastest, and AutoAWQ is the slowest.


Support My Work

If you appreciate what I do, consider supporting me.

My LLM work has been supported by a grant from Andreessen Horowitz (a16z), to which I am very grateful.