
A direct comparison between llama.cpp, AutoGPTQ, ExLlama, and transformers perplexities

Update 1: I added tests with 128g + desc_act using ExLlama. They are marked with (new)

Update 2: also added a test for 30b with 128g + desc_act using ExLlama.

Update 3: the takeaway messages have been updated in light of the latest data.

Update 4: added llama-65b.ggmlv3.q2_K (2-bit) test with llama.cpp.


After learning that I could get 1-2 tokens/second for llama-65b on my computer using llama.cpp, I became curious to measure its accuracy. How does it compare to GPTQ?

This led to further questions:

  • ExLlama is a lot faster than AutoGPTQ. Is it as accurate?
  • How does the bitsandbytes load_in_4bit option compare to all of the above? (See the sketch after this list.)
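For context: --load-in-4bit and --use_double_quant are text-generation-webui flags, and they roughly correspond to the bitsandbytes 4-bit loading options in plain transformers. This is a minimal sketch of that correspondence, not the webui code; the model id is just an example:

```python
# Rough plain-transformers equivalent of --load-in-4bit --use_double_quant
# (a sketch; the model id below is an example, not one of the tested files).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights via bitsandbytes
    bnb_4bit_use_double_quant=True,        # the --use_double_quant option
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-13b",                # example model id
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-13b")
```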

The authors of all of these backends take perplexity seriously and have performed their own tests, but I felt that a direct comparison, using not only the same method but also the same code, was missing. This matters because small differences in the perplexity evaluation can lead to numbers that are not directly comparable.

How I did it

The idea is to trick the transformers library into thinking that llama.cpp and ExLlama are transformers models, and then evaluate their perplexities.

This is done by creating a wrapper for the model. The first such wrapper was "ExLlama_HF", created by LarryVRH in this PR.

What I did was start from Larry's code and

1) Make ExLlama_HF functional for evaluation.

2) Create a llama.cpp_HF wrapper that is also functional for evaluation.

Each of these took more hours to get working than I am willing to admit, but lo and behold, it worked.
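The actual ExLlama_HF and llamacpp_HF implementations handle caching and other details, but the general shape of such a wrapper looks roughly like this. This is a simplified sketch, not the webui code; `backend` and its `get_logits` method are placeholders for the real ExLlama or llama.cpp object:

```python
# Simplified sketch of an "_HF" wrapper (not the actual webui code).
# `backend` stands for an external model (ExLlama or llama.cpp) that can
# return per-token logits for a batch of input ids.
import torch
from transformers import PretrainedConfig, PreTrainedModel
from transformers.modeling_outputs import CausalLMOutputWithPast


class ExternalModelHF(PreTrainedModel):
    def __init__(self, config: PretrainedConfig, backend):
        super().__init__(config)
        self.backend = backend  # placeholder for the real ExLlama/llama.cpp object

    def forward(self, input_ids=None, labels=None, **kwargs):
        # Ask the external backend for logits, shape (batch, seq_len, vocab_size).
        logits = self.backend.get_logits(input_ids)  # placeholder method

        loss = None
        if labels is not None:
            # Standard causal LM loss: predict token t+1 from tokens up to t.
            shift_logits = logits[:, :-1, :].contiguous()
            shift_labels = labels[:, 1:].contiguous()
            loss = torch.nn.functional.cross_entropy(
                shift_logits.view(-1, shift_logits.size(-1)),
                shift_labels.view(-1),
            )

        return CausalLMOutputWithPast(loss=loss, logits=logits)
```

With the logits and loss exposed this way, the same transformers-based evaluation code can be pointed at any backend.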

Evaluation setup

All tests were performed inside text-generation-webui, using the evaluation code here.

The ExLlama tests use the code in this PR, and the llama.cpp tests use the code in this PR. I haven't merged these PRs yet, but they will be included in the 1.2 release.

For GPTQ tests, I used models with groupsize 128 and no desc_act, which are the ones that are widely used.
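For reference, "groupsize 128 and no desc_act" refers to the settings those models were quantized with. In AutoGPTQ terms that roughly corresponds to the following config (a sketch, assuming the auto-gptq library):

```python
# Sketch of the GPTQ settings in question, expressed as an AutoGPTQ config
# (values mirror the "128g, no desc_act" models used in these tests).
from auto_gptq import BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit quantization
    group_size=128,  # the "128g" in the model names
    desc_act=False,  # act-order disabled (the widely shared variant)
)
```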

Results

First I will show the results of my personal tests, which are based on the following setup:

  • A .txt input file containing some technical blog posts and papers that I collected. It is a lot smaller and faster to evaluate than wikitext, but I find that it correlates perfectly with bigger evaluations.
  • Context length of 1200 (otherwise llama-30b-4bit-128g with AutoGPTQ runs out of memory on my RTX 3090).
  • Stride length of 512 (the evaluation loop is sketched after this list).
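The numbers below come from the webui's own evaluation code, but the underlying method is the standard strided sliding-window perplexity calculation. Here is a minimal sketch of that loop, assuming `model` returns a causal LM loss when given labels and that `tokenizer` and the input text are already loaded:

```python
# Minimal sketch of strided perplexity evaluation (not the actual webui code).
import torch

def perplexity(model, tokenizer, text, max_length=1200, stride=512):
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    seq_len = input_ids.size(1)

    nlls = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        target_len = end - prev_end      # only score tokens not seen in a previous window
        chunk = input_ids[:, begin:end]
        labels = chunk.clone()
        labels[:, :-target_len] = -100   # mask the overlapping prefix

        with torch.no_grad():
            loss = model(chunk, labels=labels).loss
        nlls.append(loss * target_len)

        prev_end = end
        if end == seq_len:
            break

    return torch.exp(torch.stack(nlls).sum() / prev_end)
```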

These are the numbers:

| Model | Perplexity | Backend |
| --- | --- | --- |
| llama-65b.ggmlv3.q4_K_M.bin | 4.90639 | llama.cpp |
| llama-65b.ggmlv3.q3_K_M.bin | 5.01299 | llama.cpp |
| llama-30b.ggmlv3.q4_K_M.bin | 5.21557 | llama.cpp |
| llama-30b | 5.24609 | transformers with --load-in-4bit --use_double_quant |
| Neko-Institute-of-Science_LLaMA-30B-4bit-128g (new, with desc_act) | 5.25923 | ExLlama |
| llama-30b-4bit-128g | 5.30078 | AutoGPTQ |
| llama-65b.ggmlv3.q2_K.bin | 5.44745 | llama.cpp |
| llama-13b.ggmlv3.q4_K_M.bin | 5.71705 | llama.cpp |
| llama-13b-4bit-128g | 5.72581 | ExLlama |
| llama-13b-4bit-128g | 5.72656 | AutoGPTQ |
| llama-13b | 5.73047 | transformers with --load-in-4bit --use_double_quant |
| llama-13b | 5.73047 | transformers with --load-in-4bit |
| Neko-Institute-of-Science_LLaMA-13B-4bit-128g (new, with desc_act) | 5.74437 | ExLlama |
| galactica-30b-4bit-128g | 6.07812 | AutoGPTQ |
| llama-7b | 6.14453 | 16-bit (no quantization) |
| facebook_galactica-30b | 6.16016 | transformers with --load-in-4bit |
| llama-7b | 6.24219 | transformers with --load-in-4bit |
| llama-7b.ggmlv3.q4_K_M.bin | 6.26391 | llama.cpp |
| Neko-Institute-of-Science_LLaMA-7B-4bit-128g (new, with desc_act) | 6.28790 | ExLlama |
| llama-7b-4bit | 6.47835 | ExLlama |
| llama-7b-4bit | 6.48438 | AutoGPTQ |
| llama-7b-4bit-128g | 6.54463 | ExLlama |
| llama-7b-4bit-128g | 6.54688 | AutoGPTQ |
| facebook_galactica-6.7b | 6.78906 | 16-bit (no quantization) |
| tiiuae_falcon-7b | 7.33203 | 16-bit (no quantization) |

As a follow-up, I ran a more thorough test with wikitext for llama-13b, using a context length of 2048 and the same stride of 512. This took 2 hours for llama.cpp with all layers offloaded to the GPU (see the sketch after the table). These were the results:

| Model | Perplexity | Backend |
| --- | --- | --- |
| llama-13b.ggmlv3.q4_K_M.bin | 4.58748 | llama.cpp |
| Neko-Institute-of-Science_LLaMA-13B-4bit-128g (new, with desc_act) | 4.60102 | ExLlama |
| llama-13b | 4.60156 | transformers with --load-in-4bit |
| llama-13b-4bit-128g | 4.66016 | ExLlama |
| llama-13b-4bit-128g | 4.66073 | AutoGPTQ |
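For reference, "all layers offloaded to the GPU" looks roughly like this in llama-cpp-python terms (a sketch; the model path is just an example):

```python
# Rough llama-cpp-python equivalent of the llama.cpp settings used above
# (a sketch; the path below is an example, not the exact file from the table).
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-13b.ggmlv3.q4_K_M.bin",  # example path
    n_ctx=2048,         # context length used for the wikitext test
    n_gpu_layers=1000,  # more layers than the model has, so everything is offloaded
    logits_all=True,    # keep logits for every position (needed for perplexity)
)
```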

Key takeaways

  • For 13b and 30b, llama.cpp q4_K_M wins.
  • The perplexity of llama-65b in llama.cpp is indeed lower than for llama-30b in all other backends.
  • For 7b and 13b, ExLlama is as accurate as AutoGPTQ (in fact, its perplexity is a tiny bit lower), confirming that its GPTQ reimplementation has been successful.
  • (updated) For GPTQ, you should be using models with groupsize AND desc_act on ExLlama unless you have a specific reason to use something else.
  • (updated) bitsandbytes load_in_4bit vs GPTQ + desc_act: load_in_4bit wins in 3 out of 4 tests, but the difference is not big.

Support My Work

If you appreciate what I do, consider supporting me: