A direct comparison between llama.cpp, AutoGPTQ, ExLlama, and transformers perplexities
Update 1: I added tests with 128g + desc_act using ExLlama. They are marked with (new)
Update 2: also added a test for 30b with 128g + desc_act using ExLlama.
Update 3: the takeaway messages have been updated in light of the latest data.
Update 4: added llama-65b.ggmlv3.q2_K (2-bit) test with llama.cpp.
After learning that I could get 1-2 tokens/second for llama-65b on my computer using llama.cpp, I became curious to measure its accuracy. How does it compare to GPTQ?
This led to further questions:
- ExLlama is a lot faster than AutoGPTQ. Is it as accurate?
- How does the bitsandbytes load_in_4bit option compare to all of the previous?
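For reference, here is a minimal sketch (not text-generation-webui's actual loader code, just my assumption of what those flags map to) of how the --load-in-4bit and --use_double_quant options that appear in the tables below translate to the transformers/bitsandbytes API. The model path is a placeholder.

```python
# Minimal sketch: 4-bit loading with bitsandbytes through transformers.
# The model path is a placeholder; the flag-to-kwarg mapping is my assumption.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights via bitsandbytes
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for the actual matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-13b",                   # placeholder path
    quantization_config=bnb_config,
    device_map="auto",
)
```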
The authors of all of those backends take perplexity seriously and have performed their own tests, but I felt like a direct comparison, using not only the same method but also the same code, was lacking. I find this fundamental because small differences in the perplexity evaluation can lead to numbers that are not directly comparable.
How I did it
The idea is to trick the transformers library into thinking that llama.cpp and ExLlama are transformers models, and then evaluate their perplexities.
This is done by creating a wrapper for the model. The first such wrapper was "ExLlama_HF", created by LarryVRH in this PR.
Starting from Larry's code, I:
1) Made ExLlama_HF functional for evaluation.
2) Created a llama.cpp_HF wrapper that is also functional for evaluation.
Each of these took more hours to get working than I am willing to admit, but lo and behold, it worked.
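To illustrate the idea, here is a simplified sketch (not the actual ExLlama_HF or llama.cpp_HF code) of what such a wrapper has to provide for evaluation: accept input_ids and labels like a transformers causal LM, return logits, and compute the shifted cross-entropy loss. The backend object and its compute_logits method are hypothetical placeholders for whatever the underlying library exposes.

```python
# Simplified sketch of an "_HF" wrapper; `backend` and `compute_logits` are
# hypothetical stand-ins for the underlying ExLlama or llama.cpp objects.
import torch
from transformers import PretrainedConfig, PreTrainedModel
from transformers.modeling_outputs import CausalLMOutputWithPast


class BackendWrapper(PreTrainedModel):
    def __init__(self, backend, config: PretrainedConfig):
        super().__init__(config)
        self.backend = backend

    def forward(self, input_ids, labels=None, **kwargs):
        # Get logits for the whole sequence from the non-transformers backend.
        logits = self.backend.compute_logits(input_ids)  # hypothetical call

        loss = None
        if labels is not None:
            # Standard causal-LM loss: predict token t+1 from tokens <= t.
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            loss = torch.nn.functional.cross_entropy(
                shift_logits.view(-1, shift_logits.size(-1)),
                shift_labels.view(-1),  # positions set to -100 are ignored
            )

        return CausalLMOutputWithPast(loss=loss, logits=logits)
```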
Evaluation setup
All tests are performed inside text-generation-webui. It uses the code here.
The ExLlama tests use the code in this PR, and the llama.cpp tests use the code in this PR. I haven't merged them yet, but they will be in the 1.2 release.
For GPTQ tests, I used models with groupsize 128 and no desc_act, which are the ones that are widely used.
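For context, the evaluation is a standard sliding-window ("strided") perplexity calculation. Below is a minimal sketch following the usual transformers recipe, not the exact code linked above; ctx_len and stride correspond to the context length and stride reported with the results.

```python
# Minimal sketch of strided perplexity evaluation, assuming a causal LM that
# accepts input_ids and labels and returns a mean cross-entropy loss.
import torch


def perplexity(model, tokenizer, text, ctx_len=1200, stride=512):
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    seq_len = input_ids.size(1)

    nlls = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + ctx_len, seq_len)
        scored = end - prev_end            # tokens scored for the first time here
        window = input_ids[:, begin:end].to(model.device)
        targets = window.clone()
        targets[:, :-scored] = -100        # ignore tokens already scored earlier

        with torch.no_grad():
            out = model(window, labels=targets)
        nlls.append(out.loss * scored)     # loss is a mean, scale back to a (rough) sum

        prev_end = end
        if end == seq_len:
            break

    return torch.exp(torch.stack(nlls).sum() / prev_end).item()
```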
Results
First I will show the results of my personal tests, which are based on the following setup:
- A .txt input file containing some technical blog posts and papers that I collected. It is a lot smaller and faster to evaluate than wikitext, but I find that it correlates perfectly with bigger evaluations.
- Context length of 1200 (otherwise llama-30b-4bit-128g with AutoGPTQ runs out of memory on my RTX 3090).
- Stride length of 512.
These are the numbers:
Model | Perplexity | Backend |
---|---|---|
llama-65b.ggmlv3.q4_K_M.bin | 4.90639 | llama.cpp |
llama-65b.ggmlv3.q3_K_M.bin | 5.01299 | llama.cpp |
llama-30b.ggmlv3.q4_K_M.bin | 5.21557 | llama.cpp |
llama-30b | 5.24609 | transformers with --load-in-4bit --use_double_quant |
Neko-Institute-of-Science_LLaMA-30B-4bit-128g (new, with desc_act) | 5.25923 | ExLlama |
llama-30b-4bit-128g | 5.30078 | AutoGPTQ |
llama-65b.ggmlv3.q2_K.bin | 5.44745 | llama.cpp |
llama-13b.ggmlv3.q4_K_M.bin | 5.71705 | llama.cpp |
llama-13b-4bit-128g | 5.72581 | ExLlama |
llama-13b-4bit-128g | 5.72656 | AutoGPTQ |
llama-13b | 5.73047 | transformers with --load-in-4bit --use_double_quant |
llama-13b | 5.73047 | transformers with --load-in-4bit |
Neko-Institute-of-Science_LLaMA-13B-4bit-128g (new, with desc_act) | 5.74437 | ExLlama |
galactica-30b-4bit-128g | 6.07812 | AutoGPTQ |
llama-7b | 6.14453 | 16-bit (no quantization) |
facebook_galactica-30b | 6.16016 | transformers with --load-in-4bit |
llama-7b | 6.24219 | transformers with --load-in-4bit |
llama-7b.ggmlv3.q4_K_M.bin | 6.26391 | llama.cpp |
Neko-Institute-of-Science_LLaMA-7B-4bit-128g (new, with desc_act) | 6.28790 | ExLlama |
llama-7b-4bit | 6.47835 | ExLlama |
llama-7b-4bit | 6.48438 | AutoGPTQ |
llama-7b-4bit-128g | 6.54463 | ExLlama |
llama-7b-4bit-128g | 6.54688 | AutoGPTQ |
facebook_galactica-6.7b | 6.78906 | 16-bit (no quantization) |
tiiuae_falcon-7b | 7.33203 | 16-bit (no quantization) |
As a follow-up, I made a more thorough test with wikitext for llama-13b using 2048 context length and the same 512 stride. This took 2 hours for llama.cpp with all layers offloaded to the GPU. These were the results:
Model | Perplexity | Backend |
---|---|---|
llama-13b.ggmlv3.q4_K_M.bin | 4.58748 | llama.cpp |
Neko-Institute-of-Science_LLaMA-13B-4bit-128g (new, with desc_act) | 4.60102 | ExLlama |
llama-13b | 4.60156 | transformers with --load-in-4bit |
llama-13b-4bit-128g | 4.66016 | ExLlama |
llama-13b-4bit-128g | 4.66073 | AutoGPTQ |
Key takeaways
- For 13b and 30b, llama.cpp q4_K_M wins.
- The perplexity of llama-65b in llama.cpp is indeed lower than for llama-30b in all other backends.
- For 7b and 13b, ExLlama is as accurate as AutoGPTQ (its perplexity is actually a tiny bit lower), confirming that its GPTQ reimplementation has been successful.
- (updated) For GPTQ, you should be using models with a groupsize AND desc_act on ExLlama unless you have a specific reason to use something else (see the short sketch after this list for what these options mean).
- (updated) bitsandbytes load_in_4bit vs GPTQ + desc_act: load_in_4bit wins in 3 out of 4 tests, but the difference is not big.
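For anyone unfamiliar with the GPTQ options mentioned above, the snippet below is an illustrative sketch of what groupsize and desc_act refer to at quantization time, using AutoGPTQ's config object. It is not how the models in the tables were produced; each repo defines its own settings in its quantize_config.json.

```python
# Illustrative only: the groupsize/desc_act options in AutoGPTQ's quantization
# config. The values match the "128g + desc_act" models tested above, but the
# actual repos define their own settings.
from auto_gptq import BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit weights
    group_size=128,  # one set of quantization parameters per group of 128 weights
    desc_act=True,   # "act-order": process columns by decreasing activation magnitude
)
```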
Support My Work
If you appreciate what I do, consider supporting me: