A direct comparison between llama.cpp, AutoGPTQ, ExLlama, and transformers perplexities
Update 1: I added tests with 128g + desc_act using ExLlama. They are marked with (new)
Update 2: also added a test for 30b with 128g + desc_act using ExLlama.
Update 3: the takeaway messages have been updated in light of the latest data.
Update 4: added llama-65b.ggmlv3.q2_K (2-bit) test with llama.cpp.
After learning that I could get 1-2 tokens/second for llama-65b on my computer using llama.cpp, I became curious to measure its accuracy. How does it compare to GPTQ?
This led to further questions:
- ExLlama is a lot faster than AutoGPTQ. Is it as accurate?
- How does the bitsandbytes load_in_4bit option compare to all of the previous?
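For reference, here is a minimal sketch (not text-generation-webui's actual loader code, just my assumption of what those flags map to) of how the --load-in-4bit and --use_double_quant options that appear in the tables below translate to the transformers/bitsandbytes API. The model path is a placeholder.

```python
# Minimal sketch: 4-bit loading with bitsandbytes through transformers.
# The model path is a placeholder; the flag-to-kwarg mapping is my assumption.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights via bitsandbytes
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for the actual matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-13b",                   # placeholder path
    quantization_config=bnb_config,
    device_map="auto",
)
```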
The authors of all of those backends take perplexity seriously and have performed their own tests, but I felt like a direct comparison, using not only the same method but also the same code, was lacking. I find this fundamental because small differences in the perplexity evaluation can lead to numbers that are not directly comparable.
How I did it
The idea is to trick the transformers library into thinking that llama.cpp and ExLlama are transformers models, and then evaluate their perplexities.
This is done by creating a wrapper for the model. The first such wrapper was "ExLlama_HF", created by LarryVRH in this PR.
Starting from Larry's code, I:
1) Made ExLlama_HF functional for evaluation.
2) Created a llama.cpp_HF wrapper that is also functional for evaluation.
Each of these took more hours to get working than I am willing to admit, but lo and behold, it worked.
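To illustrate the idea, here is a simplified sketch (not the actual ExLlama_HF or llama.cpp_HF code) of what such a wrapper has to provide for evaluation: accept input_ids and labels like a transformers causal LM, return logits, and compute the shifted cross-entropy loss. The backend object and its compute_logits method are hypothetical placeholders for whatever the underlying library exposes.

```python
# Simplified sketch of an "_HF" wrapper; `backend` and `compute_logits` are
# hypothetical stand-ins for the underlying ExLlama or llama.cpp objects.
import torch
from transformers import PretrainedConfig, PreTrainedModel
from transformers.modeling_outputs import CausalLMOutputWithPast


class BackendWrapper(PreTrainedModel):
    def __init__(self, backend, config: PretrainedConfig):
        super().__init__(config)
        self.backend = backend

    def forward(self, input_ids, labels=None, **kwargs):
        # Get logits for the whole sequence from the non-transformers backend.
        logits = self.backend.compute_logits(input_ids)  # hypothetical call

        loss = None
        if labels is not None:
            # Standard causal-LM loss: predict token t+1 from tokens <= t.
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            loss = torch.nn.functional.cross_entropy(
                shift_logits.view(-1, shift_logits.size(-1)),
                shift_labels.view(-1),  # positions set to -100 are ignored
            )

        return CausalLMOutputWithPast(loss=loss, logits=logits)
```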
Evaluation setup
All tests are performed inside text-generation-webui. It uses the code here.
The ExLlama tests use the code in this PR, and the llama.cpp tests use the code in this PR. I haven't merged them yet, but they will be in the 1.2 release.
For GPTQ tests, I used models with groupsize 128 and no desc_act, which are the ones that are widely used.
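For context, the evaluation is a standard sliding-window ("strided") perplexity calculation. Below is a minimal sketch following the usual transformers recipe, not the exact code linked above; ctx_len and stride correspond to the context length and stride reported with the results.

```python
# Minimal sketch of strided perplexity evaluation, assuming a causal LM that
# accepts input_ids and labels and returns a mean cross-entropy loss.
import torch


def perplexity(model, tokenizer, text, ctx_len=1200, stride=512):
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    seq_len = input_ids.size(1)

    nlls = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + ctx_len, seq_len)
        scored = end - prev_end            # tokens scored for the first time here
        window = input_ids[:, begin:end].to(model.device)
        targets = window.clone()
        targets[:, :-scored] = -100        # ignore tokens already scored earlier

        with torch.no_grad():
            out = model(window, labels=targets)
        nlls.append(out.loss * scored)     # loss is a mean, scale back to a (rough) sum

        prev_end = end
        if end == seq_len:
            break

    return torch.exp(torch.stack(nlls).sum() / prev_end).item()
```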
Results
First I will show the results of my personal tests, which are based on the following setup:
- A .txt input file containing some technical blog posts and papers that I collected. It is a lot smaller and faster to evaluate than wikitext, but I find that it correlates perfectly with bigger evaluations.
- Context length of 1200 (otherwise llama-30b-4bit-128g with AutoGPTQ runs out of memory on my RTX 3090).
- Stride length of 512.
These are the numbers:
Model | Perplexity | Backend |
---|---|---|
llama-65b.ggmlv3.q4_K_M.bin | 4.90639 | llama.cpp |
llama-65b.ggmlv3.q3_K_M.bin | 5.01299 | llama.cpp |
llama-30b.ggmlv3.q4_K_M.bin | 5.21557 | llama.cpp |
llama-30b | 5.24609 | transformers with --load-in-4bit --use_double_quant |
Neko-Institute-of-Science_LLaMA-30B-4bit-128g (new, with desc_act) | 5.25923 | ExLlama |
llama-30b-4bit-128g | 5.30078 | AutoGPTQ |
llama-65b.ggmlv3.q2_K.bin | 5.44745 | llama.cpp |
llama-13b.ggmlv3.q4_K_M.bin | 5.71705 | llama.cpp |
llama-13b-4bit-128g | 5.72581 | ExLlama |
llama-13b-4bit-128g | 5.72656 | AutoGPTQ |
llama-13b | 5.73047 | transformers with --load-in-4bit --use_double_quant |
llama-13b | 5.73047 | transformers with --load-in-4bit |
Neko-Institute-of-Science_LLaMA-13B-4bit-128g (new, with desc_act) | 5.74437 | ExLlama |
galactica-30b-4bit-128g | 6.07812 | AutoGPTQ |
llama-7b | 6.14453 | 16-bit (no quantization) |
facebook_galactica-30b | 6.16016 | transformers with --load-in-4bit |
llama-7b | 6.24219 | transformers with --load-in-4bit |
llama-7b.ggmlv3.q4_K_M.bin | 6.26391 | llama.cpp |
Neko-Institute-of-Science_LLaMA-7B-4bit-128g (new, with desc_act) | 6.28790 | ExLlama |
llama-7b-4bit | 6.47835 | ExLlama |
llama-7b-4bit | 6.48438 | AutoGPTQ |
llama-7b-4bit-128g | 6.54463 | ExLlama |
llama-7b-4bit-128g | 6.54688 | AutoGPTQ |
facebook_galactica-6.7b | 6.78906 | 16-bit (no quantization) |
tiiuae_falcon-7b | 7.33203 | 16-bit (no quantization) |
As a follow-up, I made a more thorough test with wikitext for llama-13b using 2048 context length and the same 512 stride. This took 2 hours for llama.cpp with all layers offloaded to the GPU. These were the results:
Model | Perplexity | Backend |
---|---|---|
llama-13b.ggmlv3.q4_K_M.bin | 4.58748 | llama.cpp |
Neko-Institute-of-Science_LLaMA-13B-4bit-128g (new, with desc_act) | 4.60102 | ExLlama |
llama-13b | 4.60156 | transformers with --load-in-4bit |
llama-13b-4bit-128g | 4.66016 | ExLlama |
llama-13b-4bit-128g | 4.66073 | AutoGPTQ |
Key takeaways
- For 13b and 30b, llama.cpp q4_K_M wins.
- The perplexity of llama-65b in llama.cpp is indeed lower than for llama-30b in all other backends.
- For 7b and 13b, ExLlama is as accurate as AutoGPTQ (its perplexity is actually a tiny bit lower), confirming that its GPTQ reimplementation has been successful.
- (updated) For GPTQ, you should be using models with a groupsize AND desc_act on ExLlama unless you have a specific reason to use something else (see the short sketch after this list for what these options mean).
- (updated) bitsandbytes load_in_4bit vs GPTQ + desc_act: load_in_4bit wins in 3 out of 4 tests, but the difference is not big.
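For anyone unfamiliar with the GPTQ options mentioned above, the snippet below is an illustrative sketch of what groupsize and desc_act refer to at quantization time, using AutoGPTQ's config object. It is not how the models in the tables were produced; each repo defines its own settings in its quantize_config.json.

```python
# Illustrative only: the groupsize/desc_act options in AutoGPTQ's quantization
# config. The values match the "128g + desc_act" models tested above, but the
# actual repos define their own settings.
from auto_gptq import BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit weights
    group_size=128,  # one set of quantization parameters per group of 128 weights
    desc_act=True,   # "act-order": process columns by decreasing activation magnitude
)
```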
Support My Work
If you appreciate what I do, consider supporting me: