Benchmarks for lots of quantization types in llama.cpp
Recently, I noticed that lots of new quantization types had been added to llama.cpp. Out of curiosity, I decided to run some simple tests on every quantization type llama.cpp offers. There are 27 quantization types in total, including F16 and F32.
Measurement Setup
The llama.cpp commit used for this measurement is d5ab2975 (tag b2296).
The model used for this measurement is meta-llama/Llama-2-7b-chat-hf.
The imatrix calibration data for this measurement was generated from 20k_random_data.txt, taken from the discussion “Importance matrix calculations work best on near-random data” (#5006). The imatrix was used only for quantizing the I-Quant types such as IQ2_XS and IQ4_NL.
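For reference, generating and applying the imatrix looks roughly like this (a sketch with placeholder file names, assuming the `imatrix` and `quantize` tools as built at this commit):

```sh
# Build the importance matrix from the calibration text.
# llama-2-7b-chat.F16.gguf is a placeholder for the F16-converted model.
./imatrix -m llama-2-7b-chat.F16.gguf -f 20k_random_data.txt -o imatrix.dat

# Quantize to an I-Quant type, feeding in the importance matrix.
./quantize --imatrix imatrix.dat llama-2-7b-chat.F16.gguf llama-2-7b-chat.IQ2_XS.gguf IQ2_XS
```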
The computers used in this measurement are an M1 Max Mac Studio (32-core GPU, 64GB) and a Linux machine with an Intel Core i5-12400F and an NVIDIA 4060 Ti 16GB (CUDA 12.3), running Ubuntu 22.04.
Perplexity was measured using the train split of wikitext-2-raw-v1 from the wikitext dataset.
Perplexity
I measured it with the `perplexity` tool included in llama.cpp.
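Each quantized model was evaluated roughly like this (a sketch; the model and data file names are placeholders):

```sh
# Compute perplexity over the wikitext-2 raw train text.
# wiki.train.raw is a placeholder path to the extracted dataset file.
./perplexity -m llama-2-7b-chat.Q4_0.gguf -f wiki.train.raw
```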
quant type | ppl |
---|---|
F32 | 7.4924 +/- 0.05038 |
F16 | 7.4924 +/- 0.05038 |
Q8_0 | 7.4933 +/- 0.05040 |
Q6_K | 7.4950 +/- 0.05042 |
Q5_1 | 7.5084 +/- 0.05049 |
Q5_K_M | 7.5099 +/- 0.05051 |
Q5_K_S | 7.5180 +/- 0.05059 |
IQ4_XS | 7.5231 +/- 0.05021 |
IQ4_NL | 7.5392 +/- 0.05044 |
Q4_K_M | 7.5692 +/- 0.05087 |
Q4_1 | 7.5913 +/- 0.05104 |
Q4_K_S | 7.6066 +/- 0.05119 |
Q4_0 | 7.6261 +/- 0.05130 |
Q3_K_L | 7.6491 +/- 0.05110 |
Q3_K_M | 7.6854 +/- 0.05128 |
IQ3_M | 7.7695 +/- 0.05262 |
IQ3_S | 7.7904 +/- 0.05252 |
IQ3_XS | 7.8787 +/- 0.05295 |
Q3_K_S | 8.0321 +/- 0.05409 |
IQ3_XXS | 8.2039 +/- 0.05497 |
IQ2_M | 8.6002 +/- 0.05749 |
Q2_K | 8.6501 +/- 0.05852 |
IQ2_S | 9.1459 +/- 0.06077 |
Q2_K_S | 9.1756 +/- 0.06047 |
IQ2_XS | 9.7873 +/- 0.06424 |
IQ2_XXS | 11.0326 +/- 0.07234 |
IQ1_S | 28.7926 +/- 0.19637 |
llama-bench on M1 Max 32 GPU
I didn’t buy my M1 Max Mac Studio to run LLMs, but since it lets me use up to 48GB as VRAM, I prefer it for this. The table is sorted by size in descending order.
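All rows below come from invocations along these lines (a sketch; the model file name is a placeholder, and pp 512 / tg 128 are llama-bench’s default tests):

```sh
# -ngl 99 offloads all layers to the GPU; pp 512 and tg 128 are the defaults.
./llama-bench -m llama-2-7b-chat.Q4_0.gguf -ngl 99
```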
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 7B all F32 | 25.10 GiB | 6.74 B | Metal | 99 | pp 512 | 549.31 ± 1.32 |
llama 7B all F32 | 25.10 GiB | 6.74 B | Metal | 99 | tg 128 | 12.60 ± 0.08 |
llama 7B F16 | 12.55 GiB | 6.74 B | Metal | 99 | pp 512 | 603.07 ± 4.25 |
llama 7B F16 | 12.55 GiB | 6.74 B | Metal | 99 | tg 128 | 23.18 ± 0.10 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | Metal | 99 | pp 512 | 547.28 ± 0.46 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | Metal | 99 | tg 128 | 40.38 ± 0.01 |
llama 7B Q6_K | 5.15 GiB | 6.74 B | Metal | 99 | pp 512 | 444.28 ± 2.46 |
llama 7B Q6_K | 5.15 GiB | 6.74 B | Metal | 99 | tg 128 | 39.03 ± 0.20 |
llama 7B Q5_1 | 4.72 GiB | 6.74 B | Metal | 99 | pp 512 | 455.69 ± 3.64 |
llama 7B Q5_1 | 4.72 GiB | 6.74 B | Metal | 99 | tg 128 | 41.92 ± 0.18 |
llama 7B Q5_K - Medium | 4.45 GiB | 6.74 B | Metal | 99 | pp 512 | 428.63 ± 0.80 |
llama 7B Q5_K - Medium | 4.45 GiB | 6.74 B | Metal | 99 | tg 128 | 39.67 ± 0.05 |
llama 7B Q5_K - Small | 4.33 GiB | 6.74 B | Metal | 99 | pp 512 | 422.92 ± 1.60 |
llama 7B Q5_K - Small | 4.33 GiB | 6.74 B | Metal | 99 | tg 128 | 40.34 ± 0.25 |
llama 7B Q4_1 | 3.95 GiB | 6.74 B | Metal | 99 | pp 512 | 534.08 ± 1.73 |
llama 7B Q4_1 | 3.95 GiB | 6.74 B | Metal | 99 | tg 128 | 57.19 ± 0.51 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 99 | pp 512 | 469.91 ± 2.67 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 99 | tg 128 | 49.24 ± 0.30 |
llama 7B Q4_K - Small | 3.59 GiB | 6.74 B | Metal | 99 | pp 512 | 474.25 ± 0.59 |
llama 7B Q4_K - Small | 3.59 GiB | 6.74 B | Metal | 99 | tg 128 | 52.02 ± 0.50 |
llama 7B Q4_0 | 3.57 GiB | 6.74 B | Metal | 99 | pp 512 | 536.36 ± 2.01 |
llama 7B Q4_0 | 3.57 GiB | 6.74 B | Metal | 99 | tg 128 | 61.39 ± 0.59 |
llama 7B IQ4_NL - 4.5 bpw | 3.56 GiB | 6.74 B | Metal | 99 | pp 512 | 520.62 ± 0.52 |
llama 7B IQ4_NL - 4.5 bpw | 3.56 GiB | 6.74 B | Metal | 99 | tg 128 | 53.15 ± 0.05 |
llama 7B IQ4_XS - 4.25 bpw | 3.37 GiB | 6.74 B | Metal | 99 | pp 512 | 491.69 ± 0.44 |
llama 7B IQ4_XS - 4.25 bpw | 3.37 GiB | 6.74 B | Metal | 99 | tg 128 | 51.86 ± 0.06 |
llama 7B Q3_K - Large | 3.35 GiB | 6.74 B | Metal | 99 | pp 512 | 449.53 ± 0.64 |
llama 7B Q3_K - Large | 3.35 GiB | 6.74 B | Metal | 99 | tg 128 | 40.93 ± 0.02 |
llama 7B Q3_K - Medium | 3.07 GiB | 6.74 B | Metal | 99 | pp 512 | 469.64 ± 0.61 |
llama 7B Q3_K - Medium | 3.07 GiB | 6.74 B | Metal | 99 | tg 128 | 44.19 ± 0.01 |
llama 7B IQ3_S mix - 3.66 bpw | 2.90 GiB | 6.74 B | Metal | 99 | pp 512 | 459.51 ± 0.40 |
llama 7B IQ3_S mix - 3.66 bpw | 2.90 GiB | 6.74 B | Metal | 99 | tg 128 | 47.68 ± 0.04 |
llama 7B IQ3_S - 3.4375 bpw | 2.75 GiB | 6.74 B | Metal | 99 | pp 512 | 456.12 ± 0.32 |
llama 7B IQ3_S - 3.4375 bpw | 2.75 GiB | 6.74 B | Metal | 99 | tg 128 | 47.74 ± 0.04 |
llama 7B Q3_K - Small | 2.75 GiB | 6.74 B | Metal | 99 | pp 512 | 466.80 ± 0.27 |
llama 7B Q3_K - Small | 2.75 GiB | 6.74 B | Metal | 99 | tg 128 | 42.60 ± 0.05 |
llama 7B IQ3_XS - 3.3 bpw | 2.60 GiB | 6.74 B | Metal | 99 | pp 512 | 464.88 ± 0.42 |
llama 7B IQ3_XS - 3.3 bpw | 2.60 GiB | 6.74 B | Metal | 99 | tg 128 | 47.87 ± 0.05 |
llama 7B IQ3_XXS - 3.0625 bpw | 2.41 GiB | 6.74 B | Metal | 99 | pp 512 | 470.03 ± 0.38 |
llama 7B IQ3_XXS - 3.0625 bpw | 2.41 GiB | 6.74 B | Metal | 99 | tg 128 | 46.44 ± 0.07 |
llama 7B Q2_K - Medium | 2.36 GiB | 6.74 B | Metal | 99 | pp 512 | 493.38 ± 0.42 |
llama 7B Q2_K - Medium | 2.36 GiB | 6.74 B | Metal | 99 | tg 128 | 51.57 ± 0.11 |
llama 7B IQ2_M - 2.7 bpw | 2.20 GiB | 6.74 B | Metal | 99 | pp 512 | 462.36 ± 0.45 |
llama 7B IQ2_M - 2.7 bpw | 2.20 GiB | 6.74 B | Metal | 99 | tg 128 | 38.38 ± 0.03 |
llama 7B Q2_K - Small | 2.16 GiB | 6.74 B | Metal | 99 | pp 512 | 512.21 ± 0.57 |
llama 7B Q2_K - Small | 2.16 GiB | 6.74 B | Metal | 99 | tg 128 | 62.49 ± 0.20 |
llama 7B IQ2_S - 2.5 bpw | 2.05 GiB | 6.74 B | Metal | 99 | pp 512 | 469.49 ± 1.99 |
llama 7B IQ2_S - 2.5 bpw | 2.05 GiB | 6.74 B | Metal | 99 | tg 128 | 47.32 ± 0.26 |
llama 7B IQ2_XS - 2.3125 bpw | 1.89 GiB | 6.74 B | Metal | 99 | pp 512 | 472.16 ± 0.41 |
llama 7B IQ2_XS - 2.3125 bpw | 1.89 GiB | 6.74 B | Metal | 99 | tg 128 | 48.42 ± 0.03 |
llama 7B IQ2_XXS - 2.0625 bpw | 1.73 GiB | 6.74 B | Metal | 99 | pp 512 | 462.08 ± 0.30 |
llama 7B IQ2_XXS - 2.0625 bpw | 1.73 GiB | 6.74 B | Metal | 99 | tg 128 | 50.45 ± 0.03 |
llama 7B IQ1_S - 1.5625 bpw | 1.42 GiB | 6.74 B | Metal | 99 | pp 512 | 499.09 ± 0.56 |
llama 7B IQ1_S - 1.5625 bpw | 1.42 GiB | 6.74 B | Metal | 99 | tg 128 | 53.01 ± 0.33 |
The table is quite long, so I did some very simple analysis on it.
Best Prompt Processing and Token Generation on M1 Max 32 GPU
I multiplied pp by tg and sorted the results in descending order. Considering pp and tg together, Q4_0 is the best. Because the full list is too long, only the top 10 entries are shown. The F32 result was dropped because practically nobody runs an F32 model with llama.cpp.
quant type | model | size GiB | pp 512 t/s | tg 128 t/s | ppl | pp * tg |
---|---|---|---|---|---|---|
Q4_0 | llama 7B Q4_0 | 3.57 | 536.36 | 61.39 | 7.6261 | 32927.1404 |
Q2_K_S | llama 7B Q2_K - Small | 2.16 | 512.21 | 62.49 | 9.1756 | 32008.0029 |
Q4_1 | llama 7B Q4_1 | 3.95 | 534.08 | 57.19 | 7.5913 | 30544.0352 |
IQ4_NL | llama 7B IQ4_NL - 4.5 bpw | 3.56 | 520.62 | 53.15 | 7.5392 | 27670.953 |
IQ1_S | llama 7B IQ1_S - 1.5625 bpw | 1.42 | 499.09 | 53.01 | 28.7926 | 26456.7609 |
IQ4_XS | llama 7B IQ4_XS - 4.25 bpw | 3.37 | 491.69 | 51.86 | 7.5231 | 25499.0434 |
Q2_K | llama 7B Q2_K - Medium | 2.36 | 493.38 | 51.57 | 8.6501 | 25443.6066 |
Q4_K_S | llama 7B Q4_K - Small | 3.59 | 474.25 | 52.02 | 7.6066 | 24670.485 |
IQ2_XXS | llama 7B IQ2_XXS - 2.0625 bpw | 1.73 | 462.08 | 50.45 | 11.0326 | 23311.936 |
Q4_K_M | llama 7B Q4_K - Medium | 3.8 | 469.91 | 49.24 | 7.5692 | 23138.3684 |
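As a worked check, the Q4_0 score at the top of this list is just the product of its two throughput numbers:

$$ \text{pp} \times \text{tg} = 536.36 \times 61.39 \approx 32927.14 $$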
Best Prompt Processing and Token Generation over Perplexity on M1 Max 32 GPU
The lower the perplexity, the better, so I divided pp * tg by ppl and sorted in descending order. Q4_0 is still the best, but the rest of the list changes. Q2_K_S still ranks high even though its perplexity is bad.
quant type | model | size GiB | pp 512 t/s | tg 128 t/s | ppl | pp * tg / ppl |
---|---|---|---|---|---|---|
Q4_0 | llama 7B Q4_0 | 3.57 | 536.36 | 61.39 | 7.6261 | 4317.69062 |
Q4_1 | llama 7B Q4_1 | 3.95 | 534.08 | 57.19 | 7.5913 | 4023.55791 |
IQ4_NL | llama 7B IQ4_NL - 4.5 bpw | 3.56 | 520.62 | 53.15 | 7.5392 | 3670.27709 |
Q2_K_S | llama 7B Q2_K - Small | 2.16 | 512.21 | 62.49 | 9.1756 | 3488.38255 |
IQ4_XS | llama 7B IQ4_XS - 4.25 bpw | 3.37 | 491.69 | 51.86 | 7.5231 | 3389.433 |
Q4_K_S | llama 7B Q4_K - Small | 3.59 | 474.25 | 52.02 | 7.6066 | 3243.2999 |
Q4_K_M | llama 7B Q4_K - Medium | 3.8 | 469.91 | 49.24 | 7.5692 | 3056.91069 |
Q8_0 | llama 7B Q8_0 | 6.67 | 547.28 | 40.38 | 7.4933 | 2949.19013 |
Q2_K | llama 7B Q2_K - Medium | 2.36 | 493.38 | 51.57 | 8.6501 | 2941.42341 |
IQ3_XS | llama 7B IQ3_XS - 3.3 bpw | 2.6 | 464.88 | 47.87 | 7.8787 | 2824.55298 |
llama-bench on NVIDIA 4060 Ti 16GB
As with the M1 Max, this table is sorted by size in descending order. Because of the lack of VRAM, there is no F32 result.
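The invocation is the same as on the M1 Max; only the build differs. A sketch, assuming the Makefile flag that was current around tag b2296:

```sh
# Build with CUDA support (LLAMA_CUBLAS was the Makefile flag at this commit).
make LLAMA_CUBLAS=1 llama-bench

# Same benchmark command; the CUDA backend is used automatically.
./llama-bench -m llama-2-7b-chat.Q4_0.gguf -ngl 99
```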
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 512 | 3141.27 ± 47.64 |
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | tg 128 | 19.55 ± 0.00 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | pp 512 | 2105.29 ± 1.55 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | tg 128 | 35.52 ± 0.01 |
llama 7B Q6_K | 5.15 GiB | 6.74 B | CUDA | 99 | pp 512 | 2152.08 ± 1.97 |
llama 7B Q6_K | 5.15 GiB | 6.74 B | CUDA | 99 | tg 128 | 44.75 ± 0.02 |
llama 7B Q5_1 | 4.72 GiB | 6.74 B | CUDA | 99 | pp 512 | 2174.41 ± 1.67 |
llama 7B Q5_1 | 4.72 GiB | 6.74 B | CUDA | 99 | tg 128 | 48.58 ± 0.02 |
llama 7B Q5_K - Medium | 4.45 GiB | 6.74 B | CUDA | 99 | pp 512 | 2189.18 ± 2.62 |
llama 7B Q5_K - Medium | 4.45 GiB | 6.74 B | CUDA | 99 | tg 128 | 51.01 ± 0.02 |
llama 7B Q5_K - Small | 4.33 GiB | 6.74 B | CUDA | 99 | pp 512 | 2195.82 ± 2.89 |
llama 7B Q5_K - Small | 4.33 GiB | 6.74 B | CUDA | 99 | tg 128 | 52.30 ± 0.02 |
llama 7B Q4_1 | 3.95 GiB | 6.74 B | CUDA | 99 | pp 512 | 2203.94 ± 2.60 |
llama 7B Q4_1 | 3.95 GiB | 6.74 B | CUDA | 99 | tg 128 | 56.83 ± 0.03 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | pp 512 | 2208.96 ± 2.55 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | tg 128 | 58.49 ± 0.03 |
llama 7B Q4_K - Small | 3.59 GiB | 6.74 B | CUDA | 99 | pp 512 | 2218.12 ± 2.64 |
llama 7B Q4_K - Small | 3.59 GiB | 6.74 B | CUDA | 99 | tg 128 | 61.44 ± 0.03 |
llama 7B Q4_0 | 3.57 GiB | 6.74 B | CUDA | 99 | pp 512 | 2219.31 ± 1.77 |
llama 7B Q4_0 | 3.57 GiB | 6.74 B | CUDA | 99 | tg 128 | 61.80 ± 0.03 |
llama 7B IQ4_NL - 4.5 bpw | 3.56 GiB | 6.74 B | CUDA | 99 | pp 512 | 2215.14 ± 1.76 |
llama 7B IQ4_NL - 4.5 bpw | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 61.70 ± 0.03 |
llama 7B IQ4_XS - 4.25 bpw | 3.37 GiB | 6.74 B | CUDA | 99 | pp 512 | 2220.83 ± 1.47 |
llama 7B IQ4_XS - 4.25 bpw | 3.37 GiB | 6.74 B | CUDA | 99 | tg 128 | 64.79 ± 0.03 |
llama 7B Q3_K - Large | 3.35 GiB | 6.74 B | CUDA | 99 | pp 512 | 2217.48 ± 2.38 |
llama 7B Q3_K - Large | 3.35 GiB | 6.74 B | CUDA | 99 | tg 128 | 64.82 ± 0.03 |
llama 7B Q3_K - Medium | 3.07 GiB | 6.74 B | CUDA | 99 | pp 512 | 2229.58 ± 2.79 |
llama 7B Q3_K - Medium | 3.07 GiB | 6.74 B | CUDA | 99 | tg 128 | 69.80 ± 0.03 |
llama 7B IQ3_S mix - 3.66 bpw | 2.90 GiB | 6.74 B | CUDA | 99 | pp 512 | 2231.57 ± 2.73 |
llama 7B IQ3_S mix - 3.66 bpw | 2.90 GiB | 6.74 B | CUDA | 99 | tg 128 | 73.05 ± 0.05 |
llama 7B IQ3_S - 3.4375 bpw | 2.75 GiB | 6.74 B | CUDA | 99 | pp 512 | 2238.22 ± 2.44 |
llama 7B IQ3_S - 3.4375 bpw | 2.75 GiB | 6.74 B | CUDA | 99 | tg 128 | 76.30 ± 0.04 |
llama 7B Q3_K - Small | 2.75 GiB | 6.74 B | CUDA | 99 | pp 512 | 2241.64 ± 2.73 |
llama 7B Q3_K - Small | 2.75 GiB | 6.74 B | CUDA | 99 | tg 128 | 75.79 ± 0.02 |
llama 7B IQ3_XS - 3.3 bpw | 2.60 GiB | 6.74 B | CUDA | 99 | pp 512 | 2242.87 ± 2.71 |
llama 7B IQ3_XS - 3.3 bpw | 2.60 GiB | 6.74 B | CUDA | 99 | tg 128 | 79.56 ± 0.05 |
llama 7B IQ3_XXS - 3.0625 bpw | 2.41 GiB | 6.74 B | CUDA | 99 | pp 512 | 2251.70 ± 3.08 |
llama 7B IQ3_XXS - 3.0625 bpw | 2.41 GiB | 6.74 B | CUDA | 99 | tg 128 | 84.73 ± 0.04 |
llama 7B Q2_K - Medium | 2.36 GiB | 6.74 B | CUDA | 99 | pp 512 | 2260.54 ± 2.69 |
llama 7B Q2_K - Medium | 2.36 GiB | 6.74 B | CUDA | 99 | tg 128 | 86.31 ± 0.04 |
llama 7B IQ2_M - 2.7 bpw | 2.20 GiB | 6.74 B | CUDA | 99 | pp 512 | 2260.25 ± 2.62 |
llama 7B IQ2_M - 2.7 bpw | 2.20 GiB | 6.74 B | CUDA | 99 | tg 128 | 91.42 ± 0.07 |
llama 7B Q2_K - Small | 2.16 GiB | 6.74 B | CUDA | 99 | pp 512 | 2266.25 ± 1.72 |
llama 7B Q2_K - Small | 2.16 GiB | 6.74 B | CUDA | 99 | tg 128 | 93.02 ± 0.07 |
llama 7B IQ2_S - 2.5 bpw | 2.05 GiB | 6.74 B | CUDA | 99 | pp 512 | 2265.55 ± 2.17 |
llama 7B IQ2_S - 2.5 bpw | 2.05 GiB | 6.74 B | CUDA | 99 | tg 128 | 96.16 ± 0.08 |
llama 7B IQ2_XS - 2.3125 bpw | 1.89 GiB | 6.74 B | CUDA | 99 | pp 512 | 2274.67 ± 1.97 |
llama 7B IQ2_XS - 2.3125 bpw | 1.89 GiB | 6.74 B | CUDA | 99 | tg 128 | 101.77 ± 0.08 |
llama 7B IQ2_XXS - 2.0625 bpw | 1.73 GiB | 6.74 B | CUDA | 99 | pp 512 | 2288.92 ± 2.73 |
llama 7B IQ2_XXS - 2.0625 bpw | 1.73 GiB | 6.74 B | CUDA | 99 | tg 128 | 78.62 ± 0.09 |
llama 7B IQ1_S - 1.5625 bpw | 1.42 GiB | 6.74 B | CUDA | 99 | pp 512 | 2306.16 ± 1.44 |
llama 7B IQ1_S - 1.5625 bpw | 1.42 GiB | 6.74 B | CUDA | 99 | tg 128 | 114.36 ± 0.12 |
Best Prompt Processing and Token Generation on NVIDIA 4060 Ti 16GB
As you can see, the I-Quant types perform much better here than they did on the M1 Max.
quant type | model | size GiB | pp 512 t/s | tg 128 t/s | ppl | pp * tg |
---|---|---|---|---|---|---|
IQ1_S | llama 7B IQ1_S - 1.5625 bpw | 1.42 | 2306.16 | 114.36 | 28.7926 | 263732.458 |
IQ2_XS | llama 7B IQ2_XS - 2.3125 bpw | 1.89 | 2274.67 | 101.77 | 9.7873 | 231493.166 |
IQ2_S | llama 7B IQ2_S - 2.5 bpw | 2.05 | 2265.55 | 96.16 | 9.1459 | 217855.288 |
Q2_K_S | llama 7B Q2_K - Small | 2.16 | 2266.25 | 93.02 | 9.1756 | 210806.575 |
IQ2_M | llama 7B IQ2_M - 2.7 bpw | 2.2 | 2260.25 | 91.42 | 8.6002 | 206632.055 |
Q2_K | llama 7B Q2_K - Medium | 2.36 | 2260.54 | 86.31 | 8.6501 | 195107.207 |
IQ3_XXS | llama 7B IQ3_XXS - 3.0625 bpw | 2.41 | 2251.7 | 84.73 | 8.2039 | 190786.541 |
IQ2_XXS | llama 7B IQ2_XXS - 2.0625 bpw | 1.73 | 2288.92 | 78.62 | 11.0326 | 179954.89 |
IQ3_XS | llama 7B IQ3_XS - 3.3 bpw | 2.6 | 2242.87 | 79.56 | 7.8787 | 178442.737 |
IQ3_S | llama 7B IQ3_S - 3.4375 bpw | 2.75 | 2238.22 | 76.3 | 7.7904 | 170776.186 |
Best Prompt Processing and Token Generation over Perplexity on NVIDIA 4060 Ti 16GB
The NVIDIA top-ten list shows higher perplexities than the M1 Max one does.
quant type | model | size GiB | pp 512 t/s | tg 128 t/s | ppl | pp * tg / ppl |
---|---|---|---|---|---|---|
IQ2_M | llama 7B IQ2_M - 2.7 bpw | 2.2 | 2260.25 | 91.42 | 8.6002 | 24026.42439 |
IQ2_S | llama 7B IQ2_S - 2.5 bpw | 2.05 | 2265.55 | 96.16 | 9.1459 | 23819.99453 |
IQ2_XS | llama 7B IQ2_XS - 2.3125 bpw | 1.89 | 2274.67 | 101.77 | 9.7873 | 23652.40321 |
IQ3_XXS | llama 7B IQ3_XXS - 3.0625 bpw | 2.41 | 2251.7 | 84.73 | 8.2039 | 23255.59076 |
Q2_K_S | llama 7B Q2_K - Small | 2.16 | 2266.25 | 93.02 | 9.1756 | 22974.69103 |
IQ3_XS | llama 7B IQ3_XS - 3.3 bpw | 2.6 | 2242.87 | 79.56 | 7.8787 | 22648.75388 |
Q2_K | llama 7B Q2_K - Medium | 2.36 | 2260.54 | 86.31 | 8.6501 | 22555.48576 |
IQ3_S | llama 7B IQ3_S - 3.4375 bpw | 2.75 | 2238.22 | 76.3 | 7.7904 | 21921.36296 |
Q3_K_S | llama 7B Q3_K - Small | 2.75 | 2241.64 | 75.79 | 8.0321 | 21151.86509 |
IQ3_M | llama 7B IQ3_S mix - 3.66 bpw | 2.9 | 2231.57 | 73.05 | 7.7695 | 20981.5546 |
Best Prompt Processing and Token Generation over Perplexity and Size on NVIDIA 4060 Ti 16GB
16GB of VRAM is not that much, so I also divided pp * tg / ppl by model size.
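For example, IQ2_XS, the top entry below, scores:

$$ \frac{\text{pp} \times \text{tg}}{\text{ppl} \times \text{size}} = \frac{2274.67 \times 101.77}{9.7873 \times 1.89} \approx 12514.50 $$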
quant type | model | size GiB | pp 512 t/s | tg 128 t/s | ppl | pp * tg / ppl / size |
---|---|---|---|---|---|---|
IQ2_XS | llama 7B IQ2_XS - 2.3125 bpw | 1.89 | 2274.67 | 101.77 | 9.7873 | 12514.49905 |
IQ2_S | llama 7B IQ2_S - 2.5 bpw | 2.05 | 2265.55 | 96.16 | 9.1459 | 11619.50953 |
IQ2_M | llama 7B IQ2_M - 2.7 bpw | 2.2 | 2260.25 | 91.42 | 8.6002 | 10921.10199 |
Q2_K_S | llama 7B Q2_K - Small | 2.16 | 2266.25 | 93.02 | 9.1756 | 10636.43103 |
IQ3_XXS | llama 7B IQ3_XXS - 3.0625 bpw | 2.41 | 2251.7 | 84.73 | 8.2039 | 9649.62272 |
Q2_K | llama 7B Q2_K - Medium | 2.36 | 2260.54 | 86.31 | 8.6501 | 9557.409222 |
IQ2_XXS | llama 7B IQ2_XXS - 2.0625 bpw | 1.73 | 2288.92 | 78.62 | 11.0326 | 9428.436439 |
IQ3_XS | llama 7B IQ3_XS - 3.3 bpw | 2.6 | 2242.87 | 79.56 | 7.8787 | 8711.059185 |
IQ3_S | llama 7B IQ3_S - 3.4375 bpw | 2.75 | 2238.22 | 76.3 | 7.7904 | 7971.404713 |
Q3_K_S | llama 7B Q3_K - Small | 2.75 | 2241.64 | 75.79 | 8.0321 | 7691.587306 |
Conclusion
The new I-Quants tend to be slower on the M1 Max but faster on the NVIDIA 4060 Ti 16GB, though their perplexities tend to be higher than those of the other quant types.
On the M1 Max, Q4_0 is the best quant type considering pp, tg, and ppl.
On the 4060 Ti 16GB, IQ1_S is the best quant type considering pp and tg alone, but its perplexity doesn’t look good. Factoring in perplexity, IQ2_M is the best; adding size as well, IQ2_XS is the best.
Though they are not so fast on the M1 Max, the smaller I-Quants enable even 120B models to run within 48GB of VRAM on the M1 Max with an 8k or 16k context length. An IQ2_XS-quantized 120B model is 35.38GB.