Benchmarks for lots of quantization types in llama.cpp

Recently, I noticed that lots of new quantization types have been added to llama.cpp. Out of curiosity, I decided to run some simple tests on every quantization type llama.cpp offers. There are 27 quantization types in total, including F16 and F32.

Measurement Setup

The llama.cpp commit used for this measurement is d5ab2975, also tagged b2296.

The model used for this measurement is meta-llama/Llama-2-7b-chat-hf.

The text used to generate the imatrix calibration data for this measurement is 20k_random_data.txt from “Importance matrix calculations work best on near-random data” (#5006). The imatrix calibration data was used only when quantizing the I-Quant types such as IQ2_XS and IQ4_NL.
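For reference, here is a minimal sketch of how the quantized GGUF files could be produced at that commit. The file paths are placeholders, and the exact flags of convert.py, imatrix, and quantize may differ slightly between llama.cpp versions, so treat this as an outline rather than an exact reproduction of my commands.

```python
import subprocess

# Assumed paths -- adjust to your own checkout and model locations.
HF_MODEL = "models/Llama-2-7b-chat-hf"
F16_GGUF = "models/llama-2-7b-chat.f16.gguf"
IMATRIX = "imatrix-20k_random_data.dat"

# 1) Convert the HF checkpoint to an F16 GGUF.
subprocess.run(["python", "convert.py", HF_MODEL,
                "--outtype", "f16", "--outfile", F16_GGUF], check=True)

# 2) Build the importance matrix from the 20k_random_data.txt calibration text.
subprocess.run(["./imatrix", "-m", F16_GGUF,
                "-f", "20k_random_data.txt", "-o", IMATRIX], check=True)

# 3) Quantize to each target type; the imatrix is passed only for the I-Quant types.
I_QUANTS = {"IQ1_S", "IQ2_XXS", "IQ2_XS", "IQ2_S", "IQ2_M",
            "IQ3_XXS", "IQ3_XS", "IQ3_S", "IQ3_M", "IQ4_XS", "IQ4_NL"}
for qtype in ["Q8_0", "Q6_K", "Q4_0", "IQ4_NL", "IQ2_XS"]:  # ...and the remaining types
    out_gguf = f"models/llama-2-7b-chat.{qtype}.gguf"
    cmd = ["./quantize"]
    if qtype in I_QUANTS:
        cmd += ["--imatrix", IMATRIX]
    cmd += [F16_GGUF, out_gguf, qtype]
    subprocess.run(cmd, check=True)
```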

The computers used for this measurement are an M1 Max Mac Studio (32-core GPU, 64GB) and a Linux machine with an Intel Core i5-12400F and an NVIDIA GeForce RTX 4060 Ti 16GB (CUDA 12.3) running Ubuntu 22.04.

Perplexity is measured on the train split of wikitext-2-raw-v1 from the wikitext dataset.

Perplexity

I measured it with the perplexity tool included in llama.cpp.
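As a minimal sketch of such a run (paths are placeholders, the wikitext train split is assumed to have been exported to a plain-text file, and flags may vary slightly between llama.cpp versions):

```python
import subprocess

MODEL = "models/llama-2-7b-chat.Q4_0.gguf"   # repeat for each quant type
WIKITEXT = "wikitext-2-raw/wiki.train.raw"   # assumed export of wikitext-2-raw-v1/train

# -ngl 99 offloads all layers to the GPU (Metal or CUDA build).
subprocess.run(["./perplexity", "-m", MODEL, "-f", WIKITEXT, "-ngl", "99"], check=True)
```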

| quant type | ppl |
| --- | --- |
| F32 | 7.4924 +/- 0.05038 |
| F16 | 7.4924 +/- 0.05038 |
| Q8_0 | 7.4933 +/- 0.05040 |
| Q6_K | 7.4950 +/- 0.05042 |
| Q5_1 | 7.5084 +/- 0.05049 |
| Q5_K_M | 7.5099 +/- 0.05051 |
| Q5_K_S | 7.5180 +/- 0.05059 |
| IQ4_XS | 7.5231 +/- 0.05021 |
| IQ4_NL | 7.5392 +/- 0.05044 |
| Q4_K_M | 7.5692 +/- 0.05087 |
| Q4_1 | 7.5913 +/- 0.05104 |
| Q4_K_S | 7.6066 +/- 0.05119 |
| Q4_0 | 7.6261 +/- 0.05130 |
| Q3_K_L | 7.6491 +/- 0.05110 |
| Q3_K_M | 7.6854 +/- 0.05128 |
| IQ3_M | 7.7695 +/- 0.05262 |
| IQ3_S | 7.7904 +/- 0.05252 |
| IQ3_XS | 7.8787 +/- 0.05295 |
| Q3_K_S | 8.0321 +/- 0.05409 |
| IQ3_XXS | 8.2039 +/- 0.05497 |
| IQ2_M | 8.6002 +/- 0.05749 |
| Q2_K | 8.6501 +/- 0.05852 |
| IQ2_S | 9.1459 +/- 0.06077 |
| Q2_K_S | 9.1756 +/- 0.06047 |
| IQ2_XS | 9.7873 +/- 0.06424 |
| IQ2_XXS | 11.0326 +/- 0.07234 |
| IQ1_S | 28.7926 +/- 0.19637 |

llama-bench on M1 Max 32 GPU

I didn’t buy my M1 Max Mac Studio to run LLMs, but it lets me use up to 48GB of VRAM, so I prefer it. The table is sorted by model size in descending order.
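For reference, a sketch of the llama-bench invocation behind these numbers (the model path is a placeholder); llama-bench measures pp 512 and tg 128 by default and prints rows like the ones below.

```python
import subprocess

# -ngl 99 matches the ngl column: all layers offloaded to the GPU.
subprocess.run(["./llama-bench", "-m", "models/llama-2-7b-chat.Q4_0.gguf",
                "-ngl", "99"], check=True)
```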

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B all F32 | 25.10 GiB | 6.74 B | Metal | 99 | pp 512 | 549.31 ± 1.32 |
| llama 7B all F32 | 25.10 GiB | 6.74 B | Metal | 99 | tg 128 | 12.60 ± 0.08 |
| llama 7B F16 | 12.55 GiB | 6.74 B | Metal | 99 | pp 512 | 603.07 ± 4.25 |
| llama 7B F16 | 12.55 GiB | 6.74 B | Metal | 99 | tg 128 | 23.18 ± 0.10 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | Metal | 99 | pp 512 | 547.28 ± 0.46 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | Metal | 99 | tg 128 | 40.38 ± 0.01 |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | Metal | 99 | pp 512 | 444.28 ± 2.46 |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | Metal | 99 | tg 128 | 39.03 ± 0.20 |
| llama 7B Q5_1 | 4.72 GiB | 6.74 B | Metal | 99 | pp 512 | 455.69 ± 3.64 |
| llama 7B Q5_1 | 4.72 GiB | 6.74 B | Metal | 99 | tg 128 | 41.92 ± 0.18 |
| llama 7B Q5_K - Medium | 4.45 GiB | 6.74 B | Metal | 99 | pp 512 | 428.63 ± 0.80 |
| llama 7B Q5_K - Medium | 4.45 GiB | 6.74 B | Metal | 99 | tg 128 | 39.67 ± 0.05 |
| llama 7B Q5_K - Small | 4.33 GiB | 6.74 B | Metal | 99 | pp 512 | 422.92 ± 1.60 |
| llama 7B Q5_K - Small | 4.33 GiB | 6.74 B | Metal | 99 | tg 128 | 40.34 ± 0.25 |
| llama 7B Q4_1 | 3.95 GiB | 6.74 B | Metal | 99 | pp 512 | 534.08 ± 1.73 |
| llama 7B Q4_1 | 3.95 GiB | 6.74 B | Metal | 99 | tg 128 | 57.19 ± 0.51 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 99 | pp 512 | 469.91 ± 2.67 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 99 | tg 128 | 49.24 ± 0.30 |
| llama 7B Q4_K - Small | 3.59 GiB | 6.74 B | Metal | 99 | pp 512 | 474.25 ± 0.59 |
| llama 7B Q4_K - Small | 3.59 GiB | 6.74 B | Metal | 99 | tg 128 | 52.02 ± 0.50 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | Metal | 99 | pp 512 | 536.36 ± 2.01 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | Metal | 99 | tg 128 | 61.39 ± 0.59 |
| llama 7B IQ4_NL - 4.5 bpw | 3.56 GiB | 6.74 B | Metal | 99 | pp 512 | 520.62 ± 0.52 |
| llama 7B IQ4_NL - 4.5 bpw | 3.56 GiB | 6.74 B | Metal | 99 | tg 128 | 53.15 ± 0.05 |
| llama 7B IQ4_XS - 4.25 bpw | 3.37 GiB | 6.74 B | Metal | 99 | pp 512 | 491.69 ± 0.44 |
| llama 7B IQ4_XS - 4.25 bpw | 3.37 GiB | 6.74 B | Metal | 99 | tg 128 | 51.86 ± 0.06 |
| llama 7B Q3_K - Large | 3.35 GiB | 6.74 B | Metal | 99 | pp 512 | 449.53 ± 0.64 |
| llama 7B Q3_K - Large | 3.35 GiB | 6.74 B | Metal | 99 | tg 128 | 40.93 ± 0.02 |
| llama 7B Q3_K - Medium | 3.07 GiB | 6.74 B | Metal | 99 | pp 512 | 469.64 ± 0.61 |
| llama 7B Q3_K - Medium | 3.07 GiB | 6.74 B | Metal | 99 | tg 128 | 44.19 ± 0.01 |
| llama 7B IQ3_S mix - 3.66 bpw | 2.90 GiB | 6.74 B | Metal | 99 | pp 512 | 459.51 ± 0.40 |
| llama 7B IQ3_S mix - 3.66 bpw | 2.90 GiB | 6.74 B | Metal | 99 | tg 128 | 47.68 ± 0.04 |
| llama 7B IQ3_S - 3.4375 bpw | 2.75 GiB | 6.74 B | Metal | 99 | pp 512 | 456.12 ± 0.32 |
| llama 7B IQ3_S - 3.4375 bpw | 2.75 GiB | 6.74 B | Metal | 99 | tg 128 | 47.74 ± 0.04 |
| llama 7B Q3_K - Small | 2.75 GiB | 6.74 B | Metal | 99 | pp 512 | 466.80 ± 0.27 |
| llama 7B Q3_K - Small | 2.75 GiB | 6.74 B | Metal | 99 | tg 128 | 42.60 ± 0.05 |
| llama 7B IQ3_XS - 3.3 bpw | 2.60 GiB | 6.74 B | Metal | 99 | pp 512 | 464.88 ± 0.42 |
| llama 7B IQ3_XS - 3.3 bpw | 2.60 GiB | 6.74 B | Metal | 99 | tg 128 | 47.87 ± 0.05 |
| llama 7B IQ3_XXS - 3.0625 bpw | 2.41 GiB | 6.74 B | Metal | 99 | pp 512 | 470.03 ± 0.38 |
| llama 7B IQ3_XXS - 3.0625 bpw | 2.41 GiB | 6.74 B | Metal | 99 | tg 128 | 46.44 ± 0.07 |
| llama 7B Q2_K - Medium | 2.36 GiB | 6.74 B | Metal | 99 | pp 512 | 493.38 ± 0.42 |
| llama 7B Q2_K - Medium | 2.36 GiB | 6.74 B | Metal | 99 | tg 128 | 51.57 ± 0.11 |
| llama 7B IQ2_M - 2.7 bpw | 2.20 GiB | 6.74 B | Metal | 99 | pp 512 | 462.36 ± 0.45 |
| llama 7B IQ2_M - 2.7 bpw | 2.20 GiB | 6.74 B | Metal | 99 | tg 128 | 38.38 ± 0.03 |
| llama 7B Q2_K - Small | 2.16 GiB | 6.74 B | Metal | 99 | pp 512 | 512.21 ± 0.57 |
| llama 7B Q2_K - Small | 2.16 GiB | 6.74 B | Metal | 99 | tg 128 | 62.49 ± 0.20 |
| llama 7B IQ2_S - 2.5 bpw | 2.05 GiB | 6.74 B | Metal | 99 | pp 512 | 469.49 ± 1.99 |
| llama 7B IQ2_S - 2.5 bpw | 2.05 GiB | 6.74 B | Metal | 99 | tg 128 | 47.32 ± 0.26 |
| llama 7B IQ2_XS - 2.3125 bpw | 1.89 GiB | 6.74 B | Metal | 99 | pp 512 | 472.16 ± 0.41 |
| llama 7B IQ2_XS - 2.3125 bpw | 1.89 GiB | 6.74 B | Metal | 99 | tg 128 | 48.42 ± 0.03 |
| llama 7B IQ2_XXS - 2.0625 bpw | 1.73 GiB | 6.74 B | Metal | 99 | pp 512 | 462.08 ± 0.30 |
| llama 7B IQ2_XXS - 2.0625 bpw | 1.73 GiB | 6.74 B | Metal | 99 | tg 128 | 50.45 ± 0.03 |
| llama 7B IQ1_S - 1.5625 bpw | 1.42 GiB | 6.74 B | Metal | 99 | pp 512 | 499.09 ± 0.56 |
| llama 7B IQ1_S - 1.5625 bpw | 1.42 GiB | 6.74 B | Metal | 99 | tg 128 | 53.01 ± 0.33 |

It’s quite long, so I did some very simple analysis on it.

Best Prompt Processing and Token Generation on M1 Max 32 GPU

I multiplied pp by tg and sorted the results in descending order. Considering pp and tg together, Q4_0 is the best. Because the full list is long, only the top 10 entries are shown. The F32 result was dropped because practically nobody runs an F32 model with llama.cpp.
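As a small sketch of this computation (only a few rows are copied from the tables above; the rest are elided), the same expressions also yield the pp * tg / ppl and pp * tg / ppl / size rankings used in the later sections:

```python
# (quant type, size in GiB, pp 512 t/s, tg 128 t/s, ppl), copied from the tables above.
rows = [
    ("Q4_0",   3.57, 536.36, 61.39, 7.6261),
    ("Q2_K_S", 2.16, 512.21, 62.49, 9.1756),
    ("Q4_1",   3.95, 534.08, 57.19, 7.5913),
    # ... remaining quant types elided
]

# Sort by pp * tg in descending order and print the derived metrics.
for name, size, pp, tg, ppl in sorted(rows, key=lambda r: r[2] * r[3], reverse=True):
    print(f"{name}: pp*tg = {pp * tg:.1f}, "
          f"pp*tg/ppl = {pp * tg / ppl:.1f}, "
          f"pp*tg/ppl/size = {pp * tg / ppl / size:.1f}")
```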

| quant type | model | size GiB | pp 512 t/s | tg 128 t/s | ppl | pp * tg |
| --- | --- | --- | --- | --- | --- | --- |
| Q4_0 | llama 7B Q4_0 | 3.57 | 536.36 | 61.39 | 7.6261 | 32927.1404 |
| Q2_K_S | llama 7B Q2_K - Small | 2.16 | 512.21 | 62.49 | 9.1756 | 32008.0029 |
| Q4_1 | llama 7B Q4_1 | 3.95 | 534.08 | 57.19 | 7.5913 | 30544.0352 |
| IQ4_NL | llama 7B IQ4_NL - 4.5 bpw | 3.56 | 520.62 | 53.15 | 7.5392 | 27670.953 |
| IQ1_S | llama 7B IQ1_S - 1.5625 bpw | 1.42 | 499.09 | 53.01 | 28.7926 | 26456.7609 |
| IQ4_XS | llama 7B IQ4_XS - 4.25 bpw | 3.37 | 491.69 | 51.86 | 7.5231 | 25499.0434 |
| Q2_K | llama 7B Q2_K - Medium | 2.36 | 493.38 | 51.57 | 8.6501 | 25443.6066 |
| Q4_K_S | llama 7B Q4_K - Small | 3.59 | 474.25 | 52.02 | 7.6066 | 24670.485 |
| IQ2_XXS | llama 7B IQ2_XXS - 2.0625 bpw | 1.73 | 462.08 | 50.45 | 11.0326 | 23311.936 |
| Q4_K_M | llama 7B Q4_K - Medium | 3.8 | 469.91 | 49.24 | 7.5692 | 23138.3684 |

Best Prompt Processing and Token Generation over Perplexity on M1 Max 32 GPU

The lower the perplexity, the better, so I divided pp * tg by ppl and sorted in descending order. Q4_0 is still the best, but the rest of the list changes. Q2_K_S still ranks high even though its perplexity is poor.

| quant type | model | size GiB | pp 512 t/s | tg 128 t/s | ppl | pp * tg / ppl |
| --- | --- | --- | --- | --- | --- | --- |
| Q4_0 | llama 7B Q4_0 | 3.57 | 536.36 | 61.39 | 7.6261 | 4317.69062 |
| Q4_1 | llama 7B Q4_1 | 3.95 | 534.08 | 57.19 | 7.5913 | 4023.55791 |
| IQ4_NL | llama 7B IQ4_NL - 4.5 bpw | 3.56 | 520.62 | 53.15 | 7.5392 | 3670.27709 |
| Q2_K_S | llama 7B Q2_K - Small | 2.16 | 512.21 | 62.49 | 9.1756 | 3488.38255 |
| IQ4_XS | llama 7B IQ4_XS - 4.25 bpw | 3.37 | 491.69 | 51.86 | 7.5231 | 3389.433 |
| Q4_K_S | llama 7B Q4_K - Small | 3.59 | 474.25 | 52.02 | 7.6066 | 3243.2999 |
| Q4_K_M | llama 7B Q4_K - Medium | 3.8 | 469.91 | 49.24 | 7.5692 | 3056.91069 |
| Q8_0 | llama 7B Q8_0 | 6.67 | 547.28 | 40.38 | 7.4933 | 2949.19013 |
| Q2_K | llama 7B Q2_K - Medium | 2.36 | 493.38 | 51.57 | 8.6501 | 2941.42341 |
| IQ3_XS | llama 7B IQ3_XS - 3.3 bpw | 2.6 | 464.88 | 47.87 | 7.8787 | 2824.55298 |

llama-bench on NVIDIA 4060 Ti 16GB

As with the M1 Max, this table is sorted by size in descending order. Because of the limited VRAM, there is no F32 result.

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 512 | 3141.27 ± 47.64 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | tg 128 | 19.55 ± 0.00 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | pp 512 | 2105.29 ± 1.55 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | tg 128 | 35.52 ± 0.01 |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CUDA | 99 | pp 512 | 2152.08 ± 1.97 |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CUDA | 99 | tg 128 | 44.75 ± 0.02 |
| llama 7B Q5_1 | 4.72 GiB | 6.74 B | CUDA | 99 | pp 512 | 2174.41 ± 1.67 |
| llama 7B Q5_1 | 4.72 GiB | 6.74 B | CUDA | 99 | tg 128 | 48.58 ± 0.02 |
| llama 7B Q5_K - Medium | 4.45 GiB | 6.74 B | CUDA | 99 | pp 512 | 2189.18 ± 2.62 |
| llama 7B Q5_K - Medium | 4.45 GiB | 6.74 B | CUDA | 99 | tg 128 | 51.01 ± 0.02 |
| llama 7B Q5_K - Small | 4.33 GiB | 6.74 B | CUDA | 99 | pp 512 | 2195.82 ± 2.89 |
| llama 7B Q5_K - Small | 4.33 GiB | 6.74 B | CUDA | 99 | tg 128 | 52.30 ± 0.02 |
| llama 7B Q4_1 | 3.95 GiB | 6.74 B | CUDA | 99 | pp 512 | 2203.94 ± 2.60 |
| llama 7B Q4_1 | 3.95 GiB | 6.74 B | CUDA | 99 | tg 128 | 56.83 ± 0.03 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | pp 512 | 2208.96 ± 2.55 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | tg 128 | 58.49 ± 0.03 |
| llama 7B Q4_K - Small | 3.59 GiB | 6.74 B | CUDA | 99 | pp 512 | 2218.12 ± 2.64 |
| llama 7B Q4_K - Small | 3.59 GiB | 6.74 B | CUDA | 99 | tg 128 | 61.44 ± 0.03 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | CUDA | 99 | pp 512 | 2219.31 ± 1.77 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | CUDA | 99 | tg 128 | 61.80 ± 0.03 |
| llama 7B IQ4_NL - 4.5 bpw | 3.56 GiB | 6.74 B | CUDA | 99 | pp 512 | 2215.14 ± 1.76 |
| llama 7B IQ4_NL - 4.5 bpw | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 61.70 ± 0.03 |
| llama 7B IQ4_XS - 4.25 bpw | 3.37 GiB | 6.74 B | CUDA | 99 | pp 512 | 2220.83 ± 1.47 |
| llama 7B IQ4_XS - 4.25 bpw | 3.37 GiB | 6.74 B | CUDA | 99 | tg 128 | 64.79 ± 0.03 |
| llama 7B Q3_K - Large | 3.35 GiB | 6.74 B | CUDA | 99 | pp 512 | 2217.48 ± 2.38 |
| llama 7B Q3_K - Large | 3.35 GiB | 6.74 B | CUDA | 99 | tg 128 | 64.82 ± 0.03 |
| llama 7B Q3_K - Medium | 3.07 GiB | 6.74 B | CUDA | 99 | pp 512 | 2229.58 ± 2.79 |
| llama 7B Q3_K - Medium | 3.07 GiB | 6.74 B | CUDA | 99 | tg 128 | 69.80 ± 0.03 |
| llama 7B IQ3_S mix - 3.66 bpw | 2.90 GiB | 6.74 B | CUDA | 99 | pp 512 | 2231.57 ± 2.73 |
| llama 7B IQ3_S mix - 3.66 bpw | 2.90 GiB | 6.74 B | CUDA | 99 | tg 128 | 73.05 ± 0.05 |
| llama 7B IQ3_S - 3.4375 bpw | 2.75 GiB | 6.74 B | CUDA | 99 | pp 512 | 2238.22 ± 2.44 |
| llama 7B IQ3_S - 3.4375 bpw | 2.75 GiB | 6.74 B | CUDA | 99 | tg 128 | 76.30 ± 0.04 |
| llama 7B Q3_K - Small | 2.75 GiB | 6.74 B | CUDA | 99 | pp 512 | 2241.64 ± 2.73 |
| llama 7B Q3_K - Small | 2.75 GiB | 6.74 B | CUDA | 99 | tg 128 | 75.79 ± 0.02 |
| llama 7B IQ3_XS - 3.3 bpw | 2.60 GiB | 6.74 B | CUDA | 99 | pp 512 | 2242.87 ± 2.71 |
| llama 7B IQ3_XS - 3.3 bpw | 2.60 GiB | 6.74 B | CUDA | 99 | tg 128 | 79.56 ± 0.05 |
| llama 7B IQ3_XXS - 3.0625 bpw | 2.41 GiB | 6.74 B | CUDA | 99 | pp 512 | 2251.70 ± 3.08 |
| llama 7B IQ3_XXS - 3.0625 bpw | 2.41 GiB | 6.74 B | CUDA | 99 | tg 128 | 84.73 ± 0.04 |
| llama 7B Q2_K - Medium | 2.36 GiB | 6.74 B | CUDA | 99 | pp 512 | 2260.54 ± 2.69 |
| llama 7B Q2_K - Medium | 2.36 GiB | 6.74 B | CUDA | 99 | tg 128 | 86.31 ± 0.04 |
| llama 7B IQ2_M - 2.7 bpw | 2.20 GiB | 6.74 B | CUDA | 99 | pp 512 | 2260.25 ± 2.62 |
| llama 7B IQ2_M - 2.7 bpw | 2.20 GiB | 6.74 B | CUDA | 99 | tg 128 | 91.42 ± 0.07 |
| llama 7B Q2_K - Small | 2.16 GiB | 6.74 B | CUDA | 99 | pp 512 | 2266.25 ± 1.72 |
| llama 7B Q2_K - Small | 2.16 GiB | 6.74 B | CUDA | 99 | tg 128 | 93.02 ± 0.07 |
| llama 7B IQ2_S - 2.5 bpw | 2.05 GiB | 6.74 B | CUDA | 99 | pp 512 | 2265.55 ± 2.17 |
| llama 7B IQ2_S - 2.5 bpw | 2.05 GiB | 6.74 B | CUDA | 99 | tg 128 | 96.16 ± 0.08 |
| llama 7B IQ2_XS - 2.3125 bpw | 1.89 GiB | 6.74 B | CUDA | 99 | pp 512 | 2274.67 ± 1.97 |
| llama 7B IQ2_XS - 2.3125 bpw | 1.89 GiB | 6.74 B | CUDA | 99 | tg 128 | 101.77 ± 0.08 |
| llama 7B IQ2_XXS - 2.0625 bpw | 1.73 GiB | 6.74 B | CUDA | 99 | pp 512 | 2288.92 ± 2.73 |
| llama 7B IQ2_XXS - 2.0625 bpw | 1.73 GiB | 6.74 B | CUDA | 99 | tg 128 | 78.62 ± 0.09 |
| llama 7B IQ1_S - 1.5625 bpw | 1.42 GiB | 6.74 B | CUDA | 99 | pp 512 | 2306.16 ± 1.44 |
| llama 7B IQ1_S - 1.5625 bpw | 1.42 GiB | 6.74 B | CUDA | 99 | tg 128 | 114.36 ± 0.12 |

Best Prompt Processing and Token Generation on NVIDIA 4060 Ti 16GB

As you can see, the I-Quant types are much faster here than on the M1 Max and dominate the top of the list.

| quant type | model | size GiB | pp 512 t/s | tg 128 t/s | ppl | pp * tg |
| --- | --- | --- | --- | --- | --- | --- |
| IQ1_S | llama 7B IQ1_S - 1.5625 bpw | 1.42 | 2306.16 | 114.36 | 28.7926 | 263732.458 |
| IQ2_XS | llama 7B IQ2_XS - 2.3125 bpw | 1.89 | 2274.67 | 101.77 | 9.7873 | 231493.166 |
| IQ2_S | llama 7B IQ2_S - 2.5 bpw | 2.05 | 2265.55 | 96.16 | 9.1459 | 217855.288 |
| Q2_K_S | llama 7B Q2_K - Small | 2.16 | 2266.25 | 93.02 | 9.1756 | 210806.575 |
| IQ2_M | llama 7B IQ2_M - 2.7 bpw | 2.2 | 2260.25 | 91.42 | 8.6002 | 206632.055 |
| Q2_K | llama 7B Q2_K - Medium | 2.36 | 2260.54 | 86.31 | 8.6501 | 195107.207 |
| IQ3_XXS | llama 7B IQ3_XXS - 3.0625 bpw | 2.41 | 2251.7 | 84.73 | 8.2039 | 190786.541 |
| IQ2_XXS | llama 7B IQ2_XXS - 2.0625 bpw | 1.73 | 2288.92 | 78.62 | 11.0326 | 179954.89 |
| IQ3_XS | llama 7B IQ3_XS - 3.3 bpw | 2.6 | 2242.87 | 79.56 | 7.8787 | 178442.737 |
| IQ3_S | llama 7B IQ3_S - 3.4375 bpw | 2.75 | 2238.22 | 76.3 | 7.7904 | 170776.186 |

Best Prompt Processing and Token Generation over Perplexity on NVIDIA 4060 Ti 16GB

The quants in the 4060 Ti's top-ten list show higher perplexities than those in the M1 Max list.

| quant type | model | size GiB | pp 512 t/s | tg 128 t/s | ppl | pp * tg / ppl |
| --- | --- | --- | --- | --- | --- | --- |
| IQ2_M | llama 7B IQ2_M - 2.7 bpw | 2.2 | 2260.25 | 91.42 | 8.6002 | 24026.42439 |
| IQ2_S | llama 7B IQ2_S - 2.5 bpw | 2.05 | 2265.55 | 96.16 | 9.1459 | 23819.99453 |
| IQ2_XS | llama 7B IQ2_XS - 2.3125 bpw | 1.89 | 2274.67 | 101.77 | 9.7873 | 23652.40321 |
| IQ3_XXS | llama 7B IQ3_XXS - 3.0625 bpw | 2.41 | 2251.7 | 84.73 | 8.2039 | 23255.59076 |
| Q2_K_S | llama 7B Q2_K - Small | 2.16 | 2266.25 | 93.02 | 9.1756 | 22974.69103 |
| IQ3_XS | llama 7B IQ3_XS - 3.3 bpw | 2.6 | 2242.87 | 79.56 | 7.8787 | 22648.75388 |
| Q2_K | llama 7B Q2_K - Medium | 2.36 | 2260.54 | 86.31 | 8.6501 | 22555.48576 |
| IQ3_S | llama 7B IQ3_S - 3.4375 bpw | 2.75 | 2238.22 | 76.3 | 7.7904 | 21921.36296 |
| Q3_K_S | llama 7B Q3_K - Small | 2.75 | 2241.64 | 75.79 | 8.0321 | 21151.86509 |
| IQ3_M | llama 7B IQ3_S mix - 3.66 bpw | 2.9 | 2231.57 | 73.05 | 7.7695 | 20981.5546 |

Best Prompt Processing and Token Generation over Perplexity and Size on NVIDIA 4060 Ti 16GB

16GB of VRAM is not that big, so I also divided pp * tg / ppl by the model size.

| quant type | model | size GiB | pp 512 t/s | tg 128 t/s | ppl | pp * tg / ppl / size |
| --- | --- | --- | --- | --- | --- | --- |
| IQ2_XS | llama 7B IQ2_XS - 2.3125 bpw | 1.89 | 2274.67 | 101.77 | 9.7873 | 12514.49905 |
| IQ2_S | llama 7B IQ2_S - 2.5 bpw | 2.05 | 2265.55 | 96.16 | 9.1459 | 11619.50953 |
| IQ2_M | llama 7B IQ2_M - 2.7 bpw | 2.2 | 2260.25 | 91.42 | 8.6002 | 10921.10199 |
| Q2_K_S | llama 7B Q2_K - Small | 2.16 | 2266.25 | 93.02 | 9.1756 | 10636.43103 |
| IQ3_XXS | llama 7B IQ3_XXS - 3.0625 bpw | 2.41 | 2251.7 | 84.73 | 8.2039 | 9649.62272 |
| Q2_K | llama 7B Q2_K - Medium | 2.36 | 2260.54 | 86.31 | 8.6501 | 9557.409222 |
| IQ2_XXS | llama 7B IQ2_XXS - 2.0625 bpw | 1.73 | 2288.92 | 78.62 | 11.0326 | 9428.436439 |
| IQ3_XS | llama 7B IQ3_XS - 3.3 bpw | 2.6 | 2242.87 | 79.56 | 7.8787 | 8711.059185 |
| IQ3_S | llama 7B IQ3_S - 3.4375 bpw | 2.75 | 2238.22 | 76.3 | 7.7904 | 7971.404713 |
| Q3_K_S | llama 7B Q3_K - Small | 2.75 | 2241.64 | 75.79 | 8.0321 | 7691.587306 |

Conclusion

The new I-Quants tend to be slower on the M1 Max but faster on the NVIDIA 4060 Ti 16GB, though their perplexities tend to be higher than those of the other quants.

On the M1 Max, Q4_0 is the best quant type considering pp, tg, and ppl.

On the 4060 Ti 16GB, IQ1_S is the best quant type considering only pp and tg, but its perplexity doesn't look good. Factoring in perplexity, IQ2_M is the best; factoring in size as well, IQ2_XS is the best.

Though they are not so fast on the M1 Max, the smaller I-Quants enable even 120B models to run within 48GB of VRAM on the M1 Max with an 8k or 16k context length. An IQ2_XS-quantized 120B model is about 35.38GB, which works out to roughly 2.4 bits per weight (35.38 GB × 8 bits / ~120 G params), in line with IQ2_XS's nominal 2.3125 bpw plus some overhead.

Author: beebopkim

Posted on: 2024-03-09

Updated on: 2024-03-10
