Benchmarks for lots of quantization types in llama.cpp

Recently, I noticed that lots of new quantization types have been added to llama.cpp. Out of curiosity, I decided to run some simple tests on every quantization type llama.cpp offers. There are 27 quantization types in total, including F16 and F32.

Measurement Setup

The llama.cpp commit used for this measurement is d5ab2975, also tagged b2296.

The model used for this measurement is meta-llama/Llama-2-7b-chat-hf.

The text used to generate the imatrix calibration data for this measurement is 20k_random_data.txt from “Importance matrix calculations work best on near-random data” (#5006). The imatrix calibration data was used only when quantizing the I-Quant types such as IQ2_XS and IQ4_NL.
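For reference, here is a minimal sketch of how the quantized GGUF files could be produced at that commit. The file paths are placeholders, and the exact flags of convert.py, imatrix, and quantize may differ slightly between llama.cpp versions, so treat this as an outline rather than an exact reproduction of my commands.

```python
import subprocess

# Assumed paths -- adjust to your own checkout and model locations.
HF_MODEL = "models/Llama-2-7b-chat-hf"
F16_GGUF = "models/llama-2-7b-chat.f16.gguf"
IMATRIX = "imatrix-20k_random_data.dat"

# 1) Convert the HF checkpoint to an F16 GGUF.
subprocess.run(["python", "convert.py", HF_MODEL,
                "--outtype", "f16", "--outfile", F16_GGUF], check=True)

# 2) Build the importance matrix from the 20k_random_data.txt calibration text.
subprocess.run(["./imatrix", "-m", F16_GGUF,
                "-f", "20k_random_data.txt", "-o", IMATRIX], check=True)

# 3) Quantize to each target type; the imatrix is passed only for the I-Quant types.
I_QUANTS = {"IQ1_S", "IQ2_XXS", "IQ2_XS", "IQ2_S", "IQ2_M",
            "IQ3_XXS", "IQ3_XS", "IQ3_S", "IQ3_M", "IQ4_XS", "IQ4_NL"}
for qtype in ["Q8_0", "Q6_K", "Q4_0", "IQ4_NL", "IQ2_XS"]:  # ...and the remaining types
    out_gguf = f"models/llama-2-7b-chat.{qtype}.gguf"
    cmd = ["./quantize"]
    if qtype in I_QUANTS:
        cmd += ["--imatrix", IMATRIX]
    cmd += [F16_GGUF, out_gguf, qtype]
    subprocess.run(cmd, check=True)
```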

The computers used for this measurement are an M1 Max Mac Studio (32-core GPU, 64GB) and a Linux machine with an Intel Core i5-12400F and an NVIDIA GeForce RTX 4060 Ti 16GB (CUDA 12.3) running Ubuntu 22.04.

Perplexity is measured on the train split of wikitext-2-raw-v1 from the wikitext dataset.

Perplexity

I measured it with the perplexity tool included in llama.cpp.
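As a minimal sketch of such a run (paths are placeholders, the wikitext train split is assumed to have been exported to a plain-text file, and flags may vary slightly between llama.cpp versions):

```python
import subprocess

MODEL = "models/llama-2-7b-chat.Q4_0.gguf"   # repeat for each quant type
WIKITEXT = "wikitext-2-raw/wiki.train.raw"   # assumed export of wikitext-2-raw-v1/train

# -ngl 99 offloads all layers to the GPU (Metal or CUDA build).
subprocess.run(["./perplexity", "-m", MODEL, "-f", WIKITEXT, "-ngl", "99"], check=True)
```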

| quant type | ppl |
| --- | --- |
| F32 | 7.4924 +/- 0.05038 |
| F16 | 7.4924 +/- 0.05038 |
| Q8_0 | 7.4933 +/- 0.05040 |
| Q6_K | 7.4950 +/- 0.05042 |
| Q5_1 | 7.5084 +/- 0.05049 |
| Q5_K_M | 7.5099 +/- 0.05051 |
| Q5_K_S | 7.5180 +/- 0.05059 |
| IQ4_XS | 7.5231 +/- 0.05021 |
| IQ4_NL | 7.5392 +/- 0.05044 |
| Q4_K_M | 7.5692 +/- 0.05087 |
| Q4_1 | 7.5913 +/- 0.05104 |
| Q4_K_S | 7.6066 +/- 0.05119 |
| Q4_0 | 7.6261 +/- 0.05130 |
| Q3_K_L | 7.6491 +/- 0.05110 |
| Q3_K_M | 7.6854 +/- 0.05128 |
| IQ3_M | 7.7695 +/- 0.05262 |
| IQ3_S | 7.7904 +/- 0.05252 |
| IQ3_XS | 7.8787 +/- 0.05295 |
| Q3_K_S | 8.0321 +/- 0.05409 |
| IQ3_XXS | 8.2039 +/- 0.05497 |
| IQ2_M | 8.6002 +/- 0.05749 |
| Q2_K | 8.6501 +/- 0.05852 |
| IQ2_S | 9.1459 +/- 0.06077 |
| Q2_K_S | 9.1756 +/- 0.06047 |
| IQ2_XS | 9.7873 +/- 0.06424 |
| IQ2_XXS | 11.0326 +/- 0.07234 |
| IQ1_S | 28.7926 +/- 0.19637 |

llama-bench on M1 Max 32 GPU

I didn’t buy my M1 Max Mac Studio to run LLMs, but it lets me use up to 48GB of VRAM, so I prefer it. The table is sorted by model size in descending order.
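For reference, a sketch of the llama-bench invocation behind these numbers (the model path is a placeholder); llama-bench measures pp 512 and tg 128 by default and prints rows like the ones below.

```python
import subprocess

# -ngl 99 matches the ngl column: all layers offloaded to the GPU.
subprocess.run(["./llama-bench", "-m", "models/llama-2-7b-chat.Q4_0.gguf",
                "-ngl", "99"], check=True)
```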

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B all F32 | 25.10 GiB | 6.74 B | Metal | 99 | pp 512 | 549.31 ± 1.32 |
| llama 7B all F32 | 25.10 GiB | 6.74 B | Metal | 99 | tg 128 | 12.60 ± 0.08 |
| llama 7B F16 | 12.55 GiB | 6.74 B | Metal | 99 | pp 512 | 603.07 ± 4.25 |
| llama 7B F16 | 12.55 GiB | 6.74 B | Metal | 99 | tg 128 | 23.18 ± 0.10 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | Metal | 99 | pp 512 | 547.28 ± 0.46 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | Metal | 99 | tg 128 | 40.38 ± 0.01 |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | Metal | 99 | pp 512 | 444.28 ± 2.46 |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | Metal | 99 | tg 128 | 39.03 ± 0.20 |
| llama 7B Q5_1 | 4.72 GiB | 6.74 B | Metal | 99 | pp 512 | 455.69 ± 3.64 |
| llama 7B Q5_1 | 4.72 GiB | 6.74 B | Metal | 99 | tg 128 | 41.92 ± 0.18 |
| llama 7B Q5_K - Medium | 4.45 GiB | 6.74 B | Metal | 99 | pp 512 | 428.63 ± 0.80 |
| llama 7B Q5_K - Medium | 4.45 GiB | 6.74 B | Metal | 99 | tg 128 | 39.67 ± 0.05 |
| llama 7B Q5_K - Small | 4.33 GiB | 6.74 B | Metal | 99 | pp 512 | 422.92 ± 1.60 |
| llama 7B Q5_K - Small | 4.33 GiB | 6.74 B | Metal | 99 | tg 128 | 40.34 ± 0.25 |
| llama 7B Q4_1 | 3.95 GiB | 6.74 B | Metal | 99 | pp 512 | 534.08 ± 1.73 |
| llama 7B Q4_1 | 3.95 GiB | 6.74 B | Metal | 99 | tg 128 | 57.19 ± 0.51 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 99 | pp 512 | 469.91 ± 2.67 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 99 | tg 128 | 49.24 ± 0.30 |
| llama 7B Q4_K - Small | 3.59 GiB | 6.74 B | Metal | 99 | pp 512 | 474.25 ± 0.59 |
| llama 7B Q4_K - Small | 3.59 GiB | 6.74 B | Metal | 99 | tg 128 | 52.02 ± 0.50 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | Metal | 99 | pp 512 | 536.36 ± 2.01 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | Metal | 99 | tg 128 | 61.39 ± 0.59 |
| llama 7B IQ4_NL - 4.5 bpw | 3.56 GiB | 6.74 B | Metal | 99 | pp 512 | 520.62 ± 0.52 |
| llama 7B IQ4_NL - 4.5 bpw | 3.56 GiB | 6.74 B | Metal | 99 | tg 128 | 53.15 ± 0.05 |
| llama 7B IQ4_XS - 4.25 bpw | 3.37 GiB | 6.74 B | Metal | 99 | pp 512 | 491.69 ± 0.44 |
| llama 7B IQ4_XS - 4.25 bpw | 3.37 GiB | 6.74 B | Metal | 99 | tg 128 | 51.86 ± 0.06 |
| llama 7B Q3_K - Large | 3.35 GiB | 6.74 B | Metal | 99 | pp 512 | 449.53 ± 0.64 |
| llama 7B Q3_K - Large | 3.35 GiB | 6.74 B | Metal | 99 | tg 128 | 40.93 ± 0.02 |
| llama 7B Q3_K - Medium | 3.07 GiB | 6.74 B | Metal | 99 | pp 512 | 469.64 ± 0.61 |
| llama 7B Q3_K - Medium | 3.07 GiB | 6.74 B | Metal | 99 | tg 128 | 44.19 ± 0.01 |
| llama 7B IQ3_S mix - 3.66 bpw | 2.90 GiB | 6.74 B | Metal | 99 | pp 512 | 459.51 ± 0.40 |
| llama 7B IQ3_S mix - 3.66 bpw | 2.90 GiB | 6.74 B | Metal | 99 | tg 128 | 47.68 ± 0.04 |
| llama 7B IQ3_S - 3.4375 bpw | 2.75 GiB | 6.74 B | Metal | 99 | pp 512 | 456.12 ± 0.32 |
| llama 7B IQ3_S - 3.4375 bpw | 2.75 GiB | 6.74 B | Metal | 99 | tg 128 | 47.74 ± 0.04 |
| llama 7B Q3_K - Small | 2.75 GiB | 6.74 B | Metal | 99 | pp 512 | 466.80 ± 0.27 |
| llama 7B Q3_K - Small | 2.75 GiB | 6.74 B | Metal | 99 | tg 128 | 42.60 ± 0.05 |
| llama 7B IQ3_XS - 3.3 bpw | 2.60 GiB | 6.74 B | Metal | 99 | pp 512 | 464.88 ± 0.42 |
| llama 7B IQ3_XS - 3.3 bpw | 2.60 GiB | 6.74 B | Metal | 99 | tg 128 | 47.87 ± 0.05 |
| llama 7B IQ3_XXS - 3.0625 bpw | 2.41 GiB | 6.74 B | Metal | 99 | pp 512 | 470.03 ± 0.38 |
| llama 7B IQ3_XXS - 3.0625 bpw | 2.41 GiB | 6.74 B | Metal | 99 | tg 128 | 46.44 ± 0.07 |
| llama 7B Q2_K - Medium | 2.36 GiB | 6.74 B | Metal | 99 | pp 512 | 493.38 ± 0.42 |
| llama 7B Q2_K - Medium | 2.36 GiB | 6.74 B | Metal | 99 | tg 128 | 51.57 ± 0.11 |
| llama 7B IQ2_M - 2.7 bpw | 2.20 GiB | 6.74 B | Metal | 99 | pp 512 | 462.36 ± 0.45 |
| llama 7B IQ2_M - 2.7 bpw | 2.20 GiB | 6.74 B | Metal | 99 | tg 128 | 38.38 ± 0.03 |
| llama 7B Q2_K - Small | 2.16 GiB | 6.74 B | Metal | 99 | pp 512 | 512.21 ± 0.57 |
| llama 7B Q2_K - Small | 2.16 GiB | 6.74 B | Metal | 99 | tg 128 | 62.49 ± 0.20 |
| llama 7B IQ2_S - 2.5 bpw | 2.05 GiB | 6.74 B | Metal | 99 | pp 512 | 469.49 ± 1.99 |
| llama 7B IQ2_S - 2.5 bpw | 2.05 GiB | 6.74 B | Metal | 99 | tg 128 | 47.32 ± 0.26 |
| llama 7B IQ2_XS - 2.3125 bpw | 1.89 GiB | 6.74 B | Metal | 99 | pp 512 | 472.16 ± 0.41 |
| llama 7B IQ2_XS - 2.3125 bpw | 1.89 GiB | 6.74 B | Metal | 99 | tg 128 | 48.42 ± 0.03 |
| llama 7B IQ2_XXS - 2.0625 bpw | 1.73 GiB | 6.74 B | Metal | 99 | pp 512 | 462.08 ± 0.30 |
| llama 7B IQ2_XXS - 2.0625 bpw | 1.73 GiB | 6.74 B | Metal | 99 | tg 128 | 50.45 ± 0.03 |
| llama 7B IQ1_S - 1.5625 bpw | 1.42 GiB | 6.74 B | Metal | 99 | pp 512 | 499.09 ± 0.56 |
| llama 7B IQ1_S - 1.5625 bpw | 1.42 GiB | 6.74 B | Metal | 99 | tg 128 | 53.01 ± 0.33 |

It’s quite long, so I did some very simple analysis on it.

Best Prompt Processing and Token Generation on M1 Max 32 GPU

I multiplied pp by tg and sorted the results in descending order. Considering pp and tg together, Q4_0 is the best. Because the full list is long, only the top 10 entries are shown. The F32 result was dropped because practically nobody runs an F32 model with llama.cpp.
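As a small sketch of this computation (only a few rows are copied from the tables above; the rest are elided), the same expressions also yield the pp * tg / ppl and pp * tg / ppl / size rankings used in the later sections:

```python
# (quant type, size in GiB, pp 512 t/s, tg 128 t/s, ppl), copied from the tables above.
rows = [
    ("Q4_0",   3.57, 536.36, 61.39, 7.6261),
    ("Q2_K_S", 2.16, 512.21, 62.49, 9.1756),
    ("Q4_1",   3.95, 534.08, 57.19, 7.5913),
    # ... remaining quant types elided
]

# Sort by pp * tg in descending order and print the derived metrics.
for name, size, pp, tg, ppl in sorted(rows, key=lambda r: r[2] * r[3], reverse=True):
    print(f"{name}: pp*tg = {pp * tg:.1f}, "
          f"pp*tg/ppl = {pp * tg / ppl:.1f}, "
          f"pp*tg/ppl/size = {pp * tg / ppl / size:.1f}")
```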

| quant type | model | size GiB | pp 512 t/s | tg 128 t/s | ppl | pp * tg |
| --- | --- | --- | --- | --- | --- | --- |
| Q4_0 | llama 7B Q4_0 | 3.57 | 536.36 | 61.39 | 7.6261 | 32927.1404 |
| Q2_K_S | llama 7B Q2_K - Small | 2.16 | 512.21 | 62.49 | 9.1756 | 32008.0029 |
| Q4_1 | llama 7B Q4_1 | 3.95 | 534.08 | 57.19 | 7.5913 | 30544.0352 |
| IQ4_NL | llama 7B IQ4_NL - 4.5 bpw | 3.56 | 520.62 | 53.15 | 7.5392 | 27670.953 |
| IQ1_S | llama 7B IQ1_S - 1.5625 bpw | 1.42 | 499.09 | 53.01 | 28.7926 | 26456.7609 |
| IQ4_XS | llama 7B IQ4_XS - 4.25 bpw | 3.37 | 491.69 | 51.86 | 7.5231 | 25499.0434 |
| Q2_K | llama 7B Q2_K - Medium | 2.36 | 493.38 | 51.57 | 8.6501 | 25443.6066 |
| Q4_K_S | llama 7B Q4_K - Small | 3.59 | 474.25 | 52.02 | 7.6066 | 24670.485 |
| IQ2_XXS | llama 7B IQ2_XXS - 2.0625 bpw | 1.73 | 462.08 | 50.45 | 11.0326 | 23311.936 |
| Q4_K_M | llama 7B Q4_K - Medium | 3.8 | 469.91 | 49.24 | 7.5692 | 23138.3684 |

Best Prompt Processing and Token Generation over Perplexity on M1 Max 32 GPU

The lower the perplexity, the better, so I divided pp * tg by ppl and sorted in descending order. Q4_0 is still the best, but the rest of the list changes. Q2_K_S still ranks high even though its perplexity is poor.

| quant type | model | size GiB | pp 512 t/s | tg 128 t/s | ppl | pp * tg / ppl |
| --- | --- | --- | --- | --- | --- | --- |
| Q4_0 | llama 7B Q4_0 | 3.57 | 536.36 | 61.39 | 7.6261 | 4317.69062 |
| Q4_1 | llama 7B Q4_1 | 3.95 | 534.08 | 57.19 | 7.5913 | 4023.55791 |
| IQ4_NL | llama 7B IQ4_NL - 4.5 bpw | 3.56 | 520.62 | 53.15 | 7.5392 | 3670.27709 |
| Q2_K_S | llama 7B Q2_K - Small | 2.16 | 512.21 | 62.49 | 9.1756 | 3488.38255 |
| IQ4_XS | llama 7B IQ4_XS - 4.25 bpw | 3.37 | 491.69 | 51.86 | 7.5231 | 3389.433 |
| Q4_K_S | llama 7B Q4_K - Small | 3.59 | 474.25 | 52.02 | 7.6066 | 3243.2999 |
| Q4_K_M | llama 7B Q4_K - Medium | 3.8 | 469.91 | 49.24 | 7.5692 | 3056.91069 |
| Q8_0 | llama 7B Q8_0 | 6.67 | 547.28 | 40.38 | 7.4933 | 2949.19013 |
| Q2_K | llama 7B Q2_K - Medium | 2.36 | 493.38 | 51.57 | 8.6501 | 2941.42341 |
| IQ3_XS | llama 7B IQ3_XS - 3.3 bpw | 2.6 | 464.88 | 47.87 | 7.8787 | 2824.55298 |

llama-bench on NVIDIA 4060 Ti 16GB

As with the M1 Max, this table is sorted by size in descending order. Because of the limited VRAM, there is no F32 result.

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 512 | 3141.27 ± 47.64 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | tg 128 | 19.55 ± 0.00 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | pp 512 | 2105.29 ± 1.55 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | tg 128 | 35.52 ± 0.01 |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CUDA | 99 | pp 512 | 2152.08 ± 1.97 |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CUDA | 99 | tg 128 | 44.75 ± 0.02 |
| llama 7B Q5_1 | 4.72 GiB | 6.74 B | CUDA | 99 | pp 512 | 2174.41 ± 1.67 |
| llama 7B Q5_1 | 4.72 GiB | 6.74 B | CUDA | 99 | tg 128 | 48.58 ± 0.02 |
| llama 7B Q5_K - Medium | 4.45 GiB | 6.74 B | CUDA | 99 | pp 512 | 2189.18 ± 2.62 |
| llama 7B Q5_K - Medium | 4.45 GiB | 6.74 B | CUDA | 99 | tg 128 | 51.01 ± 0.02 |
| llama 7B Q5_K - Small | 4.33 GiB | 6.74 B | CUDA | 99 | pp 512 | 2195.82 ± 2.89 |
| llama 7B Q5_K - Small | 4.33 GiB | 6.74 B | CUDA | 99 | tg 128 | 52.30 ± 0.02 |
| llama 7B Q4_1 | 3.95 GiB | 6.74 B | CUDA | 99 | pp 512 | 2203.94 ± 2.60 |
| llama 7B Q4_1 | 3.95 GiB | 6.74 B | CUDA | 99 | tg 128 | 56.83 ± 0.03 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | pp 512 | 2208.96 ± 2.55 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | tg 128 | 58.49 ± 0.03 |
| llama 7B Q4_K - Small | 3.59 GiB | 6.74 B | CUDA | 99 | pp 512 | 2218.12 ± 2.64 |
| llama 7B Q4_K - Small | 3.59 GiB | 6.74 B | CUDA | 99 | tg 128 | 61.44 ± 0.03 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | CUDA | 99 | pp 512 | 2219.31 ± 1.77 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | CUDA | 99 | tg 128 | 61.80 ± 0.03 |
| llama 7B IQ4_NL - 4.5 bpw | 3.56 GiB | 6.74 B | CUDA | 99 | pp 512 | 2215.14 ± 1.76 |
| llama 7B IQ4_NL - 4.5 bpw | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 61.70 ± 0.03 |
| llama 7B IQ4_XS - 4.25 bpw | 3.37 GiB | 6.74 B | CUDA | 99 | pp 512 | 2220.83 ± 1.47 |
| llama 7B IQ4_XS - 4.25 bpw | 3.37 GiB | 6.74 B | CUDA | 99 | tg 128 | 64.79 ± 0.03 |
| llama 7B Q3_K - Large | 3.35 GiB | 6.74 B | CUDA | 99 | pp 512 | 2217.48 ± 2.38 |
| llama 7B Q3_K - Large | 3.35 GiB | 6.74 B | CUDA | 99 | tg 128 | 64.82 ± 0.03 |
| llama 7B Q3_K - Medium | 3.07 GiB | 6.74 B | CUDA | 99 | pp 512 | 2229.58 ± 2.79 |
| llama 7B Q3_K - Medium | 3.07 GiB | 6.74 B | CUDA | 99 | tg 128 | 69.80 ± 0.03 |
| llama 7B IQ3_S mix - 3.66 bpw | 2.90 GiB | 6.74 B | CUDA | 99 | pp 512 | 2231.57 ± 2.73 |
| llama 7B IQ3_S mix - 3.66 bpw | 2.90 GiB | 6.74 B | CUDA | 99 | tg 128 | 73.05 ± 0.05 |
| llama 7B IQ3_S - 3.4375 bpw | 2.75 GiB | 6.74 B | CUDA | 99 | pp 512 | 2238.22 ± 2.44 |
| llama 7B IQ3_S - 3.4375 bpw | 2.75 GiB | 6.74 B | CUDA | 99 | tg 128 | 76.30 ± 0.04 |
| llama 7B Q3_K - Small | 2.75 GiB | 6.74 B | CUDA | 99 | pp 512 | 2241.64 ± 2.73 |
| llama 7B Q3_K - Small | 2.75 GiB | 6.74 B | CUDA | 99 | tg 128 | 75.79 ± 0.02 |
| llama 7B IQ3_XS - 3.3 bpw | 2.60 GiB | 6.74 B | CUDA | 99 | pp 512 | 2242.87 ± 2.71 |
| llama 7B IQ3_XS - 3.3 bpw | 2.60 GiB | 6.74 B | CUDA | 99 | tg 128 | 79.56 ± 0.05 |
| llama 7B IQ3_XXS - 3.0625 bpw | 2.41 GiB | 6.74 B | CUDA | 99 | pp 512 | 2251.70 ± 3.08 |
| llama 7B IQ3_XXS - 3.0625 bpw | 2.41 GiB | 6.74 B | CUDA | 99 | tg 128 | 84.73 ± 0.04 |
| llama 7B Q2_K - Medium | 2.36 GiB | 6.74 B | CUDA | 99 | pp 512 | 2260.54 ± 2.69 |
| llama 7B Q2_K - Medium | 2.36 GiB | 6.74 B | CUDA | 99 | tg 128 | 86.31 ± 0.04 |
| llama 7B IQ2_M - 2.7 bpw | 2.20 GiB | 6.74 B | CUDA | 99 | pp 512 | 2260.25 ± 2.62 |
| llama 7B IQ2_M - 2.7 bpw | 2.20 GiB | 6.74 B | CUDA | 99 | tg 128 | 91.42 ± 0.07 |
| llama 7B Q2_K - Small | 2.16 GiB | 6.74 B | CUDA | 99 | pp 512 | 2266.25 ± 1.72 |
| llama 7B Q2_K - Small | 2.16 GiB | 6.74 B | CUDA | 99 | tg 128 | 93.02 ± 0.07 |
| llama 7B IQ2_S - 2.5 bpw | 2.05 GiB | 6.74 B | CUDA | 99 | pp 512 | 2265.55 ± 2.17 |
| llama 7B IQ2_S - 2.5 bpw | 2.05 GiB | 6.74 B | CUDA | 99 | tg 128 | 96.16 ± 0.08 |
| llama 7B IQ2_XS - 2.3125 bpw | 1.89 GiB | 6.74 B | CUDA | 99 | pp 512 | 2274.67 ± 1.97 |
| llama 7B IQ2_XS - 2.3125 bpw | 1.89 GiB | 6.74 B | CUDA | 99 | tg 128 | 101.77 ± 0.08 |
| llama 7B IQ2_XXS - 2.0625 bpw | 1.73 GiB | 6.74 B | CUDA | 99 | pp 512 | 2288.92 ± 2.73 |
| llama 7B IQ2_XXS - 2.0625 bpw | 1.73 GiB | 6.74 B | CUDA | 99 | tg 128 | 78.62 ± 0.09 |
| llama 7B IQ1_S - 1.5625 bpw | 1.42 GiB | 6.74 B | CUDA | 99 | pp 512 | 2306.16 ± 1.44 |
| llama 7B IQ1_S - 1.5625 bpw | 1.42 GiB | 6.74 B | CUDA | 99 | tg 128 | 114.36 ± 0.12 |

Best Prompt Processing and Token Generation on NVIDIA 4060 Ti 16GB

As you can see, the I-Quant types are much faster here than on the M1 Max and dominate the top of the list.

| quant type | model | size GiB | pp 512 t/s | tg 128 t/s | ppl | pp * tg |
| --- | --- | --- | --- | --- | --- | --- |
| IQ1_S | llama 7B IQ1_S - 1.5625 bpw | 1.42 | 2306.16 | 114.36 | 28.7926 | 263732.458 |
| IQ2_XS | llama 7B IQ2_XS - 2.3125 bpw | 1.89 | 2274.67 | 101.77 | 9.7873 | 231493.166 |
| IQ2_S | llama 7B IQ2_S - 2.5 bpw | 2.05 | 2265.55 | 96.16 | 9.1459 | 217855.288 |
| Q2_K_S | llama 7B Q2_K - Small | 2.16 | 2266.25 | 93.02 | 9.1756 | 210806.575 |
| IQ2_M | llama 7B IQ2_M - 2.7 bpw | 2.2 | 2260.25 | 91.42 | 8.6002 | 206632.055 |
| Q2_K | llama 7B Q2_K - Medium | 2.36 | 2260.54 | 86.31 | 8.6501 | 195107.207 |
| IQ3_XXS | llama 7B IQ3_XXS - 3.0625 bpw | 2.41 | 2251.7 | 84.73 | 8.2039 | 190786.541 |
| IQ2_XXS | llama 7B IQ2_XXS - 2.0625 bpw | 1.73 | 2288.92 | 78.62 | 11.0326 | 179954.89 |
| IQ3_XS | llama 7B IQ3_XS - 3.3 bpw | 2.6 | 2242.87 | 79.56 | 7.8787 | 178442.737 |
| IQ3_S | llama 7B IQ3_S - 3.4375 bpw | 2.75 | 2238.22 | 76.3 | 7.7904 | 170776.186 |

Best Prompt Processing and Token Generation over Perplexity on NVIDIA 4060 Ti 16GB

The quants in the 4060 Ti's top-ten list show higher perplexities than those in the M1 Max list.

| quant type | model | size GiB | pp 512 t/s | tg 128 t/s | ppl | pp * tg / ppl |
| --- | --- | --- | --- | --- | --- | --- |
| IQ2_M | llama 7B IQ2_M - 2.7 bpw | 2.2 | 2260.25 | 91.42 | 8.6002 | 24026.42439 |
| IQ2_S | llama 7B IQ2_S - 2.5 bpw | 2.05 | 2265.55 | 96.16 | 9.1459 | 23819.99453 |
| IQ2_XS | llama 7B IQ2_XS - 2.3125 bpw | 1.89 | 2274.67 | 101.77 | 9.7873 | 23652.40321 |
| IQ3_XXS | llama 7B IQ3_XXS - 3.0625 bpw | 2.41 | 2251.7 | 84.73 | 8.2039 | 23255.59076 |
| Q2_K_S | llama 7B Q2_K - Small | 2.16 | 2266.25 | 93.02 | 9.1756 | 22974.69103 |
| IQ3_XS | llama 7B IQ3_XS - 3.3 bpw | 2.6 | 2242.87 | 79.56 | 7.8787 | 22648.75388 |
| Q2_K | llama 7B Q2_K - Medium | 2.36 | 2260.54 | 86.31 | 8.6501 | 22555.48576 |
| IQ3_S | llama 7B IQ3_S - 3.4375 bpw | 2.75 | 2238.22 | 76.3 | 7.7904 | 21921.36296 |
| Q3_K_S | llama 7B Q3_K - Small | 2.75 | 2241.64 | 75.79 | 8.0321 | 21151.86509 |
| IQ3_M | llama 7B IQ3_S mix - 3.66 bpw | 2.9 | 2231.57 | 73.05 | 7.7695 | 20981.5546 |

Best Prompt Processing and Token Generation over Perplexity and Size on NVIDIA 4060 Ti 16GB

16GB of VRAM is not that big, so I also divided pp * tg / ppl by the model size.

| quant type | model | size GiB | pp 512 t/s | tg 128 t/s | ppl | pp * tg / ppl / size |
| --- | --- | --- | --- | --- | --- | --- |
| IQ2_XS | llama 7B IQ2_XS - 2.3125 bpw | 1.89 | 2274.67 | 101.77 | 9.7873 | 12514.49905 |
| IQ2_S | llama 7B IQ2_S - 2.5 bpw | 2.05 | 2265.55 | 96.16 | 9.1459 | 11619.50953 |
| IQ2_M | llama 7B IQ2_M - 2.7 bpw | 2.2 | 2260.25 | 91.42 | 8.6002 | 10921.10199 |
| Q2_K_S | llama 7B Q2_K - Small | 2.16 | 2266.25 | 93.02 | 9.1756 | 10636.43103 |
| IQ3_XXS | llama 7B IQ3_XXS - 3.0625 bpw | 2.41 | 2251.7 | 84.73 | 8.2039 | 9649.62272 |
| Q2_K | llama 7B Q2_K - Medium | 2.36 | 2260.54 | 86.31 | 8.6501 | 9557.409222 |
| IQ2_XXS | llama 7B IQ2_XXS - 2.0625 bpw | 1.73 | 2288.92 | 78.62 | 11.0326 | 9428.436439 |
| IQ3_XS | llama 7B IQ3_XS - 3.3 bpw | 2.6 | 2242.87 | 79.56 | 7.8787 | 8711.059185 |
| IQ3_S | llama 7B IQ3_S - 3.4375 bpw | 2.75 | 2238.22 | 76.3 | 7.7904 | 7971.404713 |
| Q3_K_S | llama 7B Q3_K - Small | 2.75 | 2241.64 | 75.79 | 8.0321 | 7691.587306 |

Conclusion

The new I-Quants tend to be slower on the M1 Max but faster on the NVIDIA 4060 Ti 16GB, though their perplexities tend to be higher than those of the other quants.

On the M1 Max, Q4_0 is the best quant type considering pp, tg, and ppl.

On the 4060 Ti 16GB, IQ1_S is the best quant type considering only pp and tg, but its perplexity doesn't look good. Factoring in perplexity, IQ2_M is the best; factoring in size as well, IQ2_XS is the best.

Though they are not so fast on the M1 Max, the smaller I-Quants enable even 120B models to run within 48GB of VRAM on the M1 Max with an 8k or 16k context length. An IQ2_XS-quantized 120B model is about 35.38GB, which works out to roughly 2.4 bits per weight (35.38 GB × 8 bits / ~120 G params), in line with IQ2_XS's nominal 2.3125 bpw plus some overhead.

Author: beebopkim

Posted on: 2024-03-09

Updated on: 2024-03-10
