yangapku committed · Commit 2468096 · 1 Parent(s): 8857f77

update int8 quantization info

Files changed (1)
  1. README.md +32 -19
README.md CHANGED
@@ -164,40 +164,53 @@ response, history = model.chat(tokenizer, "你好", history=None)
 
 ### Performance Evaluation
 
- We report the zero-shot performance of the BF16 and Int4 models on the benchmarks and find that the quantized model does not suffer significant performance degradation. Results are shown below:
 
- | Quantization | MMLU | CEval (val) | GSM8K | HumanEval |
- |--------------|:----:|:-----------:|:-----:|:---------:|
- | BF16 | 55.8 | 59.7 | 50.3 | 37.2 |
- | Int4 | 55.1 | 59.2 | 49.7 | 35.4 |
 
 ### Inference Speed
 
- We measured the average inference speed of generating 2048 and 8192 tokens under BF16 precision and Int4 quantization, respectively.
 
- | Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
- |--------------|:-------------------:|:-------------------:|
- | BF16 | 30.53 | 28.51 |
- | Int4 | 45.60 | 33.83 |
 
- In detail, the profiling setting is generating 8192 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over the generated 8192 tokens.
 
 ### GPU Memory Usage
 
- We also profiled the peak GPU memory usage for encoding 2048 tokens as context (and generating a single token) and for generating 8192 tokens (with a single token as context) under BF16 and Int4 quantization, respectively. The results are shown below.
 
 | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
- |--------------------|:-----------------------------------:|:-------------------------------------:|
- | BF16 | 18.99GB | 24.40GB |
- | Int4 | 10.20GB | 15.61GB |
 
 The profiling above was done with [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
 
 
 ### Performance Evaluation
 
+ We report the zero-shot performance of the BF16, Int8 and Int4 models on the benchmarks and find that the quantized models do not suffer significant performance degradation. Results are shown below:
 
+ | Quantization | MMLU | CEval (val) | GSM8K | HumanEval |
+ | ------------- | :--------: | :----------: | :----: | :--------: |
+ | BF16 | 55.8 | 59.7 | 50.3 | 37.2 |
+ | Int8 | 55.4 | 59.4 | 48.3 | 34.8 |
+ | Int4 | 55.1 | 59.2 | 49.7 | 29.9 |
 
 ### Inference Speed
 
+ We measured the average inference speed of generating 2048 and 8192 tokens at different quantization levels and with different versions of flash-attention.
 
+ | Quantization | FlashAttn | Speed (2048 tokens) | Speed (8192 tokens) |
+ | ------------- | :-------: | :------------------:| :------------------:|
+ | BF16 | v2 | 40.93 | 36.14 |
+ | Int8 | v2 | 37.47 | 32.54 |
+ | Int4 | v2 | 50.09 | 38.61 |
+ | BF16 | v1 | 40.75 | 35.34 |
+ | Int8 | v1 | 37.51 | 32.39 |
+ | Int4 | v1 | 45.98 | 36.47 |
+ | BF16 | Disabled | 37.55 | 33.56 |
+ | Int8 | Disabled | 37.84 | 32.65 |
+ | Int4 | Disabled | 48.12 | 36.70 |
 
+ In detail, the profiling setting is generating 8192 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8. The inference speed is averaged over the generated 8192 tokens.
+
+ Note: The generation speeds of the Int4/Int8 models above were measured with the AutoGPTQ library. Models currently loaded with `AutoModelForCausalLM.from_pretrained` generate approximately 20% slower. We have reported this issue to the HuggingFace team and will update promptly if a solution becomes available.
 
 ### GPU Memory Usage
 
+ We also profiled the peak GPU memory usage for encoding 2048 tokens as context (and generating a single token) and for generating 8192 tokens (with a single token as context) at different quantization levels. (GPU memory usage is similar with or without flash-attention.) The results are shown below.
 
 | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
+ | ------------------ | :---------------------------------: | :-----------------------------------: |
+ | BF16 | 16.99GB | 22.53GB |
+ | Int8 | 11.20GB | 16.62GB |
+ | Int4 | 8.21GB | 13.63GB |
 
 The profiling above was done with [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
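
To make the trade-offs in the updated tables easier to read, the following is a minimal sketch that computes each quantized model's percent change relative to BF16. It is not part of the commit: the figures are copied from the new README tables (peak memory, and speed with flash-attention v2), while the names `PEAK_MEMORY_GB`, `SPEED_FLASH_V2` and `relative_change` are hypothetical and introduced only for illustration.

```python
# Figures copied from the updated README tables in this commit.
# Peak GPU memory in GB: (encoding 2048 tokens, generating 8192 tokens).
PEAK_MEMORY_GB = {
    "BF16": (16.99, 22.53),
    "Int8": (11.20, 16.62),
    "Int4": (8.21, 13.63),
}

# Speed with flash-attention v2: (2048 tokens, 8192 tokens),
# in whatever unit the README table reports (the diff does not state it).
SPEED_FLASH_V2 = {
    "BF16": (40.93, 36.14),
    "Int8": (37.47, 32.54),
    "Int4": (50.09, 38.61),
}


def relative_change(table, baseline="BF16"):
    """Return the percent change of every non-baseline row relative to the baseline row."""
    base = table[baseline]
    return {
        name: tuple(round(100.0 * (v - b) / b, 1) for v, b in zip(values, base))
        for name, values in table.items()
        if name != baseline
    }


memory_change = relative_change(PEAK_MEMORY_GB)
speed_change = relative_change(SPEED_FLASH_V2)

for name in memory_change:
    print(name, "memory change (%):", memory_change[name],
          "speed change (%):", speed_change[name])
```

On these figures, Int4 roughly halves the peak memory for encoding 2048 tokens and is faster than BF16, whereas Int8 saves memory but is somewhat slower than BF16 with flash-attention v2.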