Raincleared committed on
Commit f5babc0 · verified · 1 Parent(s): 322b1b7

Update README.md

  1. README.md +12 -0
README.md CHANGED
@@ -18,6 +18,18 @@ license: apache-2.0
  - Adapted LLaMA version: [MiniCPM-S-1B-sft-llama-format](https://huggingface.co/openbmb/MiniCPM-S-1B-sft-llama-format/)
  - Adapted PowerInfer version: [MiniCPM-S-1B-sft-gguf](https://huggingface.co/openbmb/MiniCPM-S-1B-sft-gguf)
 
+ ### Chat Template
+
+ To get well-formed responses from the model, it is recommended to wrap each query in a standard chat prompt, such as:
+
+ ```
+ <用户>{prompt}<AI>
+ ```
+
+ where `prompt` is the query text, and `<用户>` (Chinese for "user") and `<AI>` are prompt tokens.
+
+ Also, make sure that **a BOS token `<s>` appears at the beginning of every input**; otherwise the model can sometimes behave improperly.
+
  ### Introduction
 
  Exploiting activation sparsity, namely the presence of many weakly contributing elements among activation outputs, is a promising approach to accelerating inference in large language models (LLMs) ([Liu et al., 2023](https://proceedings.mlr.press/v202/liu23am/liu23am.pdf); [Song et al., 2023](https://arxiv.org/pdf/2312.12456.pdf)). Concretely, acceleration methods based on activation sparsity usually achieve higher inference speed through wiser resource allocation and computation policies that avoid wasting resources on these weakly contributing parameters.
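
The chat template added in this commit can be illustrated with a small string helper. This is a minimal sketch, not code from the repository: the function name `build_prompt` and the `add_bos` flag are hypothetical, and it assumes the prompt is assembled as plain text before tokenization. Only the tokens `<s>`, `<用户>`, and `<AI>` come from the README.

```python
# Tokens taken from the README's "Chat Template" section.
BOS_TOKEN = "<s>"
USER_TAG = "<用户>"
AI_TAG = "<AI>"


def build_prompt(query: str, add_bos: bool = True) -> str:
    """Wrap a query in the chat template (hypothetical helper).

    When add_bos is True, the BOS token is prepended as literal text.
    Set it to False if your tokenizer already inserts BOS on its own,
    so the token is not duplicated.
    """
    prefix = BOS_TOKEN if add_bos else ""
    return f"{prefix}{USER_TAG}{query}{AI_TAG}"


if __name__ == "__main__":
    print(build_prompt("What is activation sparsity?"))
```

Whether the BOS token should be added as text or left to the tokenizer depends on the tokenizer configuration in use, which this page does not specify; check the tokenized input to confirm that exactly one `<s>` appears at the start.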