Raincleared
committed on
Update README.md
README.md
CHANGED
@@ -18,6 +18,18 @@ license: apache-2.0
 - Adapted LLaMA version: [MiniCPM-S-1B-sft-llama-format](https://huggingface.co/openbmb/MiniCPM-S-1B-sft-llama-format/)
 - Adapted PowerInfer version: [MiniCPM-S-1B-sft-gguf](https://huggingface.co/openbmb/MiniCPM-S-1B-sft-gguf)
 
+### Chat Template
+
+To get well-formed responses to a query, it is recommended to use a standard chat prompt, such as:
+
+```
+<用户>{prompt}<AI>
+```
+
+where `prompt` is the query text, while `<用户>` and `<AI>` are prompt tokens.
+
+Also, make sure that **a BOS token `<s>` is at the beginning of every input**; otherwise the model can sometimes behave improperly.
+
 ### Introduction
 
 The utilization of activation sparsity, namely the existence of considerable weakly-contributed elements among activation outputs, is a promising method for inference acceleration of large language models (LLMs) ([Liu et al., 2023](https://proceedings.mlr.press/v202/liu23am/liu23am.pdf); [Song et al., 2023](https://arxiv.org/pdf/2312.12456.pdf)). Concretely, acceleration methods based on activation sparsity usually achieve higher inference speed by making wiser resource allocation and computation policies to avoid resource waste on these weakly-contributed parameters.
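
The chat template added in this change can be sketched as a small string helper. This is only an illustration under stated assumptions: the function names `build_prompt` and `has_bos` and the `add_bos` flag are hypothetical and are not part of the model repository. The token strings `<s>`, `<用户>` (Chinese for "user"), and `<AI>` are taken from the README text.

```python
# Token strings as given in the README's Chat Template section.
BOS_TOKEN = "<s>"
USER_TOKEN = "<用户>"
AI_TOKEN = "<AI>"


def build_prompt(query: str, add_bos: bool = True) -> str:
    """Wrap a query in the chat template (hypothetical helper).

    With add_bos=True the BOS token is prepended as literal text.
    """
    body = f"{USER_TOKEN}{query}{AI_TOKEN}"
    return f"{BOS_TOKEN}{body}" if add_bos else body


def has_bos(text: str) -> bool:
    """Check that an input string starts with the BOS token."""
    return text.startswith(BOS_TOKEN)


if __name__ == "__main__":
    prompt = build_prompt("What is activation sparsity?")
    print(prompt)
    print(has_bos(prompt))
```

If the tokenizer in use already prepends `<s>` on its own, passing `add_bos=False` avoids a duplicated BOS token. Whether that applies depends on the tokenizer configuration, which this sketch does not inspect.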