---
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
base_model:
- meta-llama/Llama-2-13b
tags:
- nvidia
- llama 2
- pytorch
- kvcache
library_name: megatron-lm
---

# Llama-2-13B-DMC-4x

## Description

Llama-2-13B-DMC-4x is a version of [Llama 2 13B](https://www.llama.com/llama2/) that has been trained to apply the Dynamic Memory Compression (DMC) algorithm ([https://arxiv.org/abs/2403.09636](https://arxiv.org/abs/2403.09636)). With DMC, the model performs on-line key–value cache compression at inference time, achieving substantially higher throughput and/or lower latency. Most importantly, it learns to apply different compression ratios in different heads and layers. The source code for training and inference is provided in the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/dmc) repository.

This model is for research and development only.

### License

GOVERNING TERMS: This model is governed by the NVIDIA Open Model License Agreement (found at https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf). <br>
Additional Information: LLAMA 2 COMMUNITY LICENSE AGREEMENT (found at https://huggingface.co/meta-llama/Llama-2-13b/blob/main/LICENSE.txt).

## Reference
[Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference](https://arxiv.org/abs/2403.09636)

## Model Architecture

Llama-2-13B-DMC-4x uses a model embedding size of 5120, 40 attention heads, an MLP intermediate dimension of 13824, and 40 layers in total. Additionally, it uses Rotary Position Embeddings (RoPE).

**Architecture Type:** Transformer Decoder (Auto-regressive Language Model)

**Network Architecture:** Llama 2 13B
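
As a rough illustration of what 4x key–value cache compression means for this architecture, the sketch below estimates the cache footprint of one 4096-token sequence from the dimensions above. It assumes that every attention head keeps its own keys and values, that the cache is stored in bfloat16 (2 bytes per element), and that the nominal 4x ratio is reached on average. The variable names are illustrative and are not part of Megatron-LM.

```bash
#!/usr/bin/env bash
# Back-of-the-envelope KV cache size for Llama 2 13B, with and without DMC 4x.
# Assumptions (not stated in the model card): no key/value head sharing,
# bfloat16 cache entries, and an average compression ratio of exactly 4x.

layers=40          # transformer layers
hidden=5120        # embedding size (40 heads x 128 dimensions per head)
bytes_per_elem=2   # bfloat16
seq_len=4096       # maximum number of input tokens
ratio=4            # nominal DMC compression ratio

# Each token stores one key and one value vector of size `hidden` per layer.
bytes_per_token=$(( 2 * layers * hidden * bytes_per_elem ))
cache_mib=$(( bytes_per_token * seq_len / 1024 / 1024 ))
dmc_cache_mib=$(( cache_mib / ratio ))

echo "KV cache per token:          ${bytes_per_token} bytes"
echo "KV cache per sequence:       ${cache_mib} MiB"
echo "With ${ratio}x DMC compression: ${dmc_cache_mib} MiB"
```

Under these assumptions, the cache for a single full-length sequence shrinks from about 3.1 GiB to about 0.8 GiB, which leaves room for larger batches or longer contexts within the same memory budget.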

## Input
**Input Type:** Text <br>
**Input Format:** String <br>
**Input Parameters:** One Dimensional (1D), Temperature <br>
**Other Properties Related to Input:** Max Input Tokens: 4096 <br>

## Output
**Output Type:** Text <br>
**Output Format:** String <br>
**Output Parameters:** One Dimensional (1D) <br>
**Other Properties Related to Output:** Max Output Tokens: 4096 <br>

## Software Integration
**Runtime Engine(s):**
* Not Applicable (N/A)

The model weights are distributed in bfloat16 format. However, they can be converted to other formats in order to run on other hardware microarchitectures.

**Supported Hardware Microarchitecture Compatibility:** NVIDIA Ampere and newer GPUs.<br>

**Supported Operating System(s):** <br>
* Linux <br>

## Model Version(s)
Llama 2 13B DMC 4x v1.0

# Training and Evaluation Datasets

## Training Dataset

The model was trained for 18,000 steps with a batch size of 1024, a sequence length of 4096, and a learning rate of 3e-5 with an increasing compression objective. Afterwards, it underwent additional training for 2,000 steps with a fixed compression rate of 4x and a smaller learning rate of 3e-6.

NVIDIA models are trained on a diverse set of public and proprietary datasets. This particular model was trained on a dataset containing a mixture of texts in English and 37 programming languages.
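
To put the two-phase schedule above in perspective, the following sketch counts the tokens processed in each phase. It assumes that the batch size is measured in sequences, that every sequence is filled to the full 4096 tokens, and that the second phase reuses the batch size and sequence length of the first. None of these details are stated explicitly above, so the totals are estimates.

```bash
#!/usr/bin/env bash
# Estimated token budget of the DMC retrofit, under the assumptions named above.

batch_size=1024      # sequences per step (assumed unit)
seq_len=4096         # tokens per sequence
phase1_steps=18000   # increasing compression objective, learning rate 3e-5
phase2_steps=2000    # fixed 4x compression, learning rate 3e-6

tokens_per_step=$(( batch_size * seq_len ))
phase1_tokens=$(( phase1_steps * tokens_per_step ))
phase2_tokens=$(( phase2_steps * tokens_per_step ))
total_tokens=$(( phase1_tokens + phase2_tokens ))

echo "Tokens per step: ${tokens_per_step}"
echo "Phase 1 tokens:  ${phase1_tokens}"
echo "Phase 2 tokens:  ${phase2_tokens}"
echo "Total tokens:    ${total_tokens}"
```

With these numbers the retrofit sees roughly 75.5 billion tokens in the first phase and 8.4 billion in the second, or about 84 billion in total.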

## Evaluation

| Category | Benchmark | # Shots | Llama 2 13B | Llama 2 13B DMC 4x |
|:------------|:--------------------------------------------|--------:|-----------:|------------------:|
| General | [MMLU](https://openreview.net/forum?id=d7KBjmI3GmQ) | 5 | 55.2 | 54.2 |
| Math | GSM8K | 5 | 22.9 | 22.6 |
| Commonsense | [HellaSwag](https://aclanthology.org/P19-1472) | 10 | 82.1 | 82.4 |
| Commonsense | [ARC-Easy](https://arxiv.org/abs/1803.05457) | 0 | 76.4 | 75.6 |
| Commonsense | [ARC-Challenge](https://arxiv.org/abs/1803.05457) | 25 | 59.8 | 57.7 |
| Commonsense | [PIQA](https://ojs.aaai.org/index.php/AAAI/article/view/6239) | 0 | 80.4 | 81.1 |
| Commonsense | [WinoGrande](https://ojs.aaai.org/index.php/AAAI/article/view/6399) | 5 | 77.3 | 76.1 |
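
For a quick view of how closely the compressed model tracks the baseline, the sketch below computes the per-benchmark score difference (DMC 4x minus Llama 2 13B) from the table above, plus an unweighted mean. The mean is only a crude summary, since the benchmarks differ in scale and shot count, and the script is an illustration rather than part of the released evaluation code.

```bash
#!/usr/bin/env bash
# Scores copied from the table above: benchmark, Llama 2 13B, Llama 2 13B DMC 4x.
scores='MMLU 55.2 54.2
GSM8K 22.9 22.6
HellaSwag 82.1 82.4
ARC-Easy 76.4 75.6
ARC-Challenge 59.8 57.7
PIQA 80.4 81.1
WinoGrande 77.3 76.1'

# Print the signed difference for each benchmark, then the unweighted mean.
summary=$(printf '%s\n' "$scores" | awk '
  { delta = $3 - $2; sum += delta; printf "%-14s %+5.1f\n", $1, delta }
  END { printf "%-14s %+5.2f\n", "Mean", sum / NR }
')
echo "$summary"
```

The differences range from -2.1 points (ARC-Challenge) to +0.7 points (PIQA), with an unweighted mean of about -0.63 points.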

## AI Safety Efforts

The Llama-2-13B-DMC-4x model underwent AI safety evaluation, including adversarial testing via three distinct methods:
- [Garak](https://github.com/leondz/garak) is an automated LLM vulnerability scanner that probes for common weaknesses, including prompt injection and data leakage.
- [AEGIS](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0) is a content safety evaluation dataset and LLM-based content safety classifier model that adheres to a broad taxonomy of 13 categories of critical risks in human-LLM interactions.
- Human Content Red Teaming leverages human interaction and evaluation of the model's responses.

## Inference

**Engine:** Megatron-LM <br>
**Test Hardware:** H100-80GB <br>

We recommend running the provided code inside a [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch).

1. First, download a [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) using Docker.
The code below has been tested with the `24.04-py3` version of the container.

2. After setting up the container, clone the repository and install the dependencies:
```bash
# Clone the dmc branch of Megatron-LM and install its Python dependencies.
git clone -b dmc https://github.com/NVIDIA/Megatron-LM
cd Megatron-LM
pip install -r requirements.txt
```
3. Download the [Llama 2 tokenizer](https://huggingface.co/meta-llama/Llama-2-7b/blob/main/tokenizer.model) and save it to a location of your choice, referred to below as `<TOKENIZER_MODEL>`.

4. Download a selected checkpoint and save it to a location of your choice, referred to below as `<DMC_MODEL>`.

5. We provide code to run and benchmark a simple, auto-regressive inference. Save a single prompt in a text file and run:
```bash
./examples/dmc/inference.sh 13B <DMC_MODEL> <TOKENIZER_MODEL> <PROMPT_TXT_FILE>
```
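
Steps 1 and 3 to 5 above can be sketched as two small shell helpers. This is a minimal sketch under stated assumptions, not part of the Megatron-LM repository: the image path `nvcr.io/nvidia/pytorch:24.04-py3` is assumed from the usual NGC registry naming, `--gpus all` requires the NVIDIA Container Toolkit on the host, and the function names `start_container` and `check_inputs` as well as the mount point `/workspace/dmc` are hypothetical.

```bash
#!/usr/bin/env bash
# Assumed image path: the usual NGC registry name for the PyTorch container,
# with the 24.04-py3 tag mentioned in step 1.
IMAGE="nvcr.io/nvidia/pytorch:24.04-py3"

# Step 1 (hypothetical wrapper): pull the image and start an interactive
# container with all GPUs visible and the current directory mounted.
start_container() {
  docker pull "$IMAGE" &&
  docker run --gpus all --ipc=host -it --rm \
    -v "$PWD":/workspace/dmc -w /workspace/dmc "$IMAGE"
}

# Steps 3-5 (hypothetical helper): check that the checkpoint, tokenizer, and
# prompt file exist before launching, so that a wrong path fails fast.
check_inputs() {
  local dmc_model="$1" tokenizer_model="$2" prompt_file="$3"
  [ -e "$dmc_model" ]       || { echo "checkpoint not found: $dmc_model" >&2; return 1; }
  [ -f "$tokenizer_model" ] || { echo "tokenizer not found: $tokenizer_model" >&2; return 1; }
  [ -s "$prompt_file" ]     || { echo "prompt file missing or empty: $prompt_file" >&2; return 1; }
}

# Example (inside the container, from the Megatron-LM checkout):
#   check_inputs "$DMC_MODEL" "$TOKENIZER_MODEL" prompt.txt &&
#     ./examples/dmc/inference.sh 13B "$DMC_MODEL" "$TOKENIZER_MODEL" prompt.txt
```

Both functions are only defined here, not called, so sourcing the snippet has no side effects. The pre-flight check is optional, but it reports a mistyped path before the model is loaded.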

## Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## Limitations

The model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when prompted with toxic prompts. The model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable text even if the prompt itself does not include anything explicitly offensive. This issue could be exacerbated without the use of the recommended prompt template. If you are going to use this model in an agentic workflow, validate that the imported packages are from a trusted source to ensure end-to-end security.

## Citation

If you find this model useful, please cite the following work:

```bibtex
@InProceedings{pmlr-v235-nawrot24a,
  title = {Dynamic Memory Compression: Retrofitting {LLM}s for Accelerated Inference},
  author = {Nawrot, Piotr and {\L}a\'{n}cucki, Adrian and Chochowski, Marcin and Tarjan, David and Ponti, Edoardo},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages = {37396--37412},
  year = {2024},
  editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = {235},
  series = {Proceedings of Machine Learning Research},
  month = {21--27 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/nawrot24a/nawrot24a.pdf},
  url = {https://proceedings.mlr.press/v235/nawrot24a.html},
  abstract = {Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key–value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for on-line key–value cache compression at inference time. Most importantly, the model learns to apply different compression ratios in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to $\sim 3.7 \times$ throughput increase during auto-regressive inference on an NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters. We find that DMC preserves the original downstream performance with up to 4$\times$ cache compression, outperforming up-trained grouped-query attention (GQA) and key–value eviction policies (H$_2$O, TOVA). GQA and DMC can be even combined to obtain compounded gains. As a result DMC fits longer contexts and larger batches within any given memory budget. We release the DMC code and models at https://github.com/NVIDIA/Megatron-LM/tree/DMC.}
}
```