Commit 9028279 by RichardErkhov (verified) · 1 Parent(s): ce58fc9

uploaded readme

Files changed (1)
  1. README.md +152 -0
README.md ADDED
@@ -0,0 +1,152 @@
+ Quantization made by Richard Erkhov.
+
+ [GitHub](https://github.com/RichardErkhov)
+
+ [Discord](https://discord.gg/pvy7H8DZMG)
+
+ [Request more models](https://github.com/RichardErkhov/quant_request)
+
+
+ SambaLingo-Arabic-Base-70B - GGUF
+ - Model creator: https://huggingface.co/sambanovasystems/
+ - Original model: https://huggingface.co/sambanovasystems/SambaLingo-Arabic-Base-70B/
+
+
+ | Name | Quant method | Size |
+ | ---- | ---- | ---- |
+ | [SambaLingo-Arabic-Base-70B.Q2_K.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/blob/main/SambaLingo-Arabic-Base-70B.Q2_K.gguf) | Q2_K | 23.94GB |
+ | [SambaLingo-Arabic-Base-70B.IQ3_XS.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/blob/main/SambaLingo-Arabic-Base-70B.IQ3_XS.gguf) | IQ3_XS | 26.61GB |
+ | [SambaLingo-Arabic-Base-70B.IQ3_S.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/blob/main/SambaLingo-Arabic-Base-70B.IQ3_S.gguf) | IQ3_S | 28.11GB |
+ | [SambaLingo-Arabic-Base-70B.Q3_K_S.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/blob/main/SambaLingo-Arabic-Base-70B.Q3_K_S.gguf) | Q3_K_S | 28.11GB |
+ | [SambaLingo-Arabic-Base-70B.IQ3_M.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/blob/main/SambaLingo-Arabic-Base-70B.IQ3_M.gguf) | IQ3_M | 29.06GB |
+ | [SambaLingo-Arabic-Base-70B.Q3_K.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/blob/main/SambaLingo-Arabic-Base-70B.Q3_K.gguf) | Q3_K | 31.23GB |
+ | [SambaLingo-Arabic-Base-70B.Q3_K_M.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/blob/main/SambaLingo-Arabic-Base-70B.Q3_K_M.gguf) | Q3_K_M | 31.23GB |
+ | [SambaLingo-Arabic-Base-70B.Q3_K_L.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/blob/main/SambaLingo-Arabic-Base-70B.Q3_K_L.gguf) | Q3_K_L | 33.91GB |
+ | [SambaLingo-Arabic-Base-70B.IQ4_XS.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/blob/main/SambaLingo-Arabic-Base-70B.IQ4_XS.gguf) | IQ4_XS | 34.9GB |
+ | [SambaLingo-Arabic-Base-70B.Q4_0.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/blob/main/SambaLingo-Arabic-Base-70B.Q4_0.gguf) | Q4_0 | 36.47GB |
+ | [SambaLingo-Arabic-Base-70B.IQ4_NL.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/blob/main/SambaLingo-Arabic-Base-70B.IQ4_NL.gguf) | IQ4_NL | 36.82GB |
+ | [SambaLingo-Arabic-Base-70B.Q4_K_S.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/blob/main/SambaLingo-Arabic-Base-70B.Q4_K_S.gguf) | Q4_K_S | 36.82GB |
+ | [SambaLingo-Arabic-Base-70B.Q4_K.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/tree/main/) | Q4_K | 38.85GB |
+ | [SambaLingo-Arabic-Base-70B.Q4_K_M.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/tree/main/) | Q4_K_M | 38.85GB |
+ | [SambaLingo-Arabic-Base-70B.Q4_1.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/tree/main/) | Q4_1 | 40.48GB |
+ | [SambaLingo-Arabic-Base-70B.Q5_0.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/tree/main/) | Q5_0 | 44.49GB |
+ | [SambaLingo-Arabic-Base-70B.Q5_K_S.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/tree/main/) | Q5_K_S | 44.49GB |
+ | [SambaLingo-Arabic-Base-70B.Q5_K.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/tree/main/) | Q5_K | 45.7GB |
+ | [SambaLingo-Arabic-Base-70B.Q5_K_M.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/tree/main/) | Q5_K_M | 45.7GB |
+ | [SambaLingo-Arabic-Base-70B.Q5_1.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/tree/main/) | Q5_1 | 48.51GB |
+ | [SambaLingo-Arabic-Base-70B.Q6_K.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/tree/main/) | Q6_K | 53.02GB |
+ | [SambaLingo-Arabic-Base-70B.Q8_0.gguf](https://huggingface.co/RichardErkhov/sambanovasystems_-_SambaLingo-Arabic-Base-70B-gguf/tree/main/) | Q8_0 | 68.67GB |
+
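The table above trades file size against quantization precision. As a rough aid for choosing a file, the sketch below picks the largest listed quant whose file fits a given size budget. The names `QUANT_SIZES_GB`, `gguf_filename`, and `pick_quant` are hypothetical helpers, the sizes are copied from a subset of the rows above, and file size is only a rough proxy for memory use, since a runtime also needs room for the KV cache and working buffers.

```python
from typing import Optional

# Sizes in GB, copied from a subset of the table rows (one entry per size tier).
QUANT_SIZES_GB = {
    "Q2_K": 23.94,
    "Q3_K_M": 31.23,
    "Q4_K_M": 38.85,
    "Q5_K_M": 45.7,
    "Q6_K": 53.02,
    "Q8_0": 68.67,
}


def gguf_filename(quant: str) -> str:
    """Build a file name following the naming pattern used in the table."""
    return f"SambaLingo-Arabic-Base-70B.{quant}.gguf"


def pick_quant(budget_gb: float) -> Optional[str]:
    """Return the largest quant whose file size fits the budget, or None."""
    fitting = {q: size for q, size in QUANT_SIZES_GB.items() if size <= budget_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    choice = pick_quant(40.0)
    print(choice, gguf_filename(choice) if choice else "no quant fits")
```

Leaving headroom below the real memory limit is advisable, because the budget here accounts only for the weights file.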
+
+
+
+ Original model description:
+ ---
+ license: llama2
+ datasets:
+ - uonlp/CulturaX
+ language:
+ - ar
+ - en
+ metrics:
+ - chrf
+ - accuracy
+ - bleu
+ ---
+
+
+
+ # SambaLingo-Arabic-Base-70B
+
+ <img src="SambaLingo_Logo.png" width="340" style="margin-left:auto; margin-right:auto; display:block"/>
+
+ <!-- Provide a quick summary of what the model is/does. -->
+ SambaLingo-Arabic-Base-70B is a pretrained bilingual Arabic and English model that adapts [Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b-hf) to Arabic by training on 28 billion tokens from the Arabic split of the [Cultura-X](https://huggingface.co/datasets/uonlp/CulturaX) dataset. This model reports state-of-the-art evaluation results in perplexity and FLORES-200 translation. For the chat version of this model, please see [sambanovasystems/SambaLingo-Arabic-Chat-70B](https://huggingface.co/sambanovasystems/SambaLingo-Arabic-Chat-70B), or try it out at [SambaLingo-chat-space](https://huggingface.co/spaces/sambanovasystems/SambaLingo-chat-space).
+
+
+ ## Model Description
+ <!-- Provide a longer summary of what this model is. -->
+
+ - **Developed by:** [SambaNova Systems](https://sambanova.ai/)
+ - **Model type:** Language Model
+ - **Language(s):** Arabic, English
+ - **Finetuned from model:** [Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b-hf)
+ - **Paper:** [SambaLingo: Teaching Large Language Models New Languages](https://arxiv.org/abs/2404.05829)
+ - **Blog Post:** [sambalingo-open-source-language-experts](https://sambanova.ai/blog/sambalingo-open-source-language-experts)
+
+ ## Getting Started
+
+ ### Loading Model With Hugging Face
+ ```python
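+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load the tokenizer, which includes the extended Arabic vocabulary.
+ tokenizer = AutoTokenizer.from_pretrained("sambanovasystems/SambaLingo-Arabic-Base-70B")
+ # device_map="auto" places the weights across the available devices (requires the accelerate package);
+ # torch_dtype="auto" loads them in the dtype stored in the checkpoint.
+ model = AutoModelForCausalLM.from_pretrained("sambanovasystems/SambaLingo-Arabic-Base-70B", device_map="auto", torch_dtype="auto")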
+ ```
+
+ ### Suggested Inference Parameters
+ We suggest setting `do_sample=False` (deterministic decoding instead of sampling), as this is a pretrained checkpoint.
+
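A minimal sketch of how the suggested setting could be passed to `generate`, assuming the original (non-GGUF) checkpoint is loaded with `transformers` as in the loading snippet. The names `GREEDY_KWARGS` and `generate_greedy` and the `max_new_tokens=64` value are illustrative assumptions, not settings recommended by the model authors.

```python
# Generation settings: do_sample=False follows the suggestion above;
# max_new_tokens is an arbitrary illustrative value.
GREEDY_KWARGS = {"do_sample": False, "max_new_tokens": 64}


def generate_greedy(prompt, model_id="sambanovasystems/SambaLingo-Arabic-Base-70B"):
    """Load the model and return a deterministic continuation of `prompt`."""
    # Imported lazily so the module can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", torch_dtype="auto"
    )
    # Move the tokenized prompt to the device that holds the model inputs.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, **GREEDY_KWARGS)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Loading a 70B checkpoint this way needs substantial accelerator memory, so the function is best called once and its model reused in real code.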
+ ### Prompting Guidelines
+ This model is a pretrained checkpoint, so to use it effectively, use few-shot prompting with exemplars. The only other prompt templating required is the standard \<s\> (BOS) token from the Llama tokenizer. To interact with the model through direct questions or queries, use the chat version, which has been aligned with human preferences: [sambanovasystems/SambaLingo-Arabic-Chat](https://huggingface.co/sambanovasystems/SambaLingo-Arabic-Chat).
+
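One possible way to assemble a few-shot prompt is sketched below. The `English:`/`Arabic:` labels, the blank-line separator, and the function name `build_few_shot_prompt` are assumptions made for illustration; the model card does not prescribe a specific exemplar layout. The string omits `<s>` on the assumption that the tokenizer prepends the BOS token itself when encoding.

```python
def build_few_shot_prompt(exemplars, query, input_label="English", output_label="Arabic"):
    """Join (input, output) exemplars and leave the final output open for the model."""
    blocks = [
        f"{input_label}: {source}\n{output_label}: {target}"
        for source, target in exemplars
    ]
    # The last block ends right after the output label so the model completes it.
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    examples = [("Hello", "مرحبا"), ("Thank you", "شكرا")]
    print(build_few_shot_prompt(examples, "Good morning"))
```

Keeping every exemplar in the same layout gives the pretrained model a pattern to continue, which is the point of few-shot prompting with a base checkpoint.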
+ ## Training Details
+ All pre-training is done on the [Cultura-X](https://huggingface.co/datasets/uonlp/CulturaX) dataset. We mix the data so that 75% comes from the language we are adapting to and 25% is English, as suggested by [Csaki et al.](https://arxiv.org/abs/2311.05741). We pack the data into sequences of length 4096 and ensure that, when learning a token, the model attends only to previous tokens within the same text document. We train with a global batch size of 1024, a sequence length of 4096, a maximum learning rate of 1e-4 with cosine decay, a warmup ratio of 0.01, and a weight decay of 0.1.
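The packing constraint described above can be illustrated with a small sketch: a causal mask that also blocks attention across document boundaries inside one packed sequence. This is a toy pure-Python illustration with the hypothetical name `packed_causal_mask`, not the training code used for the model.

```python
def packed_causal_mask(doc_lengths):
    """Return mask[i][j] == True when token i may attend to token j.

    Tokens are laid out document after document in one packed sequence.
    Attention is allowed only to earlier-or-same positions (causal) that
    belong to the same document.
    """
    # Document index of every token position in the packed sequence.
    doc_ids = [doc for doc, length in enumerate(doc_lengths) for _ in range(length)]
    total = len(doc_ids)
    return [
        [j <= i and doc_ids[i] == doc_ids[j] for j in range(total)]
        for i in range(total)
    ]


if __name__ == "__main__":
    # Two documents of 2 and 3 tokens packed into a sequence of length 5.
    for row in packed_causal_mask([2, 3]):
        print("".join("x" if allowed else "." for allowed in row))
```

In the printed pattern, the first token of the second document sees only itself, which shows that nothing leaks in from the document packed before it.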
+
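The optimizer settings listed in the training details can be turned into a small worked example. The sketch below assumes linear warmup and decay to zero, neither of which is stated in the model card, and the step count is only a rough estimate that treats the 28 billion tokens from the summary as the whole training stream. All names are hypothetical.

```python
import math

GLOBAL_BATCH_SIZE = 1024
SEQUENCE_LENGTH = 4096
# Tokens consumed per optimizer step when every packed sequence is full.
TOKENS_PER_STEP = GLOBAL_BATCH_SIZE * SEQUENCE_LENGTH  # 4,194,304

# Rough estimate only: assumes 28 billion tokens in total, seen once.
ESTIMATED_STEPS = round(28e9 / TOKENS_PER_STEP)


def learning_rate(step, total_steps, max_lr=1e-4, warmup_ratio=0.01, min_lr=0.0):
    """Cosine decay with warmup (linear warmup and min_lr=0 are assumptions)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp up linearly until the maximum learning rate is reached.
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(TOKENS_PER_STEP, ESTIMATED_STEPS)
    for step in (0, 66, ESTIMATED_STEPS // 2, ESTIMATED_STEPS):
        print(step, learning_rate(step, ESTIMATED_STEPS))
```

Weight decay (0.1) is not part of the schedule; it would be set on the optimizer instead.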
+ ## Tokenizer Details
+ We extended the vocabulary of the base Llama model from 32,000 tokens to 57,000 tokens by adding up to 25,000 non-overlapping tokens from the new language.
+
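The idea of adding "up to 25,000 non-overlapping tokens" can be sketched as a simple filter: candidates already present in the base vocabulary are skipped, and additions stop at the cap, so 32,000 base tokens grow to at most 57,000. How the candidate tokens were actually selected is not described in this section, so `candidate_tokens` and the function name `extend_vocab` are assumptions.

```python
def extend_vocab(base_vocab, candidate_tokens, max_new=25_000):
    """Append candidates that are not already in the vocabulary, up to max_new."""
    seen = set(base_vocab)
    extended = list(base_vocab)
    added = 0
    for token in candidate_tokens:
        if added >= max_new:
            break
        if token in seen:
            continue  # overlapping token: already covered by the vocabulary
        seen.add(token)
        extended.append(token)
        added += 1
    return extended


if __name__ == "__main__":
    base = ["a", "b"]
    print(extend_vocab(base, ["b", "c", "c", "d", "e"], max_new=2))
```

Existing token ids keep their positions because new tokens are only appended, which is what lets the base model's embeddings be reused for the original vocabulary.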
+ ## Evaluation
+ For evaluation results, see our paper: [SambaLingo: Teaching Large Language Models New Languages](https://arxiv.org/abs/2404.05829)
+
+ ## Uses
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+ Use of this model is governed by Meta’s [Llama 2 Community License Agreement](https://ai.meta.com/llama/license/). Please review and accept the license before downloading the model weights.
+
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+ SambaLingo should NOT be used for:
+
+ - Mission-critical applications
+ - Applications that involve the safety of others
+ - Making highly important decisions
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ Like all LLMs, SambaLingo has certain limitations:
+ - Hallucination: The model may sometimes generate responses that contain plausible-sounding but factually incorrect or irrelevant information.
+ - Code Switching: The model might unintentionally switch between languages or dialects within a single response, affecting the coherence and understandability of the output.
+ - Repetition: The model may produce repetitive phrases or sentences, leading to less engaging and informative responses.
+ - Coding and Math: The model's performance in generating accurate code or solving complex mathematical problems may be limited.
+ - Toxicity: The model could inadvertently generate responses containing inappropriate or harmful content.
+
+ ## Acknowledgments
+ We extend our heartfelt gratitude to the open-source AI community; this endeavor would not have been possible without open source. SambaNova embraces the open-source community and aspires to actively contribute to this initiative.
+
+ We would like to give a special thanks to the following groups:
+ - Meta for open sourcing Llama 2 and the FLORES-200 dataset
+ - Nguyen et al. for open sourcing the CulturaX dataset
+ - CohereAI for releasing AYA-101 and open sourcing a multilingual instruction tuning dataset
+ - EleutherAI for their open source evaluation framework
+ - Hugging Face-H4 team for open sourcing the Zephyr training recipe and alignment handbook repo
+
+
+ ## Cite SambaLingo
+ ```
+ @misc{csaki2024sambalingo,
+ title={SambaLingo: Teaching Large Language Models New Languages},
+ author={Zoltan Csaki and Bo Li and Jonathan Li and Qiantong Xu and Pian Pawakapan and Leon Zhang and Yun Du and Hengyu Zhao and Changran Hu and Urmish Thakker},
+ year={2024},
+ eprint={2404.05829},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }
+ ```
+