Update README.md
README.md
@@ -50,6 +50,19 @@ All pre-training is done on the [Cultura-X](https://huggingface.co/datasets/uonl
## Tokenizer Details
We extended the vocabulary of the base Llama model from 32,000 tokens to 57,000 tokens by adding up to 25,000 non-overlapping tokens from the new language.
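
As a rough illustration of what this vocabulary extension involves (a minimal sketch, not the actual SambaLingo training code), the snippet below uses Hugging Face `transformers` to add new-language tokens to a Llama tokenizer and resize the embedding matrix to match. The base checkpoint name and the placeholder token list are assumptions.

```python
# Sketch only: one way to extend a Llama tokenizer with new-language tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Placeholder list: stands in for up to 25,000 tokens mined from the new-language
# corpus that do not already appear in the 32,000-token base vocabulary.
new_language_tokens = ["مثال", "جديدة"]
added = tokenizer.add_tokens(new_language_tokens)

# Grow the embedding matrix (and tied LM head) to the enlarged vocabulary;
# the new rows are then learned during continued pre-training.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {added} tokens; new vocab size: {len(tokenizer)}")
```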
## Evaluation
|                              | SambaLingo-Arabic-Base | Jais-13b   | bloomz-7 | xglm-7.5 | mGPT-13B |
|------------------------------|------------------------|------------|----------|----------|----------|
| Perplexity (Lower Is Better) | **1.422**              | 1.504      | 1.578    | 1.623    | 2.066    |
| FLORES en->ar (8 shot, CHRF) | **0.501**              | 0.476      | 0.259    | 0.415    | 0.138    |
| FLORES ar->en (8 shot, CHRF) | **0.610**              | 0.584      | 0.176    | 0.133    | 0.141    |
| FLORES en->ar (8 shot, BLEU) | **0.169**              |            | 0.011    | 0.009    | 0.003    |
| FLORES ar->en (8 shot, BLEU) | **0.339**              |            | 0.036    | 0.153    | 0.005    |
| Belebele (3 shot)            | **39.00%**             | 34.40%     | 29.00%   | 21.89%   | 23.67%   |
| SIB-200 (3 shot)             | 71.57%                 | **76.47%** | 63.24%   | 65.20%   | 46.57%   |
| XNLI (0 shot)                | 33.57%                 | **36.33%** | 33.79%   | 33.37%   | 33.43%   |
| XStoryCloze (0 shot)         | **66.25%**             | 63.34%     | 58.50%   | 56.19%   | 51.62%   |
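
As a hedged sketch of how a held-out perplexity number like the one above could be measured with this checkpoint (the Hub model id and the evaluation passage are assumptions, and this is not necessarily the exact protocol behind the table):

```python
# Minimal perplexity measurement on a single held-out passage.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sambanovasystems/SambaLingo-Arabic-Base"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

text = "..."  # a held-out passage in the target language
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Passing the input ids as labels makes the model return the mean token-level
    # cross-entropy loss; exponentiating it gives perplexity for this passage.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print(f"Perplexity: {torch.exp(loss).item():.3f}")
```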
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->