## Tokenizer Details

We extended the vocabulary of the base Llama model from 32,000 tokens to 57,000 tokens by adding up to 25,000 non-overlapping tokens from the new language.
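The extension step above can be sketched in plain Python. This is only an illustration of the "up to N non-overlapping tokens" logic with toy data; the function name, the toy vocabularies, and the budget are all hypothetical, not the actual SambaLingo pipeline.

```python
# Hypothetical sketch of the vocabulary-extension step: starting from a base
# vocabulary, add only candidate tokens from the new language that are not
# already present, capped at a budget of new slots.
def extend_vocab(base_vocab, candidate_tokens, budget):
    """Return base tokens plus up to `budget` non-overlapping candidates,
    preserving candidate order."""
    seen = set(base_vocab)
    added = []
    for tok in candidate_tokens:
        if len(added) >= budget:
            break
        if tok not in seen:
            seen.add(tok)
            added.append(tok)
    return base_vocab + added

# Toy numbers mirroring the 32,000 -> 57,000 extension in miniature:
base = [f"tok{i}" for i in range(32)]
candidates = [f"tok{i}" for i in range(20, 50)]  # tok20..tok31 overlap the base
merged = extend_vocab(base, candidates, budget=25)
print(len(base), len(merged))  # 18 non-overlapping candidates fit under the budget
```

In the real setting the merged vocabulary would also require resizing the model's embedding matrix to match the new token count.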

## Evaluation

|                              | SambaLingo-Thai-Base | typhoon-7b | bloom-7b1 | xglm-7.5B | mGPT-13B |
|------------------------------|----------------------|------------|-----------|-----------|----------|
| Perplexity (Lower Is Better) | **1.288**            | 1.373      | 1.834     | 1.394     | 1.966    |
| FLORES en->th (8 shot, CHRF) | **0.433**            | 0.347      | 0.095     | 0.198     | 0.032    |
| FLORES th->en (8 shot, CHRF) | **0.536**            | 0.465      | 0.138     | 0.431     | 0.016    |
| FLORES en->th (8 shot, BLEU) | **0.019**            | 0.004      | 0.000     | 0.003     | 0.000    |
| FLORES th->en (8 shot, BLEU) | **0.247**            | 0.188      | 0.003     | 0.147     | 0.000    |
| Belebele (3 shot)            | 37.11%               | **52.22%** | 24.11%    | 22.44%    | 26.89%   |
| SIB-200 (3 shot)             | 62.25%               | **75.49%** | 23.04%    | 63.73%    | 44.12%   |
| XCOPA (0 shot)               | **61.40%**           | 60.60%     | 55.40%    | 59.40%    | 52.80%   |
| XNLI (0 shot)                | **44.65%**           | 43.01%     | 34.87%    | 43.73%    | 39.24%   |
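For context on the perplexity row: perplexity is conventionally computed as the exponential of the mean per-token negative log-likelihood over a held-out corpus. A minimal sketch, using made-up per-token losses rather than any real SambaLingo evaluation data:

```python
# Perplexity = exp(mean negative log-likelihood) over all scored tokens.
# The NLL values below are illustrative placeholders, not measured numbers.
import math

def perplexity(token_nlls):
    """exp of the average per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))

nlls = [0.21, 0.30, 0.25, 0.27]  # per-token NLLs (illustrative)
print(round(perplexity(nlls), 3))
```

A lower value means the model assigns higher probability to the reference text, which is why the table marks lower as better for this row only.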

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->