|
--- |
|
language: |
|
- id |
|
license: other |
|
license_name: tongyi-qianwen |
|
--- |
|
|
|
# Bahasa-4b Model Report |
|
|
|
## Model Name |
|
**Bahasa-4b** |
|
|
|
## Model Detail |
|
Bahasa-4b was continue-pretrained from qwen-4b on 10 billion high-quality Indonesian text samples. The model outperforms some 4b models, and even some 7b models, on Indonesian tasks.
|
|
|
## Model Developers |
|
Bahasa AI |
|
|
|
## Intended Use |
|
This model is intended for various NLP tasks that require understanding and generating Indonesian language. It is suitable for applications such as question answering, sentiment analysis, document summarization, and more. |
|
|
|
## Training Data |
|
Bahasa-4b was trained on a 10 billion subset of Indonesian data selected from a collected pool of 100 billion.
|
|
|
## Benchmarks |
|
The following table shows the performance of Bahasa-4b compared to the models Sailor_4b and Mistral-7B-v0.1 across several benchmarks: |
|
|
|
| Dataset | Version | Metric | Mode | Sailor_4b | Bahasa-4b-hf | Mistral-7B-v0.1 | |
|
|----------------|---------|--------|------|-----------|--------------|-----------------| |
|
| tydiqa-id | 0e9309 | EM | gen | 53.98 | 55.04 | 63.54 | |
|
| tydiqa-id | 0e9309 | F1 | gen | 73.48 | 75.39 | 78.73 | |
|
| xcopa-id | 36c11c | EM | ppl | 69.20 | 73.20 | 62.40 |

| xcopa-id | 36c11c | F1 | ppl | 69.20 | 73.20 | - |
|
| m3exam-id-ppl | ede415 | EM | ppl | 31.27 | 44.47 | 26.68 | |
|
| belebele-id-ppl| 7fe030 | EM | ppl | 41.33 | 42.33 | 41.33 | |
|
|
|
These results show that Bahasa-4b consistently outperforms the Sailor_4b model across Indonesian language tasks, improving both EM (Exact Match) and F1 scores on every benchmark, and is competitive with the larger Mistral-7B-v0.1 model.
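For reference, a minimal sketch of how the EM and F1 metrics in the table are typically computed for extractive QA tasks such as tydiqa-id. The normalization here (lowercasing, whitespace tokenization) is an assumption for illustration, not the exact evaluation harness used for these benchmarks:

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """EM: 1.0 if the normalized prediction equals the reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction of "ibu kota indonesia" against the reference "ibu kota" scores EM 0.0 but F1 0.8, which is why F1 is usually higher than EM on the same dataset, as in the tydiqa-id rows above.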