LLaMA-3-MERaLiON-8B-Instruct
LLaMA-3-MERaLiON-8B-Instruct is a large language model (LLM) designed to excel in multilingual understanding and instruction-following tasks. It builds on the Llama-3-8B architecture: starting from Llama-3-8B-Base, it was enhanced through extensive continued pretraining on carefully curated data and careful merging of model weights.
- Developed by: I2R, A*STAR
- Model type: Text Decoder
- Language(s): Multilingual, primarily English, Chinese and Indonesian
- License: MERaLiON Public License + Notice: Meta Llama 3 is licensed under the Meta Llama 3 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.
For details on background, pre-training, tuning experiments and evaluation, please refer to our technical report.
Acknowledgement
This research is supported by the National Research Foundation, Singapore and Infocomm Media Development Authority, Singapore under its National Large Language Models Funding Initiative. The computing resources and platforms are supported by Singapore NSCC Aspire2A+ and The TPU Research Cloud. We thank all contributors and collaborators who have made this effort possible.
Model Overview
MERaLiON-LLaMA-3-8B-Instruct is primarily trained on English, Chinese, and Indonesian, with a particular emphasis on elevating its understanding and generation capabilities in Chinese and Indonesian. By integrating corpus mixing strategies developed for regional multilingual datasets, we carefully diversified the training content through domain classification, hyperparameter tuning, and replay strategies. These measures not only help the model retain knowledge without catastrophic forgetting but also significantly enhance its performance in producing high-quality, contextually accurate responses within these Southeast Asian language contexts.
Key advancements include:
- Extended Pretraining: Continued pretraining on over 120 billion tokens of primarily English, Chinese, and Indonesian text.
- SEA Multilingual Corpus Mixing: Drawing on strategies from English, Chinese, Indonesian and Malay corpora to enhance language understanding and generation capabilities.
- Domain-Diversified Pretraining Corpus: Careful selection and classification of training data from a wide range of topics and genres.
- Optimized Training Techniques: Implementing replay strategies and carefully selected hyperparameters to ensure stability, maintain quality, and avoid catastrophic forgetting.
- Instruction Tuning via Model Merging: Rather than a standard instruction-tuning pipeline, this model was derived by merging the official Llama-3.1-8B-base and Llama-3.1-8B-instruct models to produce superior instruction-following capabilities without additional supervised instruction data.
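The replay idea mentioned above can be illustrated with a small sampling sketch. This is a minimal illustration only, assuming replay means mixing a fixed share of documents resembling the original pretraining distribution back into the continued-pretraining stream. The function name `mix_with_replay`, the `replay_ratio` value, and the toy corpora are hypothetical and not taken from the actual training setup.

```python
import random
from typing import List, Sequence


def mix_with_replay(
    new_docs: Sequence[str],
    replay_docs: Sequence[str],
    replay_ratio: float,
    n_samples: int,
    seed: int = 0,
) -> List[str]:
    """Draw a training stream where roughly `replay_ratio` of samples are replayed.

    Hypothetical helper: the real pipeline's ratios and sampling scheme are not
    described in this document.
    """
    if not 0.0 <= replay_ratio <= 1.0:
        raise ValueError("replay_ratio must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the mixture is reproducible
    mixture = []
    for _ in range(n_samples):
        # Pick the source corpus first, then a document from that corpus.
        source = replay_docs if rng.random() < replay_ratio else new_docs
        mixture.append(rng.choice(source))
    return mixture


# Toy corpora standing in for new regional data and replayed English data.
new_corpus = ["zh_doc_1", "id_doc_1", "id_doc_2"]
replay_corpus = ["en_doc_1", "en_doc_2"]
stream = mix_with_replay(new_corpus, replay_corpus, replay_ratio=0.25, n_samples=8)
print(stream)
```

Raising `replay_ratio` trades exposure to new-language data for stronger retention of previously learned material, which is the balance the replay strategy is meant to manage.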
Highlights
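- Enhanced Performance: MERaLiON-LLaMA-3-8B-Instruct demonstrates improved results on benchmarks including Cross-MMLU, Cross-LogiQA, Cross-XQuAD, IndoMMLU, and CNEval. In the results reported below it surpasses the official Llama-3 models on Cross-MMLU, IndoMMLU, and CNEval; on Cross-LogiQA its average sits between Meta-Llama-3-8B-Instruct and Meta-Llama-3.1-8B-Instruct.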
- Extensive Multilingual Support: Strong coverage of English, Chinese, and Indonesian text, coupled with strategies inspired by Southeast Asian multilingual approaches, ensures robust understanding of and responsiveness to diverse linguistic inputs.
Model Specifications
- Model Type: Decoder
- Architecture: Llama-3.1-8B
- Context Length: 8192 tokens
- Languages: English, Chinese, Indonesian
Benchmark Performance
This benchmark analysis organizes models into LLaMA series and non-LLaMA series, providing a clear contextual framework to evaluate performance within each category while accounting for variations in baseline performance.
MERaLiON-LLaMA-3-8B-Instruct demonstrates notable advancements over official LLaMA-3 models, underscoring the effectiveness of continued pretraining strategies such as corpus mixing, replay for knowledge retention, and model merging. These techniques contribute to significant gains in multilingual reasoning, domain-specific tasks, and performance across English, Chinese and Indonesian.
To ensure fairness and consistency in evaluation, we employ a standardized benchmarking pipeline that uses an LLM as a judge. This approach accommodates diverse model output formats, providing robust and unbiased comparisons across benchmarks.
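To make the LLM-as-a-judge idea concrete, here is a minimal sketch of how free-form responses could be scored. It is not the actual evaluation pipeline: the prompt template, the `CORRECT`/`INCORRECT` verdict format, and all function names are assumptions for illustration, and `stub_judge` is a trivial stand-in for a real judge model.

```python
from typing import Callable, Dict, List

# Hypothetical judge prompt; the real pipeline's wording is not given here.
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model response: {response}\n"
    "Does the response agree with the reference? Reply CORRECT or INCORRECT."
)


def build_judge_prompt(example: Dict[str, str]) -> str:
    """Fill the template from a dict with question, reference and response keys."""
    return JUDGE_TEMPLATE.format(**example)


def parse_verdict(judge_reply: str) -> bool:
    """Treat the reply as a pass only if its first word is CORRECT."""
    words = judge_reply.strip().upper().split()
    return bool(words) and words[0].strip(".,:") == "CORRECT"


def judged_accuracy(
    examples: List[Dict[str, str]], judge: Callable[[str], str]
) -> float:
    """Fraction of examples the judge marks as correct."""
    if not examples:
        return 0.0
    hits = sum(parse_verdict(judge(build_judge_prompt(ex))) for ex in examples)
    return hits / len(examples)


def stub_judge(prompt: str) -> str:
    """Stand-in judge: checks whether the reference text appears in the response."""
    fields = dict(line.split(": ", 1) for line in prompt.splitlines()[:3])
    ok = fields["Reference answer"].lower() in fields["Model response"].lower()
    return "CORRECT" if ok else "INCORRECT"


examples = [
    {"question": "Capital of Indonesia?", "reference": "Jakarta",
     "response": "The answer is (B) Jakarta."},
    {"question": "2+2?", "reference": "4", "response": "5"},
]
print(judged_accuracy(examples, stub_judge))
```

Because the judge reads the whole response rather than matching a fixed answer pattern, models that answer with a letter, a sentence, or an explanation can be scored by the same procedure.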
Key highlights from the evaluations include:
- Cross-MMLU, Cross-LogiQA: Enhanced reasoning and question-answering capabilities illustrate that continued pretraining improves multilingual understanding and accuracy over baseline Llama models.
- IndoMMLU and CNEval: Performance boosts in Indonesian and Chinese benchmarks highlight that careful corpus mixing and replay strategies help maintain and improve language-specific strengths.
Cross-MMLU
Model Series | Model | Link | English | Chinese | Indonesian | Malay | Avg (En/Zh/Id/Ms) |
---|---|---|---|---|---|---|---|
LLaMA Series | MERaLiON-LLaMA-3-8B-Instruct | | 0.847 | 0.693 | 0.713 | 0.613 | 0.717 |
LLaMA Series | Meta-Llama-3.1-8B-Instruct | Link | 0.82 | 0.633 | 0.66 | 0.647 | 0.690 |
LLaMA Series | Llama3-8B-CPT-SEA-LION-v2.1-Instruct | Link | 0.753 | 0.667 | 0.693 | 0.64 | 0.688 |
LLaMA Series | Meta-Llama-3-8B-Instruct | Link | 0.767 | 0.653 | 0.573 | 0.573 | 0.642 |
Non-LLaMA Series | GPT4o-0513 | Link | 0.927 | 0.887 | 0.88 | 0.907 | 0.900 |
Non-LLaMA Series | Gemma-2-9B-IT | Link | 0.84 | 0.793 | 0.78 | 0.747 | 0.790 |
Non-LLaMA Series | Gemma2-9B-CPT-SEA-Lion-v3-Instruct | Link | 0.847 | 0.787 | 0.793 | 0.733 | 0.790 |
Non-LLaMA Series | Qwen2.5-7B-Instruct | Link | 0.847 | 0.84 | 0.753 | 0.713 | 0.788 |
Non-LLaMA Series | SeaLLMs-v3-7B-Chat | Link | 0.833 | 0.727 | 0.74 | 0.687 | 0.747 |
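The Avg column appears to be the unweighted mean of the four per-language scores. The short check below recomputes it for three LLaMA-series rows of the Cross-MMLU table; the scores are copied from the table, while the helper name `avg_matches` and the rounding tolerance are illustrative choices.

```python
from statistics import mean

# (English, Chinese, Indonesian, Malay) scores and the reported average,
# copied from the Cross-MMLU table.
CROSS_MMLU = {
    "MERaLiON-LLaMA-3-8B-Instruct": ([0.847, 0.693, 0.713, 0.613], 0.717),
    "Meta-Llama-3.1-8B-Instruct": ([0.82, 0.633, 0.66, 0.647], 0.690),
    "Meta-Llama-3-8B-Instruct": ([0.767, 0.653, 0.573, 0.573], 0.642),
}


def avg_matches(scores, reported, tol=1e-3):
    """True if the plain mean agrees with the reported value up to rounding."""
    return abs(mean(scores) - reported) < tol


for model, (scores, reported) in CROSS_MMLU.items():
    print(model, f"{mean(scores):.4f}", reported, avg_matches(scores, reported))
```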
Cross-LogiQA
Model Series | Model | Link | English | Chinese | Indonesian | Malay | Avg (En/Zh/Id/Ms) |
---|---|---|---|---|---|---|---|
LLaMA Series | Meta-Llama-3.1-8B-Instruct | Link | 0.585 | 0.585 | 0.455 | 0.523 | 0.537 |
LLaMA Series | MERaLiON-LLaMA-3-8B-Instruct | | 0.591 | 0.528 | 0.494 | 0.489 | 0.526 |
LLaMA Series | Meta-Llama-3-8B-Instruct | Link | 0.602 | 0.523 | 0.438 | 0.483 | 0.512 |
LLaMA Series | Llama3-8B-CPT-SEA-LION-v2.1-Instruct | Link | 0.528 | 0.517 | 0.403 | 0.443 | 0.473 |
Non-LLaMA Series | Qwen2.5-7B-Instruct | Link | 0.693 | 0.71 | 0.631 | 0.534 | 0.642 |
Non-LLaMA Series | Gemma-2-9B-IT | Link | 0.659 | 0.636 | 0.585 | 0.602 | 0.621 |
Non-LLaMA Series | Gemma2-9B-CPT-SEA-Lion-v3-Instruct | Link | 0.636 | 0.642 | 0.557 | 0.551 | 0.597 |
Non-LLaMA Series | SeaLLMs-v3-7B-Chat | Link | 0.568 | 0.585 | 0.494 | 0.517 | 0.541 |
IndoMMLU
Model Series | Model | Link | Accuracy |
---|---|---|---|
LLaMA Series | MERaLiON-LLaMA-3-8B-Instruct | | 0.576 |
LLaMA Series | Llama3-8B-CPT-SEA-LION-v2.1-Instruct | Link | 0.560 |
LLaMA Series | Meta-Llama-3.1-8B-Instruct | Link | 0.548 |
LLaMA Series | Meta-Llama-3-8B-Instruct | Link | 0.521 |
Non-LLaMA Series | GPT4o-0513 | Link | 0.760 |
Non-LLaMA Series | Gemma2-9B-CPT-SEA-Lion-v3-Instruct | Link | 0.626 |
Non-LLaMA Series | Gemma-2-9B-IT | Link | 0.621 |
Non-LLaMA Series | Qwen2.5-7B-Instruct | Link | 0.582 |
Non-LLaMA Series | SeaLLMs-v3-7B-Chat | Link | 0.541 |
CNEval
Model Series | Model | Link | Accuracy |
---|---|---|---|
LLaMA Series | MERaLiON-LLaMA-3-8B-Instruct | | 0.514 |
LLaMA Series | Llama3-8B-CPT-SEA-LION-v2.1-Instruct | Link | 0.505 |
LLaMA Series | Llama3-8B-CPT-SEA-Lion-v2-Instruct | Link | 0.495 |
LLaMA Series | Meta-Llama-3-8B-Instruct | Link | 0.467 |
LLaMA Series | Meta-Llama-3.1-8B-Instruct | Link | 0.457 |
Non-LLaMA Series | Qwen2-7B-Instruct | Link | 0.829 |
Non-LLaMA Series | GPT4o-0513 | Link | 0.81 |
Non-LLaMA Series | Qwen2.5-7B-Instruct | Link | 0.8 |
Non-LLaMA Series | Gemma2-9B-CPT-SEA-Lion-v3-Instruct | Link | 0.59 |
Non-LLaMA Series | Gemma-2-9B-IT | Link | 0.581 |
These results collectively show how the MERaLiON-LLaMA-3-8B-Instruct model builds upon the strengths of official Llama-3.1 variants. We plan to apply these techniques to the continued pretraining of other open-source models in future releases.
The complete evaluation results are available here, and our SeaEval benchmark paper can be accessed here.
Instruction-Following
By merging the official Llama-3.1-8B-base and Llama-3.1-8B-instruct weights, we inherit strong instruction-following behavior without additional instruction-tuning steps. The model can follow various user prompts accurately and coherently, producing well-structured, contextually relevant responses.
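The exact merging recipe is not specified here. One common way to transfer instruction-following behavior between checkpoints is task arithmetic: add the difference between the instruct and base weights to the continued-pretrained weights. The sketch below illustrates that technique on toy weight dictionaries; it is an assumption for illustration, not a description of the method actually used, and `merge_task_vector` and `scale` are hypothetical names.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values (toy stand-in)


def merge_task_vector(
    cpt: Weights, base: Weights, instruct: Weights, scale: float = 1.0
) -> Weights:
    """Return cpt + scale * (instruct - base), parameter by parameter.

    Illustrative task-arithmetic merge; assumes all three checkpoints share
    the same parameter names and shapes.
    """
    merged: Weights = {}
    for name, cpt_vals in cpt.items():
        # The "instruction delta" is what instruction tuning changed in the base model.
        delta = [i - b for i, b in zip(instruct[name], base[name])]
        merged[name] = [c + scale * d for c, d in zip(cpt_vals, delta)]
    return merged


# Toy checkpoints with a single two-element parameter.
cpt = {"w": [1.0, 2.0]}        # continued-pretrained weights
base = {"w": [0.5, 0.5]}       # official base weights
instruct = {"w": [1.0, 0.0]}   # official instruct weights
print(merge_task_vector(cpt, base, instruct))
```

With real checkpoints the same arithmetic would run over tensors rather than Python lists, and `scale` would control how strongly the instruction delta is applied.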
Usage
MERaLiON-LLaMA-3-8B-Instruct can be deployed using the 🤗 Transformers library. With careful device mapping and dtype settings, users can achieve efficient and high-quality text generation.
Example:
import transformers
import torch

model_id = "MERaLiON/MERaLiON-LLaMA-3-8B-Instruct"

# Load the model in bfloat16 and place it automatically on the available devices.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Chat-style input: a list of role/content messages.
messages = [
    {"role": "user", "content": "What is the sentiment of the following sentence?\nSentence: This book is incredibly dull.\nAnswer:"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)

# The last entry of generated_text is the assistant's reply message.
print(outputs[0]["generated_text"][-1])
Note: We use the same chat format as the official Llama-3.1-8B-Instruct.
Caveats and Limitations
Like many LLMs, MERaLiON-LLaMA-3-8B-Instruct may hallucinate or produce irrelevant or incorrect content. While we have taken steps to mitigate these issues, users are advised to critically evaluate outputs, especially in high-stakes applications. The model has not undergone explicit safety alignment and filtering; users should implement their own safeguards, content moderation, and evaluation strategies.
Safety and Liability
This model is not strongly safety-aligned. Users are responsible for implementing their own safety checks and mitigations. The authors and affiliated institutions are not liable for any damages or losses arising from the use of this model.
Technical Specifications
MERaLiON-LLaMA-3-8B-Instruct underwent continued pretraining using computational resources provided by Singapore NSCC Aspire2A+ and The TPU Research Cloud. We utilized diverse data sources and adaptive strategies to ensure stable training without catastrophic forgetting.
Compute and Training Platform
The training of MERaLiON-LLaMA-3-8B-Instruct was conducted using the MaxText platform, leveraging both NVIDIA H100 GPUs and TPU v4-128 chips. Specifically, we utilized 64 H100 GPUs, achieving approximately 400 TFLOPS per GPU, and TPU v4-128 configurations, attaining around 168 TFLOPS per TPU chip. These performance metrics were realized through optimized sharding, checkpoint strategies, and the selection of optimal batch sizes, ensuring efficient and effective model training.
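For a sense of scale, the throughput figures above can be combined with the common approximation that training costs about 6 FLOPs per parameter per token. The estimate below is a back-of-the-envelope sketch under stated assumptions: a nominal 8 billion parameters, exactly 120 billion tokens (the text says "over 120 billion"), the whole run on the 64 H100 GPUs alone, and 400 TFLOPS sustained throughout. None of these are confirmed details of the actual run.

```python
# Assumed inputs (nominal values; see the caveats in the text above).
PARAMS = 8e9             # model parameters
TOKENS = 120e9           # continued-pretraining tokens
NUM_GPUS = 64            # H100 GPUs
FLOPS_PER_GPU = 400e12   # approximate achieved FLOP/s per GPU


def estimate_training_hours(
    params: float, tokens: float, num_devices: int, flops_per_device: float
) -> float:
    """Estimate wall-clock hours using the ~6 * params * tokens FLOPs rule of thumb."""
    total_flops = 6.0 * params * tokens
    cluster_flops_per_second = num_devices * flops_per_device
    return total_flops / cluster_flops_per_second / 3600.0


hours = estimate_training_hours(PARAMS, TOKENS, NUM_GPUS, FLOPS_PER_GPU)
print(f"Estimated training time: {hours:.1f} hours")
```

Under these assumptions the estimate comes out at roughly 60 hours of pure compute; real runs take longer once checkpointing, restarts, and evaluation are included.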
Call for Contributions
We invite researchers, developers, and community members to contribute by:
- Identifying and reporting issues or biases.
- Providing additional pretraining or instruction data.
- Suggesting enhancements to documentation or evaluation metrics.
- Extending the model to support additional languages or domains.
Please visit our repository for more information and contribution guidelines.
Disclaimer
This repository contains the weights for a model not specifically aligned for safety. Users are advised to perform their own due diligence, safety fine-tuning, and compliance measures. The authors disclaim liability for any direct or indirect damages resulting from model use.