Opensource Low Resource Language Datasets to Supervised Finetune Language Models
MyanmarGPT was released in December 2023. After that, many questions have been in the community for requesting the SFT datasets to finetune the language models further.
Supervised finetuning (SFT) is a technique where a pre-trained model is further trained on a labeled dataset specific to a task to improve its performance on that task. This method leverages the model’s existing knowledge, acquired during pretraining, and adapts it to a more specific domain or problem.
In the year 2024, I released datasets in general and specific domains to finetune the instruction model. Here is the list of the collection of datasets.
Burmese Microbiology 1K Dataset
Link - https://huggingface.co/datasets/jojo-ai-mst/Burmese-Microbiology-1K
Paper - Burmese Microbiology 1K Dataset
The Burmese microbiology 1K dataset is a domain-specific knowledge dataset in microbiology. The dataset includes microbiology culture media and microbes, including bacteria, viruses, fungi, and parasites. The dataset was intended not only for finetuning the language model but also can be used for building RAG - Retrieval Augmented Generation powered applications in public health-related applications.
The dataset contains 1263 rows of questions and answers on microbiology in the Burmese language.
Myanmar Agriculture 1K Dataset
Link - https://huggingface.co/datasets/jojo-ai-mst/Myanmar-Agricutlure-1K
Myanmar Agriculture 1K Dataset is also a domain-specific knowledge dataset in the agriculture of Myanmar. The dataset includes how to grow plants and trees according to weather and soil conditions in Myanmar, climate changes, horticulture, and how to reduce carbon emissions.
The dataset contains 1053 rows of questions and answers.
Mpox Myanmar
Link - https://huggingface.co/datasets/jojo-ai-mst/Mpox-Myanmar
Mpox Myanmar is a dataset for a specific virus called Mpox. Year 2024, Mpox was a WHO-alerted outbreak throughout the world. Thus, to provide information on Mpox, the dataset was curated on the WHO articles and MM Gov website articles.
The dataset contains 99 rows of questions, answers, and metadata.
Roleplay-Burmese
Link - https://huggingface.co/datasets/jojo-ai-mst/Roleplay-Burmese
Roleplay-Burmese is a part of the collection of Multilingual Roleplay dataset. The multilingual roleplay dataset is a collection of datasets for roleplaying in different low resource languages. These include languages in Southeast Asian countries, African countries, and other low-resource languages around the world.
The original roleplay dataset is the GPTeacher roleplay dataset by teknium 1, which is translated into many languages by the Google translate engine, and released under MIT License for academic and research purposes.
The dataset contains 1923 rows of instruction, input, and response.
Multilingual Roleplay
Link - https://huggingface.co/collections/jojo-ai-mst/multilingual-roleplay-66f91668cb7628aaef4af6ed
The idea started from the Roleplay-Burmese dataset. Many languages have low resources in the world. Thus, roleplay datasets need to be curated for those languages too.
This collection of datasets is about roleplaying in low-resource languages. Languages included are
- Burmese (my)
- Lao (lo)
- Khmer (khm)
- Malay (ms)
- Vietnam (vi)
- Thai (th)
- Hindi (hi)
- Indonesian (id)
- Filipino (fil)
- Bengali (bn)
- Afrikaans (af)
- Albanian (sq)
- Amharic (am)
- Georgian (ka)
- Irish (ga)
- Zulu (zu)
- Serbian (sr)
- Kinyarwanda (rw)
- Somali (so)
- Kurdish (ku)
- Huasa (ha)
- Icelandic (is)
- Nepali (ne)
- Panjabi/Punjabi (pa)
- Tamil (ta)
- Yiddish (yi)
- Hebrew (he)
- Azerbaijani (az)
- Kazakh (kk)
- Cebuano (ceb)
More languages to add to this collection are
- Turkish (tr)
- Finnish (fin)
- Czech (cs)
- Norwegian (no)
- Mongolian (mn)
- Lithuanian (lt)
Rakhine Proverbs
Link - https://huggingface.co/datasets/jojo-ai-mst/Rakhine-Proverbs
Rakhine/Arakan language is a language in the Rakhine state of Myanmar. It is a low-resource language. The dataset was released under a public domain license. The proverbs are summarized and extracted from "ဥပမာစုံ၊ ရခိုင်စကားပုံ။" ကျမ်း by "အရှင်စက္ကိန္ဒ, အရှင်ဝါသဝ" published in 1996, August.
The dataset contains 221 rows of proverbs in Rakhine language.
MyanmarGPT-Movement
These datasets were released under "myanmargpt-movement" 2024 year activities.