---
datasets:
- csebuetnlp/xlsum
language:
- am
- ar
- az
- bn
- my
- zh
- en
- fr
- gu
- ha
- hi
- ig
- id
- ja
- rn
- ko
- ky
- mr
- ne
- om
- ps
- fa
- pcm
- pt
- pa
- ru
- gd
- sr
- si
- so
- es
- sw
- ta
- te
- th
- ti
- tr
- uk
- ur
- uz
- vi
- cy
- yo
multilinguality:
- multilingual
pipeline_tag: summarization
---

# Model Card: DeltaLM-base fine-tuned on XLSum

This model is a fine-tuned version of [DeltaLM-base](https://huggingface.co/nguyenvulebinh/deltalm-base) on the [XLSum dataset](https://huggingface.co/datasets/csebuetnlp/xlsum), intended for abstractive multilingual summarization.

It achieves the following results on the evaluation set:

- ROUGE-1: 18.2
- ROUGE-2: 7.6
- ROUGE-L: 14.9
- ROUGE-Lsum: 14.7

## Dataset description

The [XLSum dataset](https://huggingface.co/datasets/csebuetnlp/xlsum) is a comprehensive and diverse dataset comprising 1.35 million professionally annotated article-summary pairs from the BBC, extracted using a set of carefully designed heuristics. It covers 45 languages ranging from low- to high-resource, for many of which no public summarization dataset was previously available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.

## Languages

- amharic
- arabic
- azerbaijani
- bengali
- burmese
- chinese_simplified
- chinese_traditional
- english
- french
- gujarati
- hausa
- hindi
- igbo
- indonesian
- japanese
- kirundi
- korean
- kyrgyz
- marathi
- nepali
- oromo
- pashto
- persian
- pidgin
- portuguese
- punjabi
- russian
- scottish_gaelic
- serbian_cyrillic
- serbian_latin
- sinhala
- somali
- spanish
- swahili
- tamil
- telugu
- thai
- tigrinya
- turkish
- ukrainian
- urdu
- uzbek
- vietnamese
- welsh
- yoruba

## Training hyperparameters

The model was trained on a p4d.24xlarge instance on AWS SageMaker with the following configuration (see the sketch after this list for how it maps onto standard `transformers` arguments):

- model: DeltaLM-base
- batch size: 8
- learning rate: 1e-5
- number of epochs: 3
- warmup steps: 500
- weight decay: 0.01
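
The snippet below is a minimal sketch of how the listed hyperparameters and the XLSum data map onto a standard `transformers` + `datasets` setup; it is not the authors' exact SageMaker training script. The `output_dir` value and the `"english"` language config are illustrative choices, and loading the DeltaLM checkpoint itself may require the custom model code from its repository.

```python
# Sketch only: reproduces the documented hyperparameters with the standard
# Hugging Face training API; the original SageMaker script is not published.
from datasets import load_dataset
from transformers import Seq2SeqTrainingArguments

# XLSum exposes one config per language, e.g. "english", "amharic", "yoruba".
dataset = load_dataset("csebuetnlp/xlsum", "english")

training_args = Seq2SeqTrainingArguments(
    output_dir="deltalm-base-xlsum",   # illustrative output path
    per_device_train_batch_size=8,     # batch size: 8
    learning_rate=1e-5,                # learning rate: 1e-5
    num_train_epochs=3,                # number of epochs: 3
    warmup_steps=500,                  # warmup steps: 500
    weight_decay=0.01,                 # weight decay: 0.01
    predict_with_generate=True,        # generate summaries for ROUGE evaluation
)
```

These arguments would then be passed to a `Seq2SeqTrainer` together with the tokenized XLSum splits and the DeltaLM-base model.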