metadata
datasets:
- csebuetnlp/xlsum
language:
- am
- ar
- az
- bn
- my
- zh
- en
- fr
- gu
- ha
- hi
- ig
- id
- ja
- rn
- ko
- ky
- mr
- ne
- om
- ps
- fa
- pcm
- pt
- pa
- ru
- gd
- sr
- si
- so
- es
- sw
- ta
- te
- th
- ti
- tr
- uk
- ur
- uz
- vi
- cy
- yo
multilinguality:
- multilingual
pipeline_tag: summarization
Model Card for Model ID
This model is a fine-tuned version of DeltaLM-base on the XL-Sum dataset, intended for abstractive multilingual summarization.
It achieves the following results on the evaluation set:
- rouge-1: 18.2
- rouge-2: 7.6
- rouge-l: 14.9
- rouge-lsum: 14.7
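For readers unfamiliar with the metric, the following is a minimal sketch of ROUGE-1 and ROUGE-2 as n-gram overlap F1. It is not the scorer used to produce the numbers above: it assumes the reported figures are F1 scores on a 0-100 scale, and it uses naive whitespace tokenization, which is inadequate for languages such as Chinese, Japanese, or Thai. The helper names `ngrams` and `rouge_n_f1` are illustrative only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference, candidate, n=1):
    """Simplified ROUGE-N F1 using whitespace tokens (illustrative only)."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(round(100 * rouge_n_f1(reference, candidate, n=1), 1))  # 83.3
print(round(100 * rouge_n_f1(reference, candidate, n=2), 1))  # 60.0
```

ROUGE-L and ROUGE-Lsum rely on the longest common subsequence rather than fixed-length n-grams, so they are not covered by this sketch.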
Dataset description
The XL-Sum dataset is a comprehensive and diverse dataset comprising 1.35 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 45 languages ranging from low- to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.
Languages
- amharic
- arabic
- azerbaijani
- bengali
- burmese
- chinese_simplified
- chinese_traditional
- english
- french
- gujarati
- hausa
- hindi
- igbo
- indonesian
- japanese
- kirundi
- korean
- kyrgyz
- marathi
- nepali
- oromo
- pashto
- persian
- pidgin
- portuguese
- punjabi
- russian
- scottish_gaelic
- serbian_cyrillic
- serbian_latin
- sinhala
- somali
- spanish
- swahili
- tamil
- telugu
- thai
- tigrinya
- turkish
- ukrainian
- urdu
- uzbek
- vietnamese
- welsh
- yoruba
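The metadata block lists 43 language codes, while the list above has 45 entries, because Chinese and Serbian each appear in two script variants. A small sketch of that correspondence is shown below. The pairing of codes to names is inferred from the order of the two lists in this card, and the dictionary name `CODE_TO_NAMES` is hypothetical. The names are assumed to match the per-language configuration names of `csebuetnlp/xlsum`; check the dataset page before relying on them.

```python
# Language codes from the card metadata mapped to the names listed above.
# Inferred from this card; not an official mapping.
CODE_TO_NAMES = {
    "am": ["amharic"], "ar": ["arabic"], "az": ["azerbaijani"],
    "bn": ["bengali"], "my": ["burmese"],
    "zh": ["chinese_simplified", "chinese_traditional"],
    "en": ["english"], "fr": ["french"], "gu": ["gujarati"],
    "ha": ["hausa"], "hi": ["hindi"], "ig": ["igbo"],
    "id": ["indonesian"], "ja": ["japanese"], "rn": ["kirundi"],
    "ko": ["korean"], "ky": ["kyrgyz"], "mr": ["marathi"],
    "ne": ["nepali"], "om": ["oromo"], "ps": ["pashto"],
    "fa": ["persian"], "pcm": ["pidgin"], "pt": ["portuguese"],
    "pa": ["punjabi"], "ru": ["russian"], "gd": ["scottish_gaelic"],
    "sr": ["serbian_cyrillic", "serbian_latin"],
    "si": ["sinhala"], "so": ["somali"], "es": ["spanish"],
    "sw": ["swahili"], "ta": ["tamil"], "te": ["telugu"],
    "th": ["thai"], "ti": ["tigrinya"], "tr": ["turkish"],
    "uk": ["ukrainian"], "ur": ["urdu"], "uz": ["uzbek"],
    "vi": ["vietnamese"], "cy": ["welsh"], "yo": ["yoruba"],
}

# Flatten to the full list of dataset language names.
all_names = [name for names in CODE_TO_NAMES.values() for name in names]

print(len(CODE_TO_NAMES), len(all_names))  # 43 45
print(CODE_TO_NAMES["sr"])
```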
Training hyperparameters
The model was trained on a p4d.24xlarge instance on AWS SageMaker with the following configuration:
- model: DeltaLM-base
- batch size: 8
- learning rate: 1e-5
- number of epochs: 3
- warmup steps: 500
- weight decay: 0.01
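To illustrate how the warmup setting interacts with the learning rate, here is a minimal sketch of a learning-rate schedule built from the values above. The card does not state the schedule shape or the total number of optimizer steps, so linear warmup followed by linear decay to zero is an assumption, and `total_steps=10_000` is a hypothetical value chosen only for the example. `TrainConfig` and `lr_at` are illustrative names, not part of the training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as listed in this card."""
    batch_size: int = 8
    learning_rate: float = 1e-5
    num_epochs: int = 3
    warmup_steps: int = 500
    weight_decay: float = 0.01


def lr_at(step, total_steps, cfg=TrainConfig()):
    """Learning rate at a given step: linear warmup, then linear decay (assumed)."""
    if step < cfg.warmup_steps:
        # Ramp up from 0 to the peak learning rate over the warmup steps.
        return cfg.learning_rate * step / cfg.warmup_steps
    # Decay linearly from the peak to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return cfg.learning_rate * remaining / max(1, total_steps - cfg.warmup_steps)


total_steps = 10_000  # hypothetical; the real value depends on dataset size and batching
for step in (0, 250, 500, 5_250, 10_000):
    print(step, lr_at(step, total_steps))
```

Under these assumptions the rate reaches its peak of 1e-5 at step 500 and falls to zero at the final step.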