---
datasets:
  - csebuetnlp/xlsum
language:
  - am
  - ar
  - az
  - bn
  - my
  - zh
  - en
  - fr
  - gu
  - ha
  - hi
  - ig
  - id
  - ja
  - rn
  - ko
  - ky
  - mr
  - ne
  - om
  - ps
  - fa
  - pcm
  - pt
  - pa
  - ru
  - gd
  - sr
  - si
  - so
  - es
  - sw
  - ta
  - te
  - th
  - ti
  - tr
  - uk
  - ur
  - uz
  - vi
  - cy
  - yo
multilinguality:
  - multilingual
pipeline_tag: summarization
---

# Model Card for deltalm-base-xlsum

This model is a fine-tuned version of DeltaLM-base on the XLSum dataset, aimed at abstractive multilingual summarization.

It achieves the following results on the evaluation set:

- ROUGE-1: 18.2
- ROUGE-2: 7.6
- ROUGE-L: 14.9
- ROUGE-Lsum: 14.7
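
A minimal inference sketch is shown below. It assumes the checkpoint can be loaded with the standard `transformers` summarization pipeline and that the repository ID is `hhhhzy/deltalm-base-xlsum`; both are assumptions and may need adjusting for this checkpoint.

```python
# Minimal sketch: run the model through the transformers summarization pipeline.
# The repo ID below is assumed; replace it with the actual model repository.
from transformers import pipeline

summarizer = pipeline("summarization", model="hhhhzy/deltalm-base-xlsum")

article = "..."  # a full news article in any of the supported languages
summary = summarizer(article, max_length=84, truncation=True)
print(summary[0]["summary_text"])
```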

## Dataset description

XL-Sum is a comprehensive and diverse dataset comprising 1.35 million professionally annotated article-summary pairs from the BBC, extracted using a set of carefully designed heuristics. It covers 45 languages ranging from low- to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.
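
A short sketch of loading one language split of XL-Sum with the `datasets` library; each language in the list below is a separate configuration, and the field names are taken from the dataset card.

```python
# Sketch: load one XL-Sum language configuration with the datasets library.
from datasets import load_dataset

xlsum_en = load_dataset("csebuetnlp/xlsum", "english")

print(xlsum_en)                 # train/validation/test splits
example = xlsum_en["train"][0]
print(example["title"])
print(example["summary"])       # reference summary
print(example["text"][:200])    # article body
```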

## Languages

- amharic
- arabic
- azerbaijani
- bengali
- burmese
- chinese_simplified
- chinese_traditional
- english
- french
- gujarati
- hausa
- hindi
- igbo
- indonesian
- japanese
- kirundi
- korean
- kyrgyz
- marathi
- nepali
- oromo
- pashto
- persian
- pidgin
- portuguese
- punjabi
- russian
- scottish_gaelic
- serbian_cyrillic
- serbian_latin
- sinhala
- somali
- spanish
- swahili
- tamil
- telugu
- thai
- tigrinya
- turkish
- ukrainian
- urdu
- uzbek
- vietnamese
- welsh
- yoruba

## Training hyperparameters

The model was trained on a p4d.24xlarge instance on AWS SageMaker with the following configuration:

- model: DeltaLM-base
- batch size: 8
- learning rate: 1e-5
- number of epochs: 3
- warmup steps: 500
- weight decay: 0.01
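
A sketch of how the hyperparameters above might be expressed as `transformers` `Seq2SeqTrainingArguments`; the output directory name and the per-device interpretation of the batch size are assumptions, not details confirmed by this card.

```python
# Sketch only: the hyperparameters listed above as Seq2SeqTrainingArguments.
# Output directory and per-device batch-size interpretation are assumptions.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="deltalm-base-xlsum",   # assumed name
    per_device_train_batch_size=8,     # batch size: 8
    learning_rate=1e-5,                # learning rate: 1e-5
    num_train_epochs=3,                # number of epochs: 3
    warmup_steps=500,                  # warmup steps: 500
    weight_decay=0.01,                 # weight decay: 0.01
    predict_with_generate=True,        # generate summaries during evaluation
)
```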