VBART Model Card

Model Description

VBART is the first sequence-to-sequence LLM pre-trained on Turkish corpora from scratch on a large scale. It was pre-trained by VNGRS in February 2023.
When fine-tuned, the model can perform conditional text generation tasks such as text summarization, paraphrasing, and title generation. It outperforms its multilingual counterparts despite being much smaller than they are.

VBART-XLarge is created by adding extra Transformer layers between the layers of VBART-Large. This lets it reuse the weights learned by the smaller model while doubling the number of layers. VBART-XLarge improves on the results of VBART-Large, albeit by small margins.
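The layer-doubling idea can be illustrated with a minimal sketch. This is not the authors' code: the function name `interleave_layers`, the 12-layer starting depth, and the choice to place each new layer directly after a pretrained one are all assumptions made for illustration. The card does not say how the inserted layers are initialized, so the sketch leaves that to a caller-supplied factory.

```python
def interleave_layers(pretrained_layers, make_new_layer):
    """Return a stack twice as deep: each pretrained layer followed by a new one."""
    expanded = []
    for index, layer in enumerate(pretrained_layers):
        expanded.append(layer)                         # weights reused from the smaller model
        expanded.append(make_new_layer(index, layer))  # inserted layer (initialization is an assumption)
    return expanded


# Strings stand in for real Transformer layers; 12 is an illustrative depth.
large = [f"large_layer_{i}" for i in range(12)]
xlarge = interleave_layers(large, lambda i, src: f"new_layer_{i}")
print(len(large), "->", len(xlarge))
```

Every second entry of the expanded stack is an unchanged layer from the smaller model, which is what allows the learned weights to carry over.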

This repository contains the fine-tuned TensorFlow and Safetensors weights of VBART for the text summarization task.

  • Developed by: VNGRS-AI
  • Model type: Transformer encoder-decoder based on mBART architecture
  • Language(s) (NLP): Turkish
  • License: CC BY-NC-SA 4.0
  • Finetuned from: VBART-XLarge
  • Paper: arXiv:2403.01308

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Restrict the tokenizer output to the inputs that generate() accepts
tokenizer = AutoTokenizer.from_pretrained(
    "vngrs-ai/VBART-XLarge-Summarization",
    model_input_names=["input_ids", "attention_mask"],
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "vngrs-ai/VBART-XLarge-Summarization",
    # device_map="auto",  # uncomment for inference on GPU
)

input_text = "..."

# Append .to("cuda") to the tokenizer call when the model is on GPU
token_input = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**token_input)
print(tokenizer.decode(outputs[0]))

Training Details

Training Data

The base model is pre-trained on vngrs-web-corpus, which is curated by cleaning and filtering the Turkish parts of the OSCAR-2201 and mC4 datasets. These datasets consist of documents from unstructured web crawl data. More information about the datasets can be found on their respective pages. The data is filtered using a set of heuristics and rules, explained in the appendix of our paper.

The fine-tuning data consists of the Turkish sections of the MLSum, TRNews, XLSum, and Wikilingua datasets.

Limitations

This model is fine-tuned for the summarization task. It is not intended to be used for any other purpose and cannot be fine-tuned for another task with the full performance of the base model. It is also not guaranteed that this model will work without the specified prompts.

Training Procedure

Pre-trained for 30 days on a total of 708B tokens. Fine-tuned for 20 epochs.

Hardware

  • GPUs: 8 x Nvidia A100-80 GB
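As a rough worked example, the figures above imply an average pre-training throughput. The calculation below assumes that all 8 GPUs were used continuously for the full 30 days of pre-training, which the card does not state explicitly, so the result is an order-of-magnitude estimate only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

total_tokens = 708e9   # pre-training tokens, from the card
training_days = 30     # pre-training duration, from the card
num_gpus = 8           # assumption: all listed GPUs were used for pre-training

tokens_per_second = total_tokens / (training_days * SECONDS_PER_DAY)
tokens_per_second_per_gpu = tokens_per_second / num_gpus

print(f"{tokens_per_second:,.0f} tokens/s overall")
print(f"{tokens_per_second_per_gpu:,.0f} tokens/s per GPU")
```

Under these assumptions the average comes to roughly 273k tokens per second overall, or about 34k tokens per second per GPU.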

Software

  • TensorFlow

Hyperparameters

Pretraining
  • Training regime: fp16 mixed precision
  • Training objective: Sentence permutation and span masking (mask lengths sampled from a Poisson distribution with λ=3.5, masking 30% of tokens)
  • Optimizer: Adam optimizer (β1 = 0.9, β2 = 0.98, ε = 1e-6)
  • Scheduler: Custom scheduler from the original Transformer paper (20,000 warm-up steps)
  • Dropout: 0.1 (dropped to 0.05 and then to 0 in the last 165k and 205k steps, respectively)
  • Initial Learning rate: 5e-6
  • Training tokens: 708B
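
The pre-training objective listed above can be sketched in plain Python. This is a simplified illustration in the style of BART text infilling, not the authors' implementation. The `<mask>` string, the function names, the replacement of each span by a single mask symbol, and the skipping of zero-length spans are assumptions; only λ=3.5 and the 30% masking ratio come from the list above.

```python
import math
import random

MASK = "<mask>"  # placeholder symbol; the real tokenizer's mask token may differ


def sample_poisson(rng, lam):
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def permute_sentences(sentences, rng):
    """Return the sentences of a document in random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def mask_spans(tokens, rng, mask_ratio=0.30, lam=3.5):
    """Replace random non-overlapping spans with one MASK each, covering up to mask_ratio of tokens."""
    budget = round(len(tokens) * mask_ratio)
    covered = [False] * len(tokens)
    span_at = {}  # span start index -> span length
    attempts = 0
    while budget > 0 and attempts < 10 * len(tokens):
        attempts += 1
        length = min(sample_poisson(rng, lam), budget)
        if length == 0:
            continue  # simplification: zero-length spans are skipped
        start = rng.randrange(0, len(tokens) - length + 1)
        if any(covered[start:start + length]):
            continue  # overlaps an earlier span, try again
        covered[start:start + length] = [True] * length
        span_at[start] = length
        budget -= length

    corrupted, i = [], 0
    while i < len(tokens):
        if i in span_at:
            corrupted.append(MASK)  # the whole span collapses to one mask symbol
            i += span_at[i]
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted


rng = random.Random(0)
sentences = ["Bu ilk cümle.", "Bu ikinci cümle.", "Bu üçüncü cümle."]
tokens = " ".join(permute_sentences(sentences, rng)).split()
print(mask_spans(tokens, rng))
```

During pre-training the model would receive the corrupted sequence as encoder input and learn to reconstruct the original document.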
Fine-tuning
  • Training regime: fp16 mixed precision
  • Optimizer: Adam optimizer (β1 = 0.9, β2 = 0.98, ε = 1e-6)
  • Scheduler: Linear decay scheduler
  • Dropout: 0.1
  • Learning rate: 1e-5
  • Fine-tune epochs: 20
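
The two learning-rate schedules named above can be sketched as follows. The warm-up formula is the one given in the original Transformer paper. The model width `d_model=1024` is an assumption, and the card does not say how the listed initial learning rate of 5e-6 enters the pre-training schedule, so no such scaling is applied here. For fine-tuning, decaying linearly from 1e-5 to zero over `total_steps` is likewise an assumption, since the card only names a linear decay.

```python
def transformer_warmup_lr(step, d_model=1024, warmup_steps=20_000):
    """Schedule from the original Transformer paper: linear warm-up, then inverse square-root decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


def linear_decay_lr(step, total_steps, peak_lr=1e-5):
    """Decay linearly from peak_lr at step 0 to zero at total_steps (end value is an assumption)."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return peak_lr * remaining


for step in (1_000, 20_000, 80_000):
    print(step, transformer_warmup_lr(step))
print(linear_decay_lr(500, total_steps=1_000))
```

The warm-up schedule peaks at exactly `warmup_steps` and decays afterwards, while the linear schedule halves the learning rate by the midpoint of fine-tuning.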

Metrics

[Figure: evaluation metrics (image not reproduced)]

Citation

@article{turker2024vbart,
  title={VBART: The Turkish LLM},
  author={Turker, Meliksah and Ari, Erdi and Han, Aydin},
  journal={arXiv preprint arXiv:2403.01308},
  year={2024}
}

Model size: 740M params (Safetensors, tensor type F32)