BERT base for Dhivehi

Pretrained model on Dhivehi language using masked language modeling (MLM).

Tokenizer

The WordPiece tokenizer uses several components:

  • Normalization: lowercase and then NFKD unicode normalization.
  • Pretokenization: splits by whitespace and punctuation.
  • Postprocessing: single sentences are output in format [CLS] sentence A [SEP] and pair sentences in format [CLS] sentence A [SEP] sentence B [SEP].

Training

Training was performed over 16M+ Dhivehi sentences/paragraphs put together by @ashraq. An Adam optimizer with weighted decay was used with following parameters:

  • Learning rate: 1e-5
  • Weight decay: 0.1
  • Warmup steps: 10% of data
Downloads last month
5
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.