
Model Card: banELECTRA-Base

Model Details

The banELECTRA-Base model is a Bangla adaptation of ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately), a pre-training method for language models introduced by researchers at Google. Instead of the masked language modeling (MLM) objective used by BERT, ELECTRA trains with replaced token detection: a small generator network proposes plausible substitutes for masked tokens, and a discriminator learns to classify every token in the sequence as original or replaced. After pre-training, only the discriminator is fine-tuned on downstream tasks, making ELECTRA a more compute-efficient alternative to BERT.
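To make the discriminator side of this objective concrete, the minimal sketch below scores each token of a sentence as original or replaced using the transformers ElectraForPreTraining class. This assumes the published checkpoint includes the binary replaced-token-detection head; if it does not, that head will be freshly initialized and the scores will be meaningless until trained.

import torch
from transformers import ElectraTokenizer, ElectraForPreTraining

# Load the Bangla discriminator; the replaced-token-detection head is
# assumed to be present in the checkpoint (otherwise it is re-initialized).
tokenizer = ElectraTokenizer.from_pretrained("banglagov/banELECTRA-Base")
discriminator = ElectraForPreTraining.from_pretrained("banglagov/banELECTRA-Base")

sentence = "এর ফলে আগামী বছর বেকারত্বের হার বৃদ্ধি এবং অর্থনৈতিক মন্দার আশঙ্কায় ইউরোপীয় ইউনিয়ন ।"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = discriminator(**inputs).logits  # one score per token position

# A positive logit means the discriminator judges that token to be replaced.
predictions = (logits > 0).long()
print(predictions)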
The banELECTRA-Base model is tailored to Bangla text and can be fine-tuned for tasks such as Named Entity Recognition (NER), Part-of-Speech (POS) tagging, Sentence Similarity, and Paraphrase Identification. The model was trained on two NVIDIA A40 GPUs.

Training Data

The banELECTRA-Base model was pre-trained on a 32 GB Bangla text dataset. Below are the dataset statistics:

  • Total Words: ~1.996 billion
  • Unique Words: ~21.24 million
  • Total Sentences: ~165.38 million
  • Total Documents: ~15.62 million

Model Architecture and Training

The banELECTRA-Base model was trained using the official ELECTRA repository with carefully selected hyperparameters to optimize performance for Bangla text. The model uses a vocabulary size of 50,000 tokens, and the discriminator consists of 12 hidden layers with 768 hidden dimensions and 12 attention heads. The generator is scaled to one-third the size of the discriminator, and training is conducted with a maximum sequence length of 256. The training employed a batch size of 96, a learning rate of 0.0004 with 10,000 warm-up steps, and a total of 1,000,000 training steps. Regularization techniques, such as a dropout rate of 0.1 and a weight decay of 0.01, were applied to improve generalization.
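For reference, below is a rough transformers-side rendering of these architectural hyperparameters as ElectraConfig objects. The intermediate sizes and the generator's head count are assumptions scaled from the discriminator; pre-training itself was run with the official TensorFlow ELECTRA repository, not these configs.

from transformers import ElectraConfig

# Discriminator: 12 layers, 768 hidden dims, 12 heads, 50k vocab (from the card)
discriminator_config = ElectraConfig(
    vocab_size=50000,
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=3072,       # assumed: the conventional 4x hidden size
    hidden_dropout_prob=0.1,      # dropout rate from the card
    max_position_embeddings=512,  # training used sequences of up to 256 tokens
)

# Generator: one-third of the discriminator's width (head count is an assumption)
generator_config = ElectraConfig(
    vocab_size=50000,
    num_hidden_layers=12,
    hidden_size=256,
    num_attention_heads=4,
    intermediate_size=1024,
    hidden_dropout_prob=0.1,
    max_position_embeddings=512,
)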

How to Use

from transformers import ElectraTokenizer, ElectraModel
import torch

model_name = "banglagov/banELECTRA-Base"
tokenizer = ElectraTokenizer.from_pretrained(model_name)
model = ElectraModel.from_pretrained(model_name)  # base encoder, no task-specific head

text = "এর ফলে আগামী বছর বেকারত্বের হার বৃদ্ধি এবং অর্থনৈতিক মন্দার আশঙ্কায় ইউরোপীয় ইউনিয়ন ।"

# Tokenize the sentence into input ids and an attention mask
inputs = tokenizer(text, return_tensors="pt")
print("Input token ids:", inputs["input_ids"])

# Encode the sentence into contextual embeddings
with torch.no_grad():
    outputs = model(**inputs)
print("Last hidden state shape:", outputs.last_hidden_state.shape)
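Since the card highlights token-level tasks such as NER and POS tagging, a hedged fine-tuning entry point is sketched below, continuing from the snippet above. The label count is a hypothetical placeholder, and the classification head starts randomly initialized until it is trained on labeled data.

from transformers import ElectraForTokenClassification

num_labels = 7  # hypothetical tag-set size; use your dataset's label count
ner_model = ElectraForTokenClassification.from_pretrained(model_name, num_labels=num_labels)

# Per-token logits over the tag set; the head must be fine-tuned before use
logits = ner_model(**inputs).logits
print("Token-level logits shape:", logits.shape)  # (batch, sequence_length, num_labels)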

Experimental Results

The banELECTRA-Base model demonstrates strong performance on downstream tasks, as shown below:

| Task                           | Precision | Recall | F1     |
|--------------------------------|-----------|--------|--------|
| Named Entity Recognition (NER) | 0.8842    | 0.7930 | 0.8249 |
| Part-of-Speech (POS) Tagging   | 0.8757    | 0.8717 | 0.8706 |

These results were obtained by pairing the banELECTRA-Base encoder with a Noisy Label model architecture.
