Model Card: banELECTRA-Base
Model Details
The banELECTRA-Base model is a Bangla adaptation of ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately), a pre-training method for language models introduced by researchers at Google. ELECTRA is trained with a replaced token detection objective, which differs from the traditional masked language modeling (MLM) used by models such as BERT: a small generator corrupts the input by substituting some tokens, and a discriminator learns to classify every token as original or replaced. After pre-training, only the discriminator is kept and fine-tuned on downstream tasks, making ELECTRA a more compute-efficient alternative to BERT that matches or exceeds its performance at comparable model sizes.
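As an illustration of replaced token detection, the sketch below loads the discriminator through Hugging Face's `ElectraForPreTraining` head and scores each token of a sentence as original or replaced. It assumes the published banELECTRA-Base checkpoint can be loaded with that head; if the pre-training head weights are not included in the release, the scores are not meaningful and the snippet only demonstrates the objective.

```python
import torch
from transformers import ElectraTokenizer, ElectraForPreTraining

# Illustrative sketch of replaced token detection (RTD).
# Assumption: the banELECTRA-Base checkpoint can be loaded with the
# ElectraForPreTraining (discriminator) head.
model_name = "banglagov/banELECTRA-Base"
tokenizer = ElectraTokenizer.from_pretrained(model_name)
discriminator = ElectraForPreTraining.from_pretrained(model_name)

sentence = "এর ফলে আগামী বছর বেকারত্বের হার বৃদ্ধি এবং অর্থনৈতিক মন্দার আশঙ্কায় ইউরোপীয় ইউনিয়ন ।"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = discriminator(**inputs).logits  # one score per input token

# Probability that the RTD head assigns to each token having been replaced.
replaced_prob = torch.sigmoid(logits)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, p in zip(tokens, replaced_prob[0]):
    print(f"{token}\t{p.item():.3f}")
```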
The banELECTRA-Base model is tailored for Bangla text and can be fine-tuned for downstream tasks such as Named Entity Recognition (NER), Part-of-Speech (POS) tagging, Sentence Similarity, and Paraphrase Identification.
The model was trained on two NVIDIA A40 GPUs.
Training Data
The banELECTRA-Base model was pre-trained on a 32 GB Bangla text dataset. Below are the dataset statistics:
- Total Words: ~1.996 billion
- Unique Words: ~21.24 million
- Total Sentences: ~165.38 million
- Total Documents: ~15.62 million
Model Architecture and Training
The banELECTRA-Base model was trained using the official ELECTRA repository, with hyperparameters selected to optimize performance on Bangla text. The model uses a vocabulary size of 50,000 tokens, and the discriminator consists of 12 hidden layers with 768 hidden dimensions and 12 attention heads. The generator is scaled to one-third the size of the discriminator, and training was conducted with a maximum sequence length of 256. Training employed a batch size of 96, a learning rate of 0.0004 with 10,000 warm-up steps, and a total of 1,000,000 training steps. Regularization techniques, such as a dropout rate of 0.1 and a weight decay of 0.01, were applied to improve generalization.
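For orientation, the discriminator hyperparameters above map roughly onto the Hugging Face `ElectraConfig` sketched below. This is not the configuration file used for training: fields not listed in this card (embedding and feed-forward sizes) are filled in with standard ELECTRA-Base values as assumptions, and the optimizer settings appear only as comments because they are not part of `ElectraConfig`.

```python
from transformers import ElectraConfig

# Approximate discriminator configuration, reconstructed from the hyperparameters
# listed above. embedding_size and intermediate_size are assumptions taken from
# the standard ELECTRA-Base setup; they are not stated in this card.
discriminator_config = ElectraConfig(
    vocab_size=50000,
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    embedding_size=768,            # assumed ELECTRA-Base default
    intermediate_size=3072,        # assumed ELECTRA-Base default
    max_position_embeddings=256,   # maximum sequence length used in pre-training
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
)

# Optimization settings reported for pre-training (not ElectraConfig fields):
# batch size 96, learning rate 4e-4, 10,000 warm-up steps, 1,000,000 training
# steps, weight decay 0.01; the generator is one-third the discriminator size.
print(discriminator_config)
```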
How to Use
```python
from transformers import ElectraTokenizer, ElectraForSequenceClassification

model_name = "banglagov/banELECTRA-Base"

# Load the tokenizer and the discriminator with a sequence-classification head;
# the classification head is randomly initialised and needs task-specific fine-tuning.
tokenizer = ElectraTokenizer.from_pretrained(model_name)
model = ElectraForSequenceClassification.from_pretrained(model_name)

text = "এর ফলে আগামী বছর বেকারত্বের হার বৃদ্ধি এবং অর্থনৈতিক মন্দার আশঙ্কায় ইউরোপীয় ইউনিয়ন ।"

# Tokenize the input sentence and inspect the encoded ids.
inputs = tokenizer(text, return_tensors="pt")
print("Input token ids:", inputs["input_ids"])
```
Experimental Results
The banELECTRA-Base model demonstrates strong performance on downstream tasks, as shown below:
| Task | Precision | Recall | F1 |
|---|---|---|---|
| Named Entity Recognition (NER) | 0.8842 | 0.7930 | 0.8249 |
| Part-of-Speech (POS) Tagging | 0.8757 | 0.8717 | 0.8706 |
These results were obtained using the banELECTRA-Base model combined with a Noisy Label model architecture.