metadata
language:
  - ar
metrics:
  - perplexity
base_model:
  - aubmindlab/bert-base-arabert
pipeline_tag: fill-mask
datasets:
  - big_arabic_train
  - big_arabic_val
library_name: transformers
tags:
  - egyptian-arabic
  - fine-tuned
  - arabert
license: apache-2.0

EgBERT: Fine-Tuned AraBERT for Egyptian Arabic

Model Description

EgBERT is a version of the pre-trained AraBERT model fine-tuned for Egyptian Arabic. It was developed to improve performance on tasks that require understanding and predicting Egyptian dialect text, with a focus on Masked Language Modeling (MLM). Fine-tuning used a custom dataset of colloquial Egyptian Arabic, which makes the model particularly suited to casual and conversational text.

Key Features:

Training Details

  • Dataset:
    • A custom dataset of Egyptian Arabic collected from conversational text sources.
    • Preprocessed to retain common colloquial phrases and reduce noise in the data.
  • Training Setup:
    • Pre-trained model: aubmindlab/bert-base-arabert
    • Fine-tuning performed for 3 epochs with a batch size of 16.
    • Learning rate: 2e-5.
    • MLM Probability: 15%.
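The setup listed above could be expressed with the Hugging Face `Trainer` API roughly as follows. This is a minimal sketch, not the authors' training script: the `build_trainer` helper and the `egbert-finetuned` output directory are hypothetical, the batch size of 16 is assumed to be per device, and the datasets are assumed to be tokenized already.

```python
# Hyperparameters as listed in the training setup above.
HYPERPARAMS = {
    "base_model": "aubmindlab/bert-base-arabert",
    "num_train_epochs": 3,
    "batch_size": 16,          # assumed to be per device
    "learning_rate": 2e-5,
    "mlm_probability": 0.15,
}


def build_trainer(train_dataset, eval_dataset, output_dir="egbert-finetuned"):
    """Build a Trainer for MLM fine-tuning (hypothetical helper)."""
    # Imported lazily so the constants above can be read without
    # transformers installed.
    from transformers import (
        AutoModelForMaskedLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained(HYPERPARAMS["base_model"])
    model = AutoModelForMaskedLM.from_pretrained(HYPERPARAMS["base_model"])

    # The collator masks tokens dynamically each time a batch is built.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=True,
        mlm_probability=HYPERPARAMS["mlm_probability"],
    )

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=HYPERPARAMS["num_train_epochs"],
        per_device_train_batch_size=HYPERPARAMS["batch_size"],
        per_device_eval_batch_size=HYPERPARAMS["batch_size"],
        learning_rate=HYPERPARAMS["learning_rate"],
    )

    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=collator,
    )
```

Calling `build_trainer(train_ds, val_ds).train()` would then run the fine-tuning, where `train_ds` and `val_ds` stand in for the tokenized training and validation splits.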

Evaluation Results

Model Perplexity

  • Baseline Model: 36.2377
  • Fine-Tuned Model: 26.5359

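The fine-tuned model achieves a lower perplexity than the baseline AraBERT model (26.54 versus 36.24, a reduction of roughly 27%). Lower perplexity means the model assigns higher probability to the correct masked tokens, indicating better MLM performance on Egyptian Arabic.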
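The model card does not say how perplexity was computed. A common convention, assumed here, is to take the exponential of the mean cross-entropy loss over the masked tokens of the evaluation set. The sketch below shows that relationship and the relative improvement between the two reported values; the function names are illustrative.

```python
import math


def perplexity(mean_loss):
    """Perplexity as exp of the mean cross-entropy loss (in nats)."""
    return math.exp(mean_loss)


def relative_reduction(baseline, fine_tuned):
    """Fraction by which the fine-tuned value is lower than the baseline."""
    return (baseline - fine_tuned) / baseline


# Values reported in the evaluation results above.
baseline_ppl = 36.2377
fine_tuned_ppl = 26.5359

print(f"Relative perplexity reduction: "
      f"{relative_reduction(baseline_ppl, fine_tuned_ppl):.1%}")
```

With the `Trainer` API, `mean_loss` would typically be the `eval_loss` value returned by `trainer.evaluate()`.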

How to Use

Here’s an example of how to use EgBERT in your project:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the fine-tuned model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("noortamerr/EgBERT")
model = AutoModelForMaskedLM.from_pretrained("noortamerr/EgBERT")

# Input text with a masked token
text = "الكورة في مصر [MASK] حاجة كل الناس بتتابعها."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt")
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits

# Decode the top 5 predictions for the [MASK] token
mask_token_logits = predictions[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
predicted_words = [tokenizer.decode([token]) for token in top_5_tokens]

print(f"Predicted words: {predicted_words}")
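As an alternative to decoding the logits by hand, the `fill-mask` pipeline in `transformers` wraps the same steps. The sketch below reuses the model ID from the example above; the helper names `load_fill_mask` and `top_predictions` are illustrative and not part of the model's repository.

```python
def load_fill_mask(model_id="noortamerr/EgBERT"):
    """Create a fill-mask pipeline (downloads the model on first use)."""
    # Imported lazily so top_predictions can be used with any callable.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_predictions(fill_mask, text, k=5):
    """Return (token, score) pairs for a text with a single [MASK]."""
    # For one masked token the pipeline returns a flat list of dicts,
    # each with "token_str" and "score" keys.
    results = fill_mask(text, top_k=k)
    return [(r["token_str"], r["score"]) for r in results]


# Usage:
# fill_mask = load_fill_mask()
# text = "الكورة في مصر [MASK] حاجة كل الناس بتتابعها."
# for word, score in top_predictions(fill_mask, text):
#     print(word, round(score, 4))
```

Passing the pipeline in as an argument keeps the formatting logic separate from model loading, so the pipeline can be created once and reused across many inputs.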


@misc{EgBERT,
  author = {Noor Tamer and Roba Mahmoud and Orchid Hazem},
  title = {EgBERT: Fine-Tuned AraBERT for Egyptian Arabic},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/noortamerr/EgBERT}
}