---
language:
- ar  # Arabic
metrics:
- perplexity  # Metric used to evaluate the model
base_model:
- aubmindlab/bert-base-arabert  # The original base model used
pipeline_tag: fill-mask  # The task this model performs (masked language modeling)
datasets:
- big_arabic_train  # Dataset used for training
- big_arabic_val  # Dataset used for validation
library_name: transformers  # Framework used (Hugging Face Transformers)
tags:
- egyptian-arabic
- fine-tuned
- arabert
license: apache-2.0
---

# EgBERT: Fine-Tuned AraBERT for Egyptian Arabic

## Model Description

EgBERT is a fine-tuned version of the pre-trained AraBERT model, adapted to Egyptian Arabic. It was developed to improve performance on tasks that require understanding and generating Egyptian-dialect text, with a focus on Masked Language Modeling (MLM). Fine-tuning used a custom dataset of colloquial Egyptian Arabic, which makes the model particularly well suited to casual and conversational text.

Key Features:
- Based on **[aubmindlab/bert-base-arabert](https://huggingface.co/aubmindlab/bert-base-arabert)**.
- Fine-tuned specifically for **Egyptian Arabic**.
- Optimized for **Masked Language Modeling (MLM)** tasks.

## Training Details

- **Dataset**:
  - A custom dataset of Egyptian Arabic collected from conversational text sources.
  - Preprocessed to include common colloquial phrases and to reduce noise in the data.
- **Training Setup**:
  - Pre-trained model: `aubmindlab/bert-base-arabert`
  - Fine-tuning for 3 epochs with a batch size of 16.
  - Learning rate: 2e-5.
  - MLM probability: 15%.

A minimal code sketch of this setup is given at the end of this card.

## Evaluation Results

### Model Perplexity

- **Baseline Model**: 36.2377
- **Fine-Tuned Model**: 26.5359

The fine-tuned model achieves lower perplexity than the baseline AraBERT model, indicating better performance on MLM tasks in Egyptian Arabic.

## How to Use

Here's an example of how to use EgBERT in your project:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the fine-tuned model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("noortamerr/EgBERT")
model = AutoModelForMaskedLM.from_pretrained("noortamerr/EgBERT")

# Input text with a masked token
text = "الكورة في مصر [MASK] حاجة كل الناس بتتابعها."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt")
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits

# Decode the top 5 predictions for the [MASK] token
mask_token_logits = predictions[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
predicted_words = [tokenizer.decode([token]) for token in top_5_tokens]

print(f"Predicted words: {predicted_words}")
```

## Citation

```bibtex
@misc{EgBERT,
  author    = {Noor Tamer and Roba Mahmoud and Orchid Hazem},
  title     = {EgBERT: Fine-Tuned AraBERT for Egyptian Arabic},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/noortamerr/EgBERT}
}
```
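## Reproducing the Fine-Tuning Setup

The original training script is not published, so the sketch below only illustrates the setup reported in the Training Details section (3 epochs, batch size 16, learning rate 2e-5, 15% MLM probability) using the Hugging Face `Trainer`. The file names `big_arabic_train.txt` / `big_arabic_val.txt`, the plain-text file format, and the 128-token maximum sequence length are assumptions; substitute your own Egyptian Arabic corpus.

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the same base model that EgBERT was fine-tuned from
model_name = "aubmindlab/bert-base-arabert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Hypothetical local text files standing in for the custom Egyptian Arabic corpus
dataset = load_dataset(
    "text",
    data_files={"train": "big_arabic_train.txt", "validation": "big_arabic_val.txt"},
)

def tokenize(batch):
    # The 128-token maximum length is an assumption, not a value from this card
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking with the 15% MLM probability reported above
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Hyperparameters from the Training Details section
args = TrainingArguments(
    output_dir="EgBERT",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,
)

trainer.train()

# Perplexity, as reported in Evaluation Results, is exp(mean masked-LM loss)
eval_loss = trainer.evaluate()["eval_loss"]
print(f"Validation perplexity: {torch.exp(torch.tensor(eval_loss)):.4f}")
```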