Sentiment Analysis for Darija (Arabic Dialect)

This repository hosts a Sentiment Analysis model for Darija (Moroccan Arabic dialect), built using BERT. The model is fine-tuned to classify text into two categories: positive and negative sentiment. It is designed to facilitate sentiment analysis in applications involving Darija text data, such as social media analysis, customer feedback, or market research.


Model Details

  • Base Model: SI2M-Lab/DarijaBERT
  • Task: Sentiment Classification (Binary)
  • Architecture: BERT with a custom classification head and dropout regularization (0.3 dropout rate); a rough sketch of this head appears after this list.
  • Fine-Tuning Data: Dataset of labeled Darija text samples (positive and negative).
  • Max Sequence Length: 128 tokens
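
The fine-tuned checkpoint loads directly through AutoModelForSequenceClassification (see below), so you normally do not need to rebuild the head yourself. For reference, the snippet below is only a rough sketch of what a DarijaBERT encoder with a 0.3-dropout classification head typically looks like in PyTorch; the class and variable names are illustrative and not taken from this repository:

import torch.nn as nn
from transformers import AutoModel

class DarijaSentimentClassifier(nn.Module):   # illustrative name, not from the repo
    def __init__(self, num_labels=2, dropout_rate=0.3):
        super().__init__()
        # DarijaBERT encoder as the base model
        self.bert = AutoModel.from_pretrained("SI2M-Lab/DarijaBERT")
        self.dropout = nn.Dropout(dropout_rate)
        # Linear head mapping the [CLS] representation to the two sentiment classes
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_repr = outputs.last_hidden_state[:, 0]   # [CLS] token representation
        return self.classifier(self.dropout(cls_repr))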

How to Use

Load the Model and Tokenizer

To use this model for sentiment analysis, you can load it using the Transformers library:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("BenhamdaneNawfal/sentiment-analysis-darija")
model = AutoModelForSequenceClassification.from_pretrained("BenhamdaneNawfal/sentiment-analysis-darija")
model.eval()  # inference mode (disables dropout)

# Example text ("This product is really great")
test_text = "هذا المنتج رائع جدا"

# Tokenize the text
inputs = tokenizer(test_text, return_tensors="pt", truncation=True, padding=True, max_length=128)

# Get model predictions (no gradients needed for inference)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predicted_class = logits.argmax(dim=-1).item()

print(f"Predicted class: {predicted_class}")

Output Classes

  • 0: Negative
  • 1: Positive
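
Building on the inference snippet above, the integer prediction can be mapped back to a human-readable label together with a confidence score. The label map below simply mirrors the class assignments listed here:

import torch

# Map class indices to labels (see the list above)
id2label = {0: "negative", 1: "positive"}

probs = torch.softmax(logits, dim=-1)            # convert logits to probabilities
label = id2label[predicted_class]
confidence = probs[0, predicted_class].item()

print(f"Sentiment: {label} (confidence: {confidence:.2f})")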

Fine-Tuning Process

The model was fine-tuned with the following settings; a sketch of a comparable training setup follows the list:

  • Dataset: A dataset of Darija text labeled for sentiment.
  • Loss Function: Cross-entropy loss for binary classification.
  • Optimizer: AdamW with weight decay (0.01).
  • Learning Rate: 5e-5 with linear warmup.
  • Batch Size: 16 for training, 64 for evaluation.
  • Early Stopping: Training stops when the validation loss fails to improve for 1 epoch (patience of 1).
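
The training code itself is not included in this card. The sketch below shows one way the hyperparameters above could be wired together with the Hugging Face Trainer; the dataset variables are placeholders, the warmup ratio and epoch count are assumptions, and the author's actual setup may differ:

from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="darija-bert-model",
    learning_rate=5e-5,                 # linear decay schedule is the Trainer default
    warmup_ratio=0.1,                   # assumption: exact warmup length is not documented
    weight_decay=0.01,                  # AdamW with weight decay is the Trainer default optimizer
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=3,                 # assumption: early stopping usually ends training sooner
    eval_strategy="epoch",              # called evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,        # placeholder: tokenized Darija training split
    eval_dataset=eval_dataset,          # placeholder: tokenized validation split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],
)
trainer.train()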

Evaluation Metrics

The model's performance was evaluated using the following metrics (a generic snippet for computing them is shown after the list):

  • Accuracy: 0.80
  • Precision: 0.81
  • Recall: 0.79
  • F1-Score: 0.80
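
These metrics can be computed for new predictions with scikit-learn; the snippet below is a generic evaluation sketch (not the author's original evaluation script) and assumes y_true and y_pred are lists of 0/1 labels collected from the model:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true: gold labels, y_pred: model predictions (both lists of 0s and 1s)
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)   # positive class = 1
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1-Score:  {f1:.2f}")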

Publishing on Hugging Face

The model and tokenizer were saved and uploaded to Hugging Face using the huggingface_hub library. To reproduce or fine-tune this model, follow these steps:

  1. Save the model and tokenizer:
    model.save_pretrained("darija-bert-model")
    tokenizer.save_pretrained("darija-bert-model")
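  2. Push the saved files to the Hub. The upload call is not shown in the original card; the lines below are a minimal sketch using the standard push_to_hub API and assume you are already logged in (for example via huggingface-cli login), with the repository name taken from the usage example above:
    # Upload model and tokenizer to the Hugging Face Hub (sketch, assumes prior login)
    model.push_to_hub("BenhamdaneNawfal/sentiment-analysis-darija")
    tokenizer.push_to_hub("BenhamdaneNawfal/sentiment-analysis-darija")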
    

Future Work

  • Expand the dataset to include more labeled examples from diverse sources.
  • Fine-tune the model for multi-class sentiment analysis (e.g., neutral, positive, negative).
  • Explore the use of data augmentation techniques for better generalization.

Citation

If you use this model, please cite it as:

@misc{benhamdanenawfal2025darijabert,
  author = {Benhamdane Nawfal},
  title = {Sentiment Analysis for Darija (Arabic Dialect)},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/BenhamdaneNawfal/sentiment-analysis-darija}
}

Contact

For any questions or issues, feel free to contact me at: [[email protected]].
