---
license: apache-2.0
datasets:
- arbml/SANAD
language:
- ar
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: text-classification
library_name: transformers
tags:
- modernbert
- arabic
---

# AraModernBert For Topic Classification

## Overview

> [!NOTE]
> This is an experimental Arabic model that demonstrates how ModernBERT can be adapted to Arabic for tasks such as topic classification.

This is an experimental **Arabic** version of [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base), fine-tuned **only on the topic-classification task**. It uses the original ModernBERT base architecture together with a custom Arabic tokenizer trained on the following data:
- **Dataset:** Arabic Wikipedia
- **Size:** 1.8 GB
- **Tokens:** 228,788,529

## Model Eval Details
- **Epochs:** 3
- **Evaluation Metrics:**
  - **F1 Score:** 0.95
  - **Loss:** 0.1998
- **Training Steps:** 47,862
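The reported F1 score is the harmonic mean of precision and recall. As a quick illustration (the precision/recall values below are hypothetical, chosen only to land near the reported 0.95):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical values for illustration: precision 0.94, recall 0.96
print(round(f1_score(0.94, 0.96), 4))  # close to the reported 0.95
```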

## Dataset Used for Training

- The [SANAD dataset](https://huggingface.co/datasets/arbml/SANAD) was used for training and testing. It covers seven topics: Politics, Finance, Medical, Culture, Sports, Tech, and Religion.

## How to Use

The model can be used for text classification using the `transformers` library. Below is an example:

```python
from transformers import pipeline

# Load model from huggingface.co/models using our repository ID
classifier = pipeline(
    task="text-classification",
    model="Omartificial-Intelligence-Space/AraModernBert-Topic-Classifier",
)

sample = '''
PUT SOME TEXT HERE TO CLASSIFY ITS TOPIC
'''
classifier(sample)

# [{'label': 'health', 'score': 0.6779336333274841}]

```
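The `score` the pipeline returns is a softmax over the model's raw logits. A minimal pure-Python sketch of that step (the logit values and the label ordering below are hypothetical, for illustration only):

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits and label ordering for the 7 SANAD topics
labels = ["culture", "finance", "health", "politics", "religion", "sports", "tech"]
logits = [0.1, -0.5, 2.3, 0.4, -1.2, 0.0, 0.8]

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
print({"label": labels[best], "score": round(probs[best], 4)})
```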

## Test Phase Results

- The model was evaluated on a test set of 14,181 examples covering the different topics; their distribution is shown below:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/n5F6_HHs9shUaHLxGirQb.png)

- The model achieved the following accuracy for predictions on this test set:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/qAsAfETcFHWK55p894kyM.png)
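Per-topic accuracy of this kind can be computed from parallel lists of gold and predicted labels. A small sketch (the label lists below are hypothetical, not the actual test data):

```python
from collections import Counter

def per_topic_accuracy(y_true, y_pred):
    """Accuracy per topic from parallel lists of gold and predicted labels."""
    correct = Counter()
    total = Counter()
    for gold, pred in zip(y_true, y_pred):
        total[gold] += 1
        if gold == pred:
            correct[gold] += 1
    return {topic: correct[topic] / total[topic] for topic in total}

# Hypothetical gold/predicted labels for illustration
y_true = ["sports", "sports", "tech", "politics", "tech"]
y_pred = ["sports", "tech", "tech", "politics", "tech"]
print(per_topic_accuracy(y_true, y_pred))
```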


## Citation

```bibtex
@misc{modernbert,
      title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference}, 
      author={Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
      year={2024},
      eprint={2412.13663},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13663}, 
}
```