---
language: ar
datasets:
- Marefa-NER
---

# Tebyan تبيـان
# Marefa Arabic Named Entity Recognition Model
# نموذج المعرفة لتصنيف أجزاء النص

## Model description

**Marefa-NER** is a large Arabic named entity recognition (NER) model trained on a completely new dataset. It extracts up to 9 different entity types:
```
Person, Location, Organization, Nationality, Job, Product, Event, Time, Art-Work
```

نموذج المعرفة لتصنيف أجزاء النص. نموذج جديد كليا من حيث البيانات المستخدمة في تدريب النموذج. 
كذلك يستهدف النموذج تصنيف حتى 9 أنواع مختلفة من أجزاء النص
```
شخص - مكان - منظمة - جنسية - وظيفة - منتج - حدث - توقيت - عمل إبداعي
```

## How to use كيف تستخدم النموذج

Install `transformers` and `nltk` (Python >= 3.6):

`$ pip3 install transformers==4.3.0 nltk==3.5 protobuf==3.15.3 torch==1.7.1`

> If you are using `Google Colab`, please restart your runtime after installing the packages.

-----------

```python
from collections import defaultdict

# NLTK's punkt data is required for word tokenization
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# ===== import the model
m_name = "marefa-nlp/marefa-ner"
tokenizer = AutoTokenizer.from_pretrained(m_name)
model = AutoModelForTokenClassification.from_pretrained(m_name)

# ===== build the NER pipeline
nlp = pipeline("ner", model=model, tokenizer=tokenizer, grouped_entities=True)

# ===== extract the entities from a sample text
example = 'قاد عمر المختار القوات في ليبيا ضد الجيش الإيطالي'
# tokenize with NLTK and re-join with single spaces
example = " ".join(word_tokenize(example))
# run the NER pipeline on the cleaned text
ner_results = nlp(example)

# The grouped_entities option does not perform well on Arabic,
# so the code below merges adjacent pieces of the same entity type into full entities

grouped_ner_results = defaultdict(list)
fixed_ner_results = []
for ent in ner_results:
  grouped_ner_results[ent['entity_group']].append(ent)


for group, ents in grouped_ner_results.items():
  if len(ents) == 1:
    fixed_ner_results.append(ents[0])
    continue
  
  current_ent = {"word": ents[0]['word'], "start": ents[0]['start'], "end": ents[0]['end'], "entity_group": group, "score": ents[0]['score']}
  for i in range(1, len(ents)):
    if ents[i]['start'] == current_ent["end"]:
      current_ent["word"] += ents[i]['word']
      current_ent["end"] = ents[i]['end']
      current_ent["score"] = max(ents[i]['score'], current_ent["score"])
    else:
      fixed_ner_results.append(current_ent)
      current_ent = {"word": ents[i]['word'], "start": ents[i]['start'], "end": ents[i]['end'], "entity_group": group, "score": ents[i]['score']}
  
  fixed_ner_results.append(current_ent)
  
# sort entities by their position in the text
fixed_ner_results = sorted(fixed_ner_results, key=lambda e: e['start'])

# ===== print the merged results
for ent in fixed_ner_results:
  print(ent["word"], '->', ent['entity_group'], " # score:", "%.2f" % ent['score'])
  
# Expected output:
# عمر المختار -> person  # score: 1.00
# ليبيا -> location  # score: 0.99
# الجيش الإيطالي -> organization  # score: 0.99

```
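
The merging step in the snippet above can also be packaged as a reusable function. The following is a minimal sketch, not part of the Marefa-NER release: `merge_adjacent_entities` is a hypothetical helper name, and the sample entries (words, offsets, scores) are made up for illustration so the logic can be tried without downloading the model. It assumes each entry carries the `entity_group`, `word`, `start`, `end`, and `score` keys used above.

```python
from typing import Dict, List


def merge_adjacent_entities(entities: List[Dict]) -> List[Dict]:
    """Merge pieces that share an entity type and touch end-to-start.

    Hypothetical helper mirroring the fixing loop above.
    """
    merged: List[Dict] = []
    # last entity seen for each entity type
    last_by_group: Dict[str, Dict] = {}
    for ent in sorted(entities, key=lambda e: e["start"]):
        group = ent["entity_group"]
        current = last_by_group.get(group)
        if current is not None and ent["start"] == current["end"]:
            # the piece starts exactly where the previous one ended: extend it
            current["word"] += ent["word"]
            current["end"] = ent["end"]
            current["score"] = max(current["score"], ent["score"])
        else:
            # copy so the caller's input is left untouched
            current = dict(ent)
            last_by_group[group] = current
            merged.append(current)
    return merged


# made-up sample pieces (illustrative offsets and scores, not real model output)
sample = [
    {"entity_group": "person", "word": "عمر", "start": 4, "end": 7, "score": 0.99},
    {"entity_group": "person", "word": " المختار", "start": 7, "end": 15, "score": 1.0},
    {"entity_group": "location", "word": "ليبيا", "start": 26, "end": 31, "score": 0.99},
]

for ent in merge_adjacent_entities(sample):
    print(ent["word"], "->", ent["entity_group"])
```

With real pipeline output, the same function could be called as `merge_adjacent_entities(ner_results)` in place of the inline loop. Unlike that loop, it keeps any extra keys present on the first piece of each merged entity.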

## Acknowledgment شكر و تقدير

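The data used to train the model was prepared by a group of volunteers who spent many hours cleaning and reviewing it.

قام بإعداد البيانات التي تم تدريب النموذج عليها مجموعة من المتطوعين الذين قضوا ساعات يقومون بتنقيح البيانات و مراجعتها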

- على سيد عبد الحفيظ - إشراف
- نرمين محمد عطيه 
- احمد علي عبدربه
- عمر بن عبد العزيز سليمان
- محمد ابراهيم الجمال
- عبدالرحمن سلامه خلف
- إبراهيم كمال محمد سليمان
- حسن مصطفى حسن 
- أحمد فتحي سيد
- عثمان مندو
- عارف الشريف
- أميرة محمد محمود
- حسن سعيد حسن
- عبد العزيز علي البغدادي
- واثق عبدالملك الشويطر
- عمرو رمضان عقل الحفناوي
- حسام الدين أحمد على
- أسامه أحمد محمد محمد
- حاتم محمد المفتي
- عبد الله دردير
- أدهم البغدادي
- أحمد صبري
- عبدالوهاب محمد محمد
- أحمد محمد عوض