metadata

language: ar
datasets:
  - Marefa-NER

Tebyan تبيـان

Marefa Arabic Named Entity Recognition Model

The Marefa model for classifying text spans

Model description

Marefa-NER is a large Arabic named entity recognition (NER) model trained on a completely new dataset. It aims to extract up to 9 different entity types:

Person, Location, Organization, Nationality, Job, Product, Event, Time, Art-Work

How to use the model

Install transformers and nltk (Python >= 3.6)

$ pip3 install transformers==4.3.0 nltk==3.5 protobuf==3.15.3 torch==1.7.1

If you are using Google Colab, please restart your runtime after installing the packages.
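As a quick sanity check after installing, the sketch below compares the installed package versions with the pins from the command above. It is an illustration, not part of the original instructions: `PINNED`, `installed_version`, and `find_mismatches` are hypothetical names, and the sketch assumes Python 3.8+ because it relies on `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the install command above.
PINNED = {
    "transformers": "4.3.0",
    "nltk": "3.5",
    "protobuf": "3.15.3",
    "torch": "1.7.1",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    # Print one line per package that is missing or at a different version.
    for package, (expected, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

An empty result from `find_mismatches(PINNED)` means the environment matches the pinned versions.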

from collections import defaultdict

# NLTK's punkt data is required for word tokenization
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# ===== import the model
m_name = "marefa-nlp/marefa-ner"
tokenizer = AutoTokenizer.from_pretrained(m_name)
model = AutoModelForTokenClassification.from_pretrained(m_name)

# ===== build the NER pipeline
nlp = pipeline("ner", model=model, tokenizer=tokenizer, grouped_entities=True)

# ===== extract the entities from a sample text
example = 'قاد عمر المختار القوات في ليبيا ضد الجيش الإيطالي'
# clean the text
example = " ".join(word_tokenize(example))
# feed to the NER model to parse
ner_results = nlp(example)

# the grouped_entities option can still split a single Arabic entity into several pieces,
# so the code below merges adjacent pieces of the same type into full entities

grouped_ner_results = defaultdict(list)
fixed_ner_results = []
for ent in ner_results:
  grouped_ner_results[ent['entity_group']].append(ent)


for group, ents in grouped_ner_results.items():
  if len(ents) == 1:
    fixed_ner_results.append(ents[0])
    continue
  
  current_ent = {"word": ents[0]['word'], "start": ents[0]['start'], "end": ents[0]['end'], "entity_group": group, "score": ents[0]['score']}
  for i in range(1, len(ents)):
    if ents[i]['start'] == current_ent["end"]:
      current_ent["word"] += ents[i]['word']
      current_ent["end"] = ents[i]['end']
      current_ent["score"] = max(ents[i]['score'], current_ent["score"])
    else:
      fixed_ner_results.append(current_ent)
      current_ent = {"word": ents[i]['word'], "start": ents[i]['start'], "end": ents[i]['end'], "entity_group": group, "score": ents[i]['score']}
  
  fixed_ner_results.append(current_ent)
  
# sort entities
fixed_ner_results = sorted(fixed_ner_results, key=lambda e: e['start'], reverse=False)

# ===== print the ner_results
for ent in fixed_ner_results:
  print(ent["word"], '->', ent['entity_group'], " # score:", "%.2f" % ent['score'])
  
#####
# عمر المختار -> person  # score: 1.00
# ليبيا -> location  # score: 0.99
# الجيش الإيطالي -> organization  # score: 0.99
####
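The merging step above can also be wrapped in a reusable function. The sketch below is a minimal restatement of the same logic and runs without the model: `merge_adjacent_entities` is a hypothetical helper name, the `sample` entries are hand-written stand-ins for pipeline output (not real model predictions), and sorting each group by `start` is an added assumption so the function does not depend on input order.

```python
from collections import defaultdict


def merge_adjacent_entities(entities):
    """Merge same-type entities whose character spans touch (next start == previous end)."""
    by_group = defaultdict(list)
    for ent in entities:
        by_group[ent["entity_group"]].append(ent)

    merged = []
    for ents in by_group.values():
        # Assumption: process pieces in text order, even if the input is unordered.
        ents = sorted(ents, key=lambda e: e["start"])
        current = dict(ents[0])
        for ent in ents[1:]:
            if ent["start"] == current["end"]:
                # Contiguous pieces of one entity: extend the span, keep the best score.
                current["word"] += ent["word"]
                current["end"] = ent["end"]
                current["score"] = max(current["score"], ent["score"])
            else:
                merged.append(current)
                current = dict(ent)
        merged.append(current)

    # Return entities in the order they appear in the text.
    return sorted(merged, key=lambda e: e["start"])


# Hand-written stand-in for pipeline output: one person split into two touching pieces.
sample = [
    {"entity_group": "person", "word": "عمر المخ", "start": 0, "end": 8, "score": 0.97},
    {"entity_group": "person", "word": "تار", "start": 8, "end": 11, "score": 0.99},
    {"entity_group": "location", "word": "ليبيا", "start": 15, "end": 20, "score": 0.99},
]

merged = merge_adjacent_entities(sample)
for ent in merged:
    print(ent["word"], "->", ent["entity_group"], "# score:", "%.2f" % ent["score"])
```

With real pipeline output, `merge_adjacent_entities(ner_results)` would take the place of the two loops and the final sort.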

Acknowledgment

The data used to train the model was prepared by a group of volunteers who spent hours cleaning and reviewing it:

على سيد عبد الحفيظ (supervision)
نرمين محمد عطيه
احمد علي عبدربه
عمر بن عبد العزيز سليمان
محمد ابراهيم الجمال
عبدالرحمن سلامه خلف
إبراهيم كمال محمد سليمان
حسن مصطفى حسن
أحمد فتحي سيد
عثمان مندو
عارف الشريف
أميرة محمد محمود
حسن سعيد حسن
عبد العزيز علي البغدادي
واثق عبدالملك الشويطر
عمرو رمضان عقل الحفناوي
حسام الدين أحمد على
أسامه أحمد محمد محمد
حاتم محمد المفتي
عبد الله دردير
أدهم البغدادي
أحمد صبري
عبدالوهاب محمد محمد
أحمد محمد عوض