	---
	language: ar
	datasets:
	- Marefa-NER
	---

	# Tebyan تبيـان
	# Marefa Arabic Named Entity Recognition Model
	# نموذج المعرفة لتصنيف أجزاء النص

	## Model description

	Marefa-NER is a large Arabic named entity recognition (NER) model trained on a completely new dataset. It extracts up to 9 entity types:
	```
	Person, Location, Organization, Nationality, Job, Product, Event, Time, Art-Work
	```

	نموذج المعرفة لتصنيف أجزاء النص. نموذج جديد كليا من حيث البيانات المستخدمة في تدريب النموذج.
	كذلك يستهدف النموذج تصنيف حتى 9 أنواع مختلفة من أجزاء النص
	```
	شخص - مكان - منظمة - جنسية - وظيفة - منتج - حدث - توقيت - عمل إبداعي
	```

	## How to use كيف تستخدم النموذج

	Install `transformers`, `nltk`, and the other pinned dependencies (Python >= 3.6):

	`$ pip3 install transformers==4.3.0 nltk==3.5 protobuf==3.15.3 torch==1.7.1`

	> If you are using `Google Colab`, please restart your runtime after installing the packages.

	-----------

	```python
	from collections import defaultdict

	# download the NLTK punkt data, which word_tokenize needs for word tokenization
	import nltk
	nltk.download('punkt')
	from nltk.tokenize import word_tokenize

	from transformers import AutoTokenizer, AutoModelForTokenClassification
	from transformers import pipeline

	# ===== import the model
	m_name = "marefa-nlp/marefa-ner"
	tokenizer = AutoTokenizer.from_pretrained(m_name)
	model = AutoModelForTokenClassification.from_pretrained(m_name)

	# ===== build the NER pipeline
	nlp = pipeline("ner", model=model, tokenizer=tokenizer, grouped_entities=True)

	# ===== extract the entities from a sample text
	example = 'قاد عمر المختار القوات في ليبيا ضد الجيش الإيطالي'
	# clean the text: tokenize it and re-join the tokens with single spaces
	example = " ".join(word_tokenize(example))
	# feed to the NER model to parse
	ner_results = nlp(example)

	# the [grouped_entities] parameter does not perform well on Arabic text,
	# so the code below joins adjacent fragments of the same type into full entities

	grouped_ner_results = defaultdict(list)
	fixed_ner_results = []
	for ent in ner_results:
	    grouped_ner_results[ent['entity_group']].append(ent)

	for group, ents in grouped_ner_results.items():
	    if len(ents) == 1:
	        fixed_ner_results.append(ents[0])
	        continue

	    current_ent = {"word": ents[0]['word'], "start": ents[0]['start'], "end": ents[0]['end'], "entity_group": group, "score": ents[0]['score']}
	    for i in range(1, len(ents)):
	        if ents[i]['start'] == current_ent["end"]:
	            current_ent["word"] += ents[i]['word']
	            current_ent["end"] = ents[i]['end']
	            current_ent["score"] = max(ents[i]['score'], current_ent["score"])
	        else:
	            fixed_ner_results.append(current_ent)
	            current_ent = {"word": ents[i]['word'], "start": ents[i]['start'], "end": ents[i]['end'], "entity_group": group, "score": ents[i]['score']}

	    fixed_ner_results.append(current_ent)

	# sort entities
	fixed_ner_results = sorted(fixed_ner_results, key=lambda e: e['start'], reverse=False)

	# ===== print the ner_results
	for ent in fixed_ner_results:
	    print(ent["word"], '->', ent['entity_group'], " # score:", "%.2f" % ent['score'])

	#####
	# عمر المختار -> person # score: 1.00
	# ليبيا -> location # score: 0.99
	# الجيش الإيطالي -> organization # score: 0.99
	####

	```
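
The merging step in the snippet above can also be packaged as a reusable function that runs without downloading the model. The following is a minimal sketch, not the authors' code: `merge_adjacent_entities` is a hypothetical helper name, and the sample fragments (words, offsets, and scores) are invented for illustration rather than real model output. It assumes each item is a dict with the `word`, `start`, `end`, `entity_group`, and `score` keys used above, and that entity spans do not overlap.

```python
def merge_adjacent_entities(entities):
    """Join fragments of the same entity type whose character spans touch.

    Hypothetical helper: a single-pass variant of the grouping loop above.
    """
    merged = []
    # Sort by position so fragments of one entity sit next to each other.
    for ent in sorted(entities, key=lambda e: e["start"]):
        last = merged[-1] if merged else None
        if (
            last is not None
            and ent["entity_group"] == last["entity_group"]
            and ent["start"] == last["end"]
        ):
            # Same type and no gap: extend the previous entity.
            last["word"] += ent["word"]
            last["end"] = ent["end"]
            last["score"] = max(last["score"], ent["score"])
        else:
            # Copy so the caller's input dicts are never mutated.
            merged.append(dict(ent))
    return merged


# Invented fragments that mimic the pipeline's output format.
fragments = [
    {"word": "عمر", "start": 4, "end": 7, "entity_group": "person", "score": 0.98},
    {"word": " المختار", "start": 7, "end": 15, "entity_group": "person", "score": 0.99},
    {"word": "ليبيا", "start": 26, "end": 31, "entity_group": "location", "score": 0.97},
]

merged = merge_adjacent_entities(fragments)
for ent in merged:
    print(ent["word"], "->", ent["entity_group"], "# score: %.2f" % ent["score"])
```

Like the loop above, the helper keeps the highest fragment score as the score of the merged entity. Fragments of the same type that are separated by a gap stay separate entities.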

	## Acknowledgment شكر و تقدير

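	The data used to train the model was prepared by a group of volunteers who spent hours refining and reviewing it.

	قام بإعداد البيانات التي تم تدريب النموذج عليها, مجموعة من المتطوعين الذين قضوا ساعات يقومون بتنقيح البيانات و مراجعتها

	- على سيد عبد الحفيظ - إشراف (supervision)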
	- نرمين محمد عطيه
	- احمد علي عبدربه
	- عمر بن عبد العزيز سليمان
	- محمد ابراهيم الجمال
	- عبدالرحمن سلامه خلف
	- إبراهيم كمال محمد سليمان
	- حسن مصطفى حسن
	- أحمد فتحي سيد
	- عثمان مندو
	- عارف الشريف
	- أميرة محمد محمود
	- حسن سعيد حسن
	- عبد العزيز علي البغدادي
	- واثق عبدالملك الشويطر
	- عمرو رمضان عقل الحفناوي
	- حسام الدين أحمد على
	- أسامه أحمد محمد محمد
	- حاتم محمد المفتي
	- عبد الله دردير
	- أدهم البغدادي
	- أحمد صبري
	- عبدالوهاب محمد محمد
	- أحمد محمد عوض