GreekLegalRoBERTa_v3

A Greek legal version of the RoBERTa pre-trained language model.

Pre-training corpora

The pre-training corpora of GreekLegalRoBERTa_v3 include:

Pre-training details

  • We developed the code with Hugging Face's Transformers and published it in the AI-team-UoA GitHub repository (https://github.com/AI-team-UoA/GreekLegalRoBERTa).
  • We released a model for Greek legislative applications that is similar to the English FacebookAI/roberta-base model (12-layer, 768-hidden, 12-heads, 125M parameters).
  • We trained for 100k steps with a batch size of 4096 sequences of length 512 and an initial learning rate of 6e-4.
  • We pre-trained our models on 4 V100 GPUs provided by the Cyprus Research Institute. We sincerely thank the Cyprus Research Institute for providing access to Cyclone; without its support, this work would not have been possible.
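
To put the schedule above in perspective, the following sketch works out how many token positions the run processes. It uses only the batch size, sequence length, and step count stated in the list; the variable names are illustrative, and the totals are an upper bound because padded positions are counted as well.

```python
# Figures taken from the pre-training details above.
batch_size = 4096        # sequences per training step
sequence_length = 512    # token positions per sequence (padding included)
training_steps = 100_000

tokens_per_step = batch_size * sequence_length
total_tokens = tokens_per_step * training_steps

print(f"token positions per step: {tokens_per_step:,}")      # 2,097,152
print(f"token positions over the run: {total_tokens:,}")     # 209,715,200,000
```

That is roughly 2.1 million token positions per step and about 210 billion over the full 100k steps.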

Requirements

pip install torch
pip install tokenizers
pip install transformers[torch]
pip install datasets

Load Pretrained Model

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3")
model = AutoModel.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3")
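
Once the encoder is loaded, one common way to use it is to turn each text into a single vector. The sketch below does this by mean pooling the last hidden states over non-padding tokens. Mean pooling is an assumption made here for illustration, not a method described by the authors, and `mean_pool` and `embed` are hypothetical helper names.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the token vectors of each sequence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                    # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                           # avoid division by zero
    return summed / counts


def embed(texts, model_name="AI-team-UoA/GreekLegalRoBERTa_v3"):
    """Return one vector per input text (downloads the model on first use)."""
    # Imported lazily so mean_pool can be reused without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        output = model(**batch)
    return mean_pool(output.last_hidden_state, batch["attention_mask"])
```

Calling `embed(["Είναι ένας άνθρωπος."])` should return a tensor with one row per input text and 768 columns, matching the hidden size listed above.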

Use Pretrained Model as a Language Model

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the tokenizer and the masked-language-model head once
tokenizer_greek = AutoTokenizer.from_pretrained('AI-team-UoA/GreekLegalRoBERTa_v3')
lm_model_greek = AutoModelForMaskedLM.from_pretrained('AI-team-UoA/GreekLegalRoBERTa_v3')
# Wrap both in a fill-mask pipeline that ranks candidates for each <mask>
unmasker = pipeline("fill-mask", model=lm_model_greek, tokenizer=tokenizer_greek)
# ================ EXAMPLE 1 ================
print("================ EXAMPLE 1 ================")
text_1 = ' O Δικηγορος κατεθεσε ένα <mask> .'
# EN: 'The lawyer submitted a <mask>.'
# Query the pipeline once and print the five most likely fillers
predictions = unmasker(text_1, top_k=5)
for i in range(5):
  print("Model's answer "+str(i+1)+" : " +predictions[i]['token_str'])
#================ EXAMPLE 1 ================
#Model's answer 1 : letter
#Model's answer 2 : copy
#Model's answer 3 : record
#Model's answer 4 : memorandum
#Model's answer 5 : diagram


# ================ EXAMPLE 2 ================
print("================ EXAMPLE 2 ================")

text_2 = 'Είναι ένας <mask> άνθρωπος.'
# EN: 'He is a <mask> person.'
# Query the pipeline once and print the five most likely fillers
predictions = unmasker(text_2, top_k=5)
for i in range(5):
  print("Model's answer "+str(i+1)+" : " +predictions[i]['token_str'])

#================ EXAMPLE 2 ================
#Model's answer 1 : new
#Model's answer 2 : capable
#Model's answer 3 : simple
#Model's answer 4 : serious
#Model's answer 5 : small


# ================ EXAMPLE 3 ================
print("================ EXAMPLE 3 ================")

text_3 = 'Είναι ένας <mask> άνθρωπος και κάνει συχνά <mask>.'
# EN: 'He is a <mask> person and he frequently does <mask>.'
# With two <mask> tokens the pipeline returns one list of candidates per mask
predictions = unmasker(text_3, top_k=5)
for i in range(5):
  print("Model's answer "+str(i+1)+" : " +predictions[0][i]['token_str']+" , " +predictions[1][i]['token_str'])

#================ EXAMPLE 3 ================
#Model's answer 1 : simple, trips
#Model's answer 2 : new, vacations
#Model's answer 3 : small, visits
#Model's answer 4 : good, mistakes
#Model's answer 5 : serious, actions

# the most plausible prediction for the second <mask> is "trips"
# ================ EXAMPLE 4 ================
print("================ EXAMPLE 4 ================")

text_4 = ' Kαθορισμός τρόπου αξιολόγησης της επιμελείς των υπαλλήλων που παρακολουθούν προγράμματα επιμόρφωσης και <mask> .'
# EN: 'Determining how to evaluate the diligence of employees attending edification and <mask> programs.'
# Query the pipeline once and print the five most likely fillers
predictions = unmasker(text_4, top_k=5)
for i in range(5):
  print("Model's answer "+str(i+1)+" : " +predictions[i]['token_str'])

#================ EXAMPLE 4 ================
#Model's answer 1 : retraining
#Model's answer 2 : specialization
#Model's answer 3 : training
#Model's answer 4 : education
#Model's answer 5 : Retraining
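
The examples above index the pipeline result differently depending on the number of `<mask>` tokens: a flat list of candidates for one mask, and one list per mask for several. The sketch below shows one way to handle both shapes with a single helper. `top_tokens` is a hypothetical name, and the sample data is hand-written to mimic the pipeline's output structure, not real model predictions.

```python
def top_tokens(result, k=5):
    """Return one list of token strings per <mask> in the input.

    `result` is what a fill-mask pipeline call returned: a flat list of
    dicts for one mask, or a list of such lists for several masks.
    """
    if result and isinstance(result[0], dict):
        result = [result]  # single mask: wrap so both cases look alike
    return [[candidate["token_str"].strip() for candidate in per_mask[:k]]
            for per_mask in result]


# Hand-written stand-ins for pipeline output (not real model predictions).
single_mask = [{"token_str": " a", "score": 0.6}, {"token_str": " b", "score": 0.4}]
double_mask = [single_mask, [{"token_str": " c", "score": 0.9}]]

print(top_tokens(single_mask))  # [['a', 'b']]
print(top_tokens(double_mask))  # [['a', 'b'], ['c']]
```

With the pipeline from the examples, `top_tokens(unmasker(text_3, top_k=5))` would give two lists, one per mask, without calling the model more than once.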

Evaluation on downstream tasks

For detailed results read the article:

TODO

Author
