LegalBert-pt

Introduction

LegalBert-pt is a language model for the legal domain in Portuguese. The model was pre-trained to specialize in the domain and can then be fine-tuned for specific tasks. Two versions were created: one initialized from the BERTimbau model and further pre-trained, and the other trained from scratch. Perplexity measurements showed the effectiveness of the BERTimbau-based version. Experiments were also carried out on two tasks: identifying legal entities and classifying legal petitions. In all tasks, the domain-specific language models outperformed the generic language model, suggesting that specializing the language model for the legal domain is an important factor in improving the accuracy of learning algorithms.
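As a reminder of what the perplexity measure captures, the sketch below uses the standard definition: the exponential of the mean negative log-likelihood over the evaluated tokens. The function name and the log-probability values are hypothetical, and this card does not state the exact evaluation procedure used for LegalBert-pt, so treat this as an illustration of the metric only.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of the evaluated tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Mean negative log-likelihood, then exponentiate.
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical values: the same four tokens scored by two different models.
generic_model = [math.log(0.25)] * 4  # each token assigned probability 0.25
domain_model = [math.log(0.5)] * 4    # each token assigned probability 0.5

print(perplexity(generic_model))
print(perplexity(domain_model))
```

Lower is better: a model that assigns higher probability to the held-out legal text gets a lower perplexity, which is the sense in which a domain-adapted model can be compared against a generic one.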

Keywords: Language model, Legal Bert pt br, Legal domain, Portuguese Language Model

Available models

| Model | Initial model | #Layers | #Params |
|---|---|---|---|
| LegalBert-pt SC | - | 12 | 110M |
| LegalBert-pt FP | neuralmind/bert-base-portuguese-cased | 12 | 110M |

Dataset

To pre-train the versions of the LegalBert-pt language model, we collected a total of 1.5 million legal documents in Portuguese from ten Brazilian courts. The documents are of four types: initial petitions, petitions, decisions, and sentences. The table below shows how these documents are distributed across courts.

The data were obtained from the Codex system of the Brazilian National Council of Justice (CNJ), which maintains the largest and most diverse set of legal texts in Brazilian Portuguese. As part of an agreement established with the researchers who authored this article, the CNJ provided these data for our research.

| Data source | Number of documents | % |
|---|---|---|
| Court of Justice of the State of Ceará | 80,504 | 5.37 |
| Court of Justice of the State of Piauí | 90,514 | 6.03 |
| Court of Justice of the State of Rio de Janeiro | 33,320 | 2.22 |
| Court of Justice of the State of Rondônia | 971,615 | 64.77 |
| Federal Regional Court of the 3rd Region | 70,196 | 4.68 |
| Federal Regional Court of the 5th Region | 6,767 | 0.45 |
| Regional Labor Court of the 9th Region | 16,133 | 1.08 |
| Regional Labor Court of the 11th Region | 5,351 | 0.36 |
| Regional Labor Court of the 13th Region | 155,567 | 10.37 |
| Regional Labor Court of the 23rd Region | 70,033 | 4.67 |
| Total | 1,500,000 | 100.00 |
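The percentage column can be reproduced from the document counts. The short sketch below recomputes each court's share from the figures in the table; the variable names are illustrative.

```python
# Document counts per court, copied from the table above.
counts = {
    "Court of Justice of the State of Ceará": 80_504,
    "Court of Justice of the State of Piauí": 90_514,
    "Court of Justice of the State of Rio de Janeiro": 33_320,
    "Court of Justice of the State of Rondônia": 971_615,
    "Federal Regional Court of the 3rd Region": 70_196,
    "Federal Regional Court of the 5th Region": 6_767,
    "Regional Labor Court of the 9th Region": 16_133,
    "Regional Labor Court of the 11th Region": 5_351,
    "Regional Labor Court of the 13th Region": 155_567,
    "Regional Labor Court of the 23rd Region": 70_033,
}

total = sum(counts.values())

# Share of the corpus contributed by each court, in percent (two decimals).
shares = {court: round(100 * n / total, 2) for court, n in counts.items()}

# Print courts from largest to smallest contribution.
for court, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{share:6.2f}%  {court}")
```

The breakdown makes the skew of the corpus explicit: a single court accounts for roughly two thirds of the documents.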

Usage

from transformers import AutoTokenizer  # or BertTokenizer
from transformers import AutoModelForPreTraining  # or BertForPreTraining, to load the pretraining heads
from transformers import AutoModel  # or BertModel, for BERT without the pretraining heads

# Load the FP checkpoint and its tokenizer from the Hugging Face Hub
model = AutoModelForPreTraining.from_pretrained('raquelsilveira/legalbertpt_fp')
tokenizer = AutoTokenizer.from_pretrained('raquelsilveira/legalbertpt_fp')
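For downstream use without the pretraining heads, a minimal sketch of extracting sentence vectors with `AutoModel` might look like the following. The `embed` helper, the mean-pooling strategy, and the `max_length=512` limit are assumptions made for illustration; they are not part of the published model or prescribed by its authors.

```python
MODEL_ID = "raquelsilveira/legalbertpt_fp"  # checkpoint named in this card


def embed(texts, model_id=MODEL_ID, max_length=512):
    """Return one mean-pooled vector per input text (hypothetical helper)."""
    # Imported inside the function so that defining it needs no heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)  # encoder only, no pretraining heads
    model.eval()

    # Pad to the longest text in the batch and truncate overly long inputs.
    batch = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,  # assumed limit for a BERT-base encoder
        return_tensors="pt",
    )

    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

    # Average token vectors, ignoring padding positions via the attention mask.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

Calling `embed(["O réu foi condenado ao pagamento das custas processuais."])` downloads the checkpoint on first use and returns a tensor with one row per input text. Mean pooling is only one option; using the `[CLS]` vector or fine-tuning a task-specific head are common alternatives.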

Cite as

Raquel Silveira, Caio Ponte, Vitor Almeida, Vládia Pinheiro, and Vasco Furtado. 2023. LegalBert-pt: A Pretrained Language Model for the Brazilian Portuguese Legal Domain. In Intelligent Systems: 12th Brazilian Conference, BRACIS 2023, Belo Horizonte, Brazil, September 25–29, 2023, Proceedings, Part III. Springer-Verlag, Berlin, Heidelberg, 268–282. https://doi.org/10.1007/978-3-031-45392-2_18
