|
--- |
|
language: pt |
|
license: mit |
|
tags: |
|
- bert |
|
- pytorch |
|
datasets: |
|
- Twitter |
|
--- |
|
|
|
**Paper:** For more details, please refer to our paper: [BERTabaporu: Assessing a Genre-Specific Language Model for Portuguese NLP](https://aclanthology.org/2023.ranlp-1.24/) |
|
|
|
|
|
## Introduction |
|
|
|
BERTabaporu is a Brazilian Portuguese BERT model for the Twitter domain. It was pre-trained on a collection of 238 million tweets written by over 100 thousand unique Twitter users, amounting to over 2.9 billion tokens in total.
|
|
|
## Available models |
|
|
|
| Model | Arch. | #Layers | #Params | |
|
| ---------------------------------------- | ---------- | ------- | ------- | |
|
| `pablocosta/bertabaporu-base-uncased` | BERT-Base | 12 | 110M | |
|
| `pablocosta/bertabaporu-large-uncased` | BERT-Large | 24 | 335M | |
|
|
|
## Usage |
|
|
|
```python
from transformers import AutoTokenizer  # or BertTokenizer
from transformers import AutoModelForPreTraining  # or BertForPreTraining, to load the pre-training heads
from transformers import AutoModel  # or BertModel, for BERT without the pre-training heads

tokenizer = AutoTokenizer.from_pretrained('pablocosta/bertabaporu-base-uncased')
model = AutoModelForPreTraining.from_pretrained('pablocosta/bertabaporu-base-uncased')
```
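
Once loaded, the model can be used like any other BERT encoder. Below is a minimal sketch (the example tweet text is made up for illustration) that extracts contextual embeddings with `AutoModel`, i.e. without the pre-training heads:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('pablocosta/bertabaporu-base-uncased')
model = AutoModel.from_pretrained('pablocosta/bertabaporu-base-uncased')

# Hypothetical tweet text, for illustration only
text = "hoje o dia está lindo demais!"

# Tokenize and run a forward pass without gradient tracking
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings for each token: (batch, seq_len, hidden_size)
token_embeddings = outputs.last_hidden_state

# One simple sentence representation: mean over the token embeddings
sentence_embedding = token_embeddings.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768]) for the base model
```

Mean pooling is only one possible sentence representation; note that the hidden size is 768 for the base model and 1024 for the large model.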
|
|
|
|
|
|
|
|
|
|
|
|
|
## Cite us |
|
|
|
|
|
```bibtex
@inproceedings{costa-etal-2023-bertabaporu,
    title = "{BERT}abaporu: Assessing a Genre-Specific Language Model for {P}ortuguese {NLP}",
    author = "Costa, Pablo Botton and
      Pavan, Matheus Camasmie and
      Santos, Wesley Ramos and
      Silva, Samuel Caetano and
      Paraboni, Ivandr{\'e}",
    editor = "Mitkov, Ruslan and
      Angelova, Galia",
    booktitle = "Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing",
    month = sep,
    year = "2023",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd., Shoumen, Bulgaria",
    url = "https://aclanthology.org/2023.ranlp-1.24",
    pages = "217--223",
    abstract = "Transformer-based language models such as Bidirectional Encoder Representations from Transformers (BERT) are now mainstream in the NLP field, but extensions to languages other than English, to new domains and/or to more specific text genres are still in demand. In this paper we introduced BERTabaporu, a BERT language model that has been pre-trained on Twitter data in the Brazilian Portuguese language. The model is shown to outperform the best-known general-purpose model for this language in three Twitter-related NLP tasks, making a potentially useful resource for Portuguese NLP in general.",
}
```