|
--- |
|
language: pt |
|
license: mit |
|
tags: |
|
- bert |
|
- pytorch |
|
datasets: |
|
- Twitter |
|
--- |
|
|
|
**Paper:** For more details, please refer to our paper: [BERTabaporu: Assessing a Genre-Specific Language Model for Portuguese NLP](https://aclanthology.org/2023.ranlp-1.24/) |
|
|
|
|
|
## Introduction |
|
|
|
BERTabaporu is a Brazilian Portuguese BERT model for the Twitter domain. It was pre-trained on a collection of 238 million tweets written by over 100 thousand unique Twitter users, amounting to over 2.9 billion tokens in total.
|
|
|
## Available models |
|
|
|
| Model | Arch. | #Layers | #Params | |
|
| ---------------------------------------- | ---------- | ------- | ------- | |
|
| `pablocosta/bertabaporu-base-uncased` | BERT-Base | 12 | 110M | |
|
| `pablocosta/bertabaporu-large-uncased` | BERT-Large | 24 | 335M | |
|
|
|
## Usage |
|
|
|
```python
from transformers import AutoTokenizer  # or BertTokenizer
from transformers import AutoModelForPreTraining  # or BertForPreTraining, to load the pre-training heads
from transformers import AutoModel  # or BertModel, for BERT without the pre-training heads

tokenizer = AutoTokenizer.from_pretrained('pablocosta/bertabaporu-base-uncased')
model = AutoModelForPreTraining.from_pretrained('pablocosta/bertabaporu-base-uncased')
```
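
Once loaded, the model can be used like any other BERT encoder. Below is a minimal sketch (the example tweet text is made up for illustration) that extracts contextual embeddings with `AutoModel`, i.e. without the pre-training heads:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('pablocosta/bertabaporu-base-uncased')
model = AutoModel.from_pretrained('pablocosta/bertabaporu-base-uncased')

# Hypothetical tweet text, for illustration only
text = "hoje o dia está lindo demais!"

# Tokenize and run a forward pass without gradient tracking
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings for each token: (batch, seq_len, hidden_size)
token_embeddings = outputs.last_hidden_state

# One simple sentence representation: mean over the token embeddings
sentence_embedding = token_embeddings.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768]) for the base model
```

Mean pooling is only one possible sentence representation; note that the hidden size is 768 for the base model and 1024 for the large model.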
|
|
|
|
|
|
|
|
|
|
|
|
|
## Cite us |
|
|
|
|
|
```bibtex
@inproceedings{costa-etal-2023-bertabaporu,
    title = "{BERT}abaporu: Assessing a Genre-Specific Language Model for {P}ortuguese {NLP}",
    author = "Costa, Pablo Botton and
      Pavan, Matheus Camasmie and
      Santos, Wesley Ramos and
      Silva, Samuel Caetano and
      Paraboni, Ivandr{\'e}",
    editor = "Mitkov, Ruslan and
      Angelova, Galia",
    booktitle = "Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing",
    month = sep,
    year = "2023",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd., Shoumen, Bulgaria",
    url = "https://aclanthology.org/2023.ranlp-1.24",
    pages = "217--223",
    abstract = "Transformer-based language models such as Bidirectional Encoder Representations from Transformers (BERT) are now mainstream in the NLP field, but extensions to languages other than English, to new domains and/or to more specific text genres are still in demand. In this paper we introduced BERTabaporu, a BERT language model that has been pre-trained on Twitter data in the Brazilian Portuguese language. The model is shown to outperform the best-known general-purpose model for this language in three Twitter-related NLP tasks, making a potentially useful resource for Portuguese NLP in general.",
}
```