I want to fine-tune this model:
model = BertForTokenClassification.from_pretrained('monilouise/ner_pt_br')
with this dataset:
raw_datasets = load_dataset('lener_br')
The raw_datasets loaded this way are already tokenized and encoded, and I don't know how the tokenization was done. Now I want to pad the inputs, but I don't know how to use DataCollatorWithPadding in this case.
I noticed that this dataset is similar to the wnut dataset from the docs, but I still can't figure out what I should do.
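For context, here is a minimal pure-Python sketch of what padding a token-classification batch involves (this is roughly what transformers' DataCollatorForTokenClassification does, as opposed to DataCollatorWithPadding, which does not pad the labels). The example data, pad token id 0, and label pad id -100 are assumptions for illustration, not taken from the lener_br dataset itself:

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad a list of token-classification examples to the same length.

    input_ids are padded with pad_token_id (0 for BERT vocabularies),
    labels with -100 so the loss function ignores padded positions,
    and attention_mask marks real tokens with 1 and padding with 0.
    """
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Hypothetical pre-tokenized examples of different lengths:
features = [
    {"input_ids": [101, 7592, 102], "labels": [0, 1, 0]},
    {"input_ids": [101, 2088, 999, 102], "labels": [0, 2, 0, 0]},
]
padded = pad_batch(features)
```

If this matches what the dataset needs, passing DataCollatorForTokenClassification(tokenizer) as the data_collator to the Trainer would do the same thing batch by batch.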