I’m having some strange issues with the data collator and DataLoader. I get the following error:

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

What is strange is that, as you can see from the code below, I’m both truncating and padding. When the dataset comes back from the tokenizer, the input ids are all the same length; however, when I check the input id lengths after the samples are loaded into the DataLoader, the lengths are variable. If I remove the collator and batch size arguments, everything works fine with the same code. I assume I’m doing something stupid with the data collator? But I’ve tried a couple of collators, datasets, models, and tokenizers, and I hit the same issue every time. Any thoughts?
from transformers import (
    DataCollatorWithPadding,
    DataCollatorForTokenClassification,
    AutoTokenizer,
    AutoModelForTokenClassification,
)
from datasets import load_dataset
from torch.utils.data import DataLoader

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
# Tokenizing function: pad and truncate each batch of samples
def tokenize(data):
    tokenized_samples = tokenizer(
        data["tokens"],
        is_split_into_words=True,
        truncation=True,
        padding=True,
    )
    return tokenized_samples

# Load the train split of the WikiANN (bn) dataset
dataset = load_dataset("wikiann", "bn")["train"]

# Tokenize the dataset with padding and truncation
dataset_tokenized = dataset.map(tokenize, batched=True)

# Remove the extra columns
dataset_tokenized = dataset_tokenized.remove_columns(["langs", "spans", "tokens"])

# Rename the ner_tags column to labels
dataset_tokenized = dataset_tokenized.rename_column("ner_tags", "labels")

# Instantiate the collator - note: I also tried this with DataCollatorWithPadding
collator = DataCollatorForTokenClassification(tokenizer)

# Instantiate the PyTorch DataLoader
dl = DataLoader(dataset_tokenized, shuffle=True, collate_fn=collator, batch_size=2)
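
For reference, here is roughly how I’m checking the lengths. This is a minimal sketch of what I ran, not my exact code; it reuses the dataset_tokenized and dl objects defined above, and the dl_plain name is just for this post.

# 1) Input ids straight out of map() - the first few all have the same length
print({len(ids) for ids in dataset_tokenized["input_ids"][:10]})

# 2) Without the collator and batch size arguments, the same dataset
#    iterates fine, but the per-sample lengths I see here are variable
dl_plain = DataLoader(dataset_tokenized, shuffle=True)
print({len(batch["input_ids"]) for batch, _ in zip(dl_plain, range(10))})

# 3) With the collator, the very first batch raises the ValueError above
batch = next(iter(dl))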