Padding in datasets

Before I started using the datasets library, I usually padded each batch separately.

I found that dataset.map supports batched and batch_size. But it seems that only padding all examples (in dataset.map) to a fixed length or to max_length makes sense, given the batch_size used later when creating the DataLoader.

Otherwise, if I use a map function like lambda x: tokenizer(x["sentence"], padding=True, truncation=True), I get errors like RuntimeError: stack expects each tensor to be equal size, but got [56] at entry 0 and [53] at entry 8 when iterating over the dataloader, since I could not find a way to make the dataloader iterate over the same batches that dataset.map used.

Am I right?

That’s because padding=True makes the tokenizer pad to the longest sequence in the batch it is given. Therefore, two batches may have different lengths.
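
To make the mismatch concrete, here is a minimal pure-Python sketch. It does not call the real tokenizer or dataset.map: pad_to_longest and map_in_batches are hypothetical stand-ins that only imitate "pad to the longest sequence in the batch you are given", and the token ids are made up.

```python
def pad_to_longest(seqs, pad_id=0):
    """Pad every sequence to the longest one in this group (hypothetical helper)."""
    longest = max(len(s) for s in seqs)
    return [s + [pad_id] * (longest - len(s)) for s in seqs]


def map_in_batches(seqs, batch_size):
    """Imitate a batched map: pad each chunk of batch_size examples independently."""
    out = []
    for start in range(0, len(seqs), batch_size):
        out.extend(pad_to_longest(seqs[start:start + batch_size]))
    return out


# Made-up token ids standing in for tokenized sentences.
token_ids = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10, 11], [12, 13], [14, 15, 16, 17]]

padded = map_in_batches(token_ids, batch_size=3)
lengths = [len(s) for s in padded]
print(lengths)  # [3, 3, 3, 5, 5, 5]

# A loader that later draws 4 examples takes rows from both chunks,
# so the rows in that batch no longer share one length.
loader_batch = padded[:4]
print({len(s) for s in loader_batch})
```

Each chunk is consistent on its own, but a later batch that straddles two chunks (or is shuffled) mixes lengths, which is what the stack error complains about.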

Then having all the examples in the dataset padded to the same length could slow down training, right?

1 Like

Padding all the examples to the same length makes training slower than padding each batch to its own maximum length.
You can use a data_collator in your PyTorch dataloader that pads each batch to its maximum length.
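
As a rough illustration of the cost difference, the sketch below counts how many token positions (real tokens plus padding) the model would process under each strategy. The sequence lengths, the batch size of 4, and the fixed length of 128 are all made-up assumptions, and padded_token_count is a hypothetical helper, not a library function.

```python
def padded_token_count(lengths, batch_size, fixed_length=None):
    """Total token positions (real + padding) processed in one pass over the data.

    If fixed_length is given, every example is padded to it;
    otherwise each batch is padded to its own longest example.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        width = fixed_length if fixed_length is not None else max(batch)
        total += width * len(batch)
    return total


# Made-up sequence lengths for eight examples.
lengths = [12, 15, 9, 40, 11, 14, 10, 13]

fixed = padded_token_count(lengths, batch_size=4, fixed_length=128)
dynamic = padded_token_count(lengths, batch_size=4)

print(fixed)    # 1024
print(dynamic)  # 216
```

The exact ratio depends entirely on the length distribution, so treat these numbers as an illustration rather than a benchmark.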

@lhoestq do you have an example of that?

@maximin what was your solution in place of lambda entry: self.tokenizer(entry[ padding=True,)?

padding=True in the data_collator pads to the maximum length of the batch, so that’s the way to go.

If you want to do the tokenization in map instead of in the data collator, you can, but you must then add an extra padding step in the data_collator to make sure all the examples in each batch have the same length.
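
A minimal sketch of such a padding step, assuming the examples were tokenized in map without padding and arrive as dicts with input_ids and attention_mask lists. pad_collate is a hypothetical name, pad_token_id=0 is an assumption (use your tokenizer's real pad id), and plain lists are used here to keep the example dependency-free.

```python
def pad_collate(features, pad_token_id=0):
    """Pad a list of tokenized examples to the longest one in this batch.

    features: list of dicts with unpadded "input_ids" and "attention_mask" lists.
    pad_token_id: assumed to be 0 here; take it from your tokenizer in practice.
    """
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = longest - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Padding positions get mask 0 so the model ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch


# Made-up examples, as they might look after tokenizing without padding.
features = [
    {"input_ids": [5, 6, 7], "attention_mask": [1, 1, 1]},
    {"input_ids": [8], "attention_mask": [1]},
]

batch = pad_collate(features)
print(batch["input_ids"])       # [[5, 6, 7], [8, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

With PyTorch you would convert the padded lists to tensors and pass the function to the DataLoader through its collate_fn argument. The transformers library also provides DataCollatorWithPadding, which is built from a tokenizer and serves the same purpose, so a hand-written function like this is mostly useful for understanding what happens.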