Unpadding and Sequence Packing inference example?
Hi guys, where can I find "Unpadding and Sequence Packing" inference example?
As I understand it, there should be code that dynamically "packs" the incoming requests into proper batches.
Hello!
The blogpost explains it a bit more: https://huggingface.co/blog/modernbert#unpadding-and-sequence-packing
In short: the unpadding and sequence packing is done automatically in the forward method of the ModernBertModel class.
Unpadding: https://github.com/huggingface/transformers/blob/d5aebc64653d09660818109f2fac55b5e1031023/src/transformers/models/modernbert/modeling_modernbert.py#L884-L886
Repadding: https://github.com/huggingface/transformers/blob/d5aebc64653d09660818109f2fac55b5e1031023/src/transformers/models/modernbert/modeling_modernbert.py#L932-L934
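To give an intuition for what those linked lines do, here is a simplified, hedged sketch of the unpadding idea (not the actual ModernBERT implementation): the non-padding tokens of the whole batch are flattened into one packed 1-D sequence, and cumulative sequence lengths (cu_seqlens) tell the flash attention kernel where each original sequence starts and ends.

```python
import torch

def unpad(input_ids: torch.Tensor, attention_mask: torch.Tensor):
    # indices of real (non-padding) tokens in the flattened (batch * seq_len) view
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    # per-sequence lengths and their cumulative offsets, prefixed with 0
    seqlens = attention_mask.sum(dim=1, dtype=torch.int32)
    cu_seqlens = torch.nn.functional.pad(
        torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0)
    )
    # one packed sequence containing only the real tokens
    unpadded_ids = input_ids.flatten()[indices]
    return unpadded_ids, indices, cu_seqlens

ids = torch.tensor([[101, 7, 8, 0, 0],
                    [101, 9, 10, 11, 12]])
mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1]])
print(unpad(ids, mask))  # 8 packed tokens, cu_seqlens = [0, 3, 8]
```

Repadding is simply the inverse: the outputs for the packed tokens are scattered back into the original (batch, seq_len, hidden) shape using the saved indices.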
As you can see in the linked code, this only happens if the attn_implementation is set to "flash_attention_2", e.g. if the model is loaded with AutoModel.from_pretrained("answerdotai/ModernBERT-base", attn_implementation="flash_attention_2").
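So from the caller's side there is nothing special to do: you tokenize a padded batch and call the model as usual, and the unpadding/repadding happens inside forward. A minimal sketch (assuming a CUDA GPU with flash_attn installed; the model id is the real one from the blogpost):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModel.from_pretrained(
    "answerdotai/ModernBERT-base",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,  # flash attention requires fp16/bf16
).to("cuda")

# Sentences of different lengths; the tokenizer pads them into one batch.
sentences = [
    "ModernBERT handles unpadding internally.",
    "A much longer second sentence so the batch actually contains padding tokens.",
]
inputs = tokenizer(sentences, padding=True, return_tensors="pt").to("cuda")

with torch.inference_mode():
    outputs = model(**inputs)

# The hidden states come back repadded in the usual (batch, seq_len, hidden) shape.
print(outputs.last_hidden_state.shape)
```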
If you have the requirements (e.g. flash_attn, CUDA, etc.), then this is the default attention implementation.
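If you want to verify which implementation was actually picked up, you can inspect the loaded config. Note that _attn_implementation is an internal attribute, so treat this as a debugging aid rather than a stable API:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("answerdotai/ModernBERT-base")
# e.g. "flash_attention_2" if available, otherwise "sdpa" or "eager"
print(model.config._attn_implementation)
```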
- Tom Aarsen