Unpadding and Sequence Packing inference example?

#34
by denti

Hi guys, where can I find an "Unpadding and Sequence Packing" inference example?
As I understand it, there should be code that dynamically "packs" the incoming requests into proper batches.

Hello!

The blogpost explains it a bit more: https://huggingface.co/blog/modernbert#unpadding-and-sequence-packing
In short: the unpadding and sequence packing is done automatically in the forward method of the ModernBertModel class.
Unpadding: https://github.com/huggingface/transformers/blob/d5aebc64653d09660818109f2fac55b5e1031023/src/transformers/models/modernbert/modeling_modernbert.py#L884-L886
Repadding: https://github.com/huggingface/transformers/blob/d5aebc64653d09660818109f2fac55b5e1031023/src/transformers/models/modernbert/modeling_modernbert.py#L932-L934
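To make the idea concrete, here is a minimal, conceptual sketch (not the actual transformers code linked above) of what unpadding and repadding do: the non-pad tokens of all sequences in a batch are flattened into one long sequence plus cumulative-length offsets, attention runs on that flat tensor, and the result is scattered back into the padded [batch, seq] layout. The tensors and variable names below are illustrative only.

```python
import torch

input_ids = torch.tensor([[101, 7592,  102,   0,   0],
                          [101, 7592, 2088, 999, 102]])
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])

# Unpad: keep only the real tokens and remember where each sequence starts/ends.
indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
unpadded_ids = input_ids.flatten()[indices]                       # shape [total_tokens]
seqlens = attention_mask.sum(dim=1)
cu_seqlens = torch.nn.functional.pad(seqlens.cumsum(0), (1, 0))   # e.g. [0, 3, 8]

# ... flash attention would attend over the flat `unpadded_ids` using `cu_seqlens`
# as sequence boundaries, so no compute is spent on pad tokens ...

# Repad: scatter the flat tokens back into the padded [batch, seq] layout.
repadded = torch.zeros_like(input_ids).flatten()
repadded[indices] = unpadded_ids
repadded = repadded.view_as(input_ids)
assert torch.equal(repadded * attention_mask, input_ids * attention_mask)
```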

As you can see, this only happens if the attn_implementation is set to "flash_attention_2", e.g. if the model is loaded with AutoModel.from_pretrained("answerdotai/ModernBERT-base", attn_implementation="flash_attention_2").
If you have the requirements (e.g. flash_attn, CUDA, etc.), then this is the default attention implementation.
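A short usage sketch, based on the loading call mentioned above: the model id and attn_implementation come from this thread, while the dtype and device choices are assumptions (flash_attn requires fp16/bf16 on a CUDA device). You pass a normally padded batch; the unpadding and repadding happen inside the forward pass.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,  # assumption: flash_attn needs half precision
).to("cuda")

# A ragged batch: the tokenizer pads as usual ...
texts = [
    "ModernBERT is fast.",
    "Unpadding strips pad tokens before attention and repads the output afterwards.",
]
batch = tokenizer(texts, padding=True, return_tensors="pt").to("cuda")

with torch.inference_mode():
    outputs = model(**batch)

# ... but internally the pad tokens were removed before attention, and the output
# was repadded, so last_hidden_state still has the usual [batch, seq, hidden] shape.
print(outputs.last_hidden_state.shape)
```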

  • Tom Aarsen
