Unpadding and Sequence Packing inference example?
Hi guys, where can I find "Unpadding and Sequence Packing" inference example?
As I understand it, there should be code that dynamically "packs" the incoming requests into proper batches.
Hello!
The blogpost explains it a bit more: https://huggingface.co/blog/modernbert#unpadding-and-sequence-packing
In short: the unpadding and sequence packing is done automatically in the forward method of the ModernBertModel class.
Unpadding: https://github.com/huggingface/transformers/blob/d5aebc64653d09660818109f2fac55b5e1031023/src/transformers/models/modernbert/modeling_modernbert.py#L884-L886
Repadding: https://github.com/huggingface/transformers/blob/d5aebc64653d09660818109f2fac55b5e1031023/src/transformers/models/modernbert/modeling_modernbert.py#L932-L934
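To give an intuition for what those linked lines do, here is a simplified, hedged sketch of the unpadding idea (not the actual ModernBERT implementation): the non-padding tokens of the whole batch are flattened into one packed 1-D sequence, and cumulative sequence lengths (cu_seqlens) tell the flash attention kernel where each original sequence starts and ends.

```python
import torch

def unpad(input_ids: torch.Tensor, attention_mask: torch.Tensor):
    # indices of real (non-padding) tokens in the flattened (batch * seq_len) view
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    # per-sequence lengths and their cumulative offsets, prefixed with 0
    seqlens = attention_mask.sum(dim=1, dtype=torch.int32)
    cu_seqlens = torch.nn.functional.pad(
        torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0)
    )
    # one packed sequence containing only the real tokens
    unpadded_ids = input_ids.flatten()[indices]
    return unpadded_ids, indices, cu_seqlens

ids = torch.tensor([[101, 7, 8, 0, 0],
                    [101, 9, 10, 11, 12]])
mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1]])
print(unpad(ids, mask))  # 8 packed tokens, cu_seqlens = [0, 3, 8]
```

Repadding is simply the inverse: the outputs for the packed tokens are scattered back into the original (batch, seq_len, hidden) shape using the saved indices.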
As you can see in the linked code, this only happens if the attn_implementation is set to "flash_attention_2", e.g. if the model is loaded with AutoModel.from_pretrained("answerdotai/ModernBERT-base", attn_implementation="flash_attention_2").
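So from the caller's side there is nothing special to do: you tokenize a padded batch and call the model as usual, and the unpadding/repadding happens inside forward. A minimal sketch (assuming a CUDA GPU with flash_attn installed; the model id is the real one from the blogpost):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModel.from_pretrained(
    "answerdotai/ModernBERT-base",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,  # flash attention requires fp16/bf16
).to("cuda")

# Sentences of different lengths; the tokenizer pads them into one batch.
sentences = [
    "ModernBERT handles unpadding internally.",
    "A much longer second sentence so the batch actually contains padding tokens.",
]
inputs = tokenizer(sentences, padding=True, return_tensors="pt").to("cuda")

with torch.inference_mode():
    outputs = model(**inputs)

# The hidden states come back repadded in the usual (batch, seq_len, hidden) shape.
print(outputs.last_hidden_state.shape)
```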
If you have the requirements (e.g. flash_attn, CUDA, etc.), then this is the default attention implementation.
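If you want to verify which implementation was actually picked up, you can inspect the loaded config. Note that _attn_implementation is an internal attribute, so treat this as a debugging aid rather than a stable API:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("answerdotai/ModernBERT-base")
# e.g. "flash_attention_2" if available, otherwise "sdpa" or "eager"
print(model.config._attn_implementation)
```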
- Tom Aarsen