Bidirectional or Causal?
class BidirectionalMistralModel(MistralModel):
    config_class = BidirectionalMistralConfig

    def __init__(self, config: MistralConfig):
        super().__init__(config)
        for layer in self.layers:
            layer.self_attn.is_causal = False
        self._attn_implementation = "eager"
However, MistralAttention doesn't use is_causal.
MistralFlashAttention2 uses layer.self_attn.is_causal.
MistralSdpaAttention doesn't use layer.self_attn.is_causal.
Hi, @AlignLearner. Please refer to this line, where SDPA attention uses is_causal: "https://github.com/huggingface/transformers/blob/v4.37.2/src/transformers/models/mistral/modeling_mistral.py#L692".
@nada5
However, in v4.44.2, SDPA attention doesn't use is_causal:
https://github.com/huggingface/transformers/blob/v4.44.2/src/transformers/models/mistral/modeling_mistral.py#L475C9-L484C10
Following the required packages listed at https://huggingface.co/nvidia/NV-Embed-v2#2-required-packages, I also checked v4.42.4, and there SDPA attention doesn't use is_causal either:
https://github.com/huggingface/transformers/blob/v4.42.4/src/transformers/models/mistral/modeling_mistral.py#L645C19-L655C1
Hi, @AlignLearner. In fact, NV-Embed uses eager mode, which does not go through SDPA attention.
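For anyone following along, here is a minimal pure-Python sketch of why the attention mask matters in an eager-style attention path. It is not the transformers or NV-Embed code: the function names `eager_attention`, `causal_mask`, and `bidirectional_mask` are hypothetical, and the sketch assumes that eager attention simply adds a mask to the scores before the softmax, so whether the result is causal depends on the mask that is passed in rather than on an is_causal flag.

```python
import math

NEG_INF = float("-inf")


def causal_mask(n):
    # Additive mask: position i may only attend to positions j <= i.
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    # Additive mask of zeros: every position may attend to every other.
    return [[0.0] * n for _ in range(n)]


def eager_attention(q, k, v, mask):
    # Toy single-head attention: softmax(q @ k^T / sqrt(d) + mask) @ v.
    # Note there is no is_causal argument; only the mask controls visibility.
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) + mask[i][j]
            for j, kj in enumerate(k)
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # exp(-inf) is 0.0
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append(
            [sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))]
        )
    return out


q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

causal_out = eager_attention(q, k, v, causal_mask(3))
bidir_out = eager_attention(q, k, v, bidirectional_mask(3))

print("causal, first token:       ", causal_out[0])
print("bidirectional, first token:", bidir_out[0])
```

With the causal mask the first token can only see itself, so its output is just its own value vector. With the all-zero mask it also mixes in the later tokens. The last token sees everything in both cases, so its output is the same either way.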