Sin2pi committed · verified
Commit 71633df · 1 Parent(s): 4f942d7

Update README.md

Files changed (1): README.md +24 -34
README.md CHANGED
@@ -1,51 +1,41 @@
- Rotary embeddings with added rotation_dynamics are showing real promise. Combo of these two:

- This is experimental code for integrating different attention mechanisms in a whisper-like transformer model. Although everything runs and works, the code is not complete and is meant as a reference for further development.

- The idea is to create a transformer model that can learn from different attention mechanisms and combine them to improve performance. This model includes the following attention mechanisms:
 
- 1. Dynamic Convolutional Attention

- 2. Hybrid Attention

- 3. Biased Attention

- 4. Augmented Memory

- 5. Rotary and Learned Sinusoidal Embeddings

- 6. Dynamic Attention Integration: A mechanism that dynamically weights and integrates the outputs of multiple attention mechanisms based on their relevance.

- 7. Rotation-Based Attention: Applied rotational dynamics to the attention mechanisms to reduce interference and enhance learning.
 
- Other components like LayerNorm, GroupNorm (RMS), custom linear and Conv1d blocks, and an MLP are also included to complete the transformer block. The code is written in PyTorch and is meant as a reference for further development and experimentation. A training loop and data processing are included in PyTorch, with optional adaptation for the Hugging Face Trainer and datasets. The code is not optimized and is intended for educational purposes only.

- The goal is to integrate these mechanisms into a single transformer block that can leverage the strengths of each attention mechanism to improve performance on various tasks.

- Natural Synergies and Integration Ideas for Future Stuff and Things:
- Dynamic Convolutional Attention and Hybrid Attention

- Natural Synergy: Dynamic convolutional attention can adapt convolutional filters based on the input, providing local context that hybrid attention can leverage. Combining these can enhance both global and local context understanding.

- Integration Idea: Use dynamic convolutional attention within the local convolution part of the hybrid attention mechanism. This way, the dynamic adjustments made by the convolutional attention can be directly utilized by the hybrid attention layers.
 
- Biased Attention and Augmented Memory
-
- Natural Synergy: Biased attention can prioritize important features, while augmented memory can store and retrieve long-term dependencies. Together, they can ensure that important features are not only highlighted but also remembered over long sequences.
-
- Integration Idea: Embed bias terms in the augmented memory retrieval process, allowing the model to focus on and recall important features over extended periods.
-
- Rotary and Learned Sinusoidal Embeddings with Hybrid Attention
-
- Natural Synergy: Rotary and learned sinusoidal embeddings enhance positional encoding, which can be crucial for hybrid attention mechanisms that need to maintain the order of information while attending to both local and global contexts.
-
- Integration Idea: Apply rotary and learned sinusoidal embeddings within the hybrid attention layers to improve positional awareness and ensure the model accurately captures the order and structure of the input data.
-
- Enhancements to Rotary Embeddings
- Rotary Embeddings: We incorporated rotary embeddings to encode positional information into the queries and keys, helping the model capture the relative positions of tokens within a sequence.
-
- Rotational Dynamics Layer: We integrated a custom rotation layer that applies orthogonal transformations to the embeddings. This helps reduce interference between sensory inputs and memory representations by rotating the embeddings, which enhances the model’s ability to process and differentiate sequential data.
-
- Dynamic Integration of Attention Mechanisms: Introduced a mechanism that dynamically weights and integrates the outputs of multiple attention mechanisms based on their relevance and performance. This ensures that the model leverages the most effective attention mechanism for a given input, leading to more contextually appropriate and robust processing.
 
+ Dynamic Base Adjustment
+ Self-Adjusting Parameters: The model dynamically adjusts the rotary embedding base parameter in response to training loss, optimizing positional embeddings in real time. This adaptive mechanism lets the model fine-tune itself during training, which can improve performance and efficiency.
 
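A minimal sketch of what loss-driven base adjustment could look like (PyTorch; the class name, update rule, and adjustment factor are illustrative assumptions, not the repository's actual code):

```python
import torch
import torch.nn as nn

class AdjustableRotaryBase(nn.Module):
    """Rotary-style frequencies whose base can be nudged from the training loss."""
    def __init__(self, head_dim: int, base: float = 10000.0):
        super().__init__()
        self.head_dim = head_dim
        self.base = base
        self.register_buffer("inv_freq", self._inv_freq(base))
        self.last_loss = None

    def _inv_freq(self, base: float) -> torch.Tensor:
        return 1.0 / (base ** (torch.arange(0, self.head_dim, 2).float() / self.head_dim))

    def adjust_base(self, loss: float, factor: float = 1.05) -> None:
        # Hypothetical rule: grow the base while the loss is falling, shrink it otherwise.
        if self.last_loss is not None:
            self.base = self.base * factor if loss < self.last_loss else self.base / factor
            self.inv_freq = self._inv_freq(self.base).to(self.inv_freq.device)
        self.last_loss = loss

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        # Angles of shape (seq_len, head_dim // 2), later turned into sin/cos rotations.
        return torch.outer(positions.float(), self.inv_freq)
```

In a training loop one might call `adjust_base(loss.item())` once per step; the direction and size of the update above are placeholders.
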
+ RotaryEmbeddingWithRotation
+ Orthogonally Initialized Rotation Matrix: This component combines rotary embeddings with an orthogonally initialized rotation matrix, providing robust and stable positional embeddings and improving the model's capacity to represent positional information.
 
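A rough sketch of how rotary encoding and an orthogonally initialized, learned rotation might be combined (names and shapes here are assumptions, not the actual implementation):

```python
import torch
import torch.nn as nn

class RotaryWithRotation(nn.Module):
    def __init__(self, head_dim: int, base: float = 10000.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self.register_buffer("inv_freq", inv_freq)
        # Learned rotation matrix, initialized orthogonal so it starts norm-preserving.
        self.rotation = nn.Parameter(torch.empty(head_dim, head_dim))
        nn.init.orthogonal_(self.rotation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, head_dim)
        seq_len = x.shape[1]
        angles = torch.outer(torch.arange(seq_len, device=x.device).float(), self.inv_freq)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]
        rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
        # Apply the learned orthogonal rotation on top of the rotary encoding.
        return rotated @ self.rotation
```
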
+ LearnedSinusoidalEmbeddings
+ Learned Sinusoidal Embeddings with Checkpointing: Integrates sinusoidal embeddings with optional gradient checkpointing to manage memory during training, while L2 normalization keeps embedding magnitudes stable.
 
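One plausible shape for such a module, assuming PyTorch's `torch.utils.checkpoint` for the optional checkpointing and `F.normalize` for the L2 step (illustrative only, not the repository's code):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

class LearnedSinusoidal(nn.Module):
    def __init__(self, max_positions: int, dim: int, use_checkpoint: bool = False):
        super().__init__()
        self.use_checkpoint = use_checkpoint
        position = torch.arange(max_positions).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
        weight = torch.zeros(max_positions, dim)
        weight[:, 0::2] = torch.sin(position * div_term)
        weight[:, 1::2] = torch.cos(position * div_term)
        self.embedding = nn.Embedding(max_positions, dim)
        self.embedding.weight.data.copy_(weight)  # sinusoidal init, then trained

    def _embed(self, positions: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.embedding(positions), p=2, dim=-1)  # L2-normalize

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        if self.use_checkpoint and self.training:
            return checkpoint(self._embed, positions, use_reentrant=False)
        return self._embed(positions)
```
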
+ MultiHeadAttention
+ Dynamic Positional Bias: Supports rotary embeddings and includes a relative positional bias for capturing dependencies between positions. The attention mechanism is tuned with a dynamically adjustable base parameter.
 
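A condensed sketch of multi-head attention with a learned relative positional bias added to the scores (rotary application and caching omitted; all names here are assumptions):

```python
import torch
import torch.nn as nn

class BiasedMultiHeadAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int, max_rel_dist: int = 128):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.max_rel_dist = max_rel_dist
        # One learned bias per head per clipped relative distance.
        self.rel_bias = nn.Embedding(2 * max_rel_dist - 1, n_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, s, self.n_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5            # (b, h, s, s)
        pos = torch.arange(s, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist + 1, self.max_rel_dist - 1)
        bias = self.rel_bias(rel + self.max_rel_dist - 1).permute(2, 0, 1)  # (h, s, s)
        attn = (scores + bias).softmax(dim=-1)
        return self.out((attn @ v).transpose(1, 2).reshape(b, s, d))
```
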
+ HybridAttention
+ Combining Local and Global Attention: This component leverages both local and global attention mechanisms, ensuring that the model captures both fine-grained and broad context. The sliding-window approach for local attention enhances its ability to process long sequences efficiently.
 
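A simplified sketch of the local/global split using a banded mask for the sliding window (the window size and the averaging of the two paths are assumptions):

```python
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int, window: int = 64):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, _ = x.shape
        # Band mask: each position only attends within +/- window on the local path.
        idx = torch.arange(s, device=x.device)
        local_mask = (idx[None, :] - idx[:, None]).abs() > self.window  # True = blocked
        local_out, _ = self.local_attn(x, x, x, attn_mask=local_mask)
        global_out, _ = self.global_attn(x, x, x)
        return 0.5 * (local_out + global_out)
```
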
+ DynamicConvAttention
+ Integrating Convolution and Attention: This component enriches feature representation by combining convolutional layers with attention mechanisms, enabling the model to extract local context while attending to global information simultaneously.
 
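An illustrative sketch of fusing a depthwise convolution (local context) with standard attention (global context); the fusion by concatenation and projection is an assumption:

```python
import torch
import torch.nn as nn

class DynamicConvAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        local = self.conv(x.transpose(1, 2)).transpose(1, 2)    # local features via conv
        global_, _ = self.attn(x, x, x)                         # global features via attention
        return self.proj(torch.cat([local, global_], dim=-1))   # fuse both views
```
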
+ Model Components
+ LayerNorm: Custom normalization with gamma and beta parameters.

+ Linear: Custom linear layer with batch normalization and various activation functions.

+ Conv1d: Custom 1D convolution layer with Kaiming initialization.

+ RotaryEmbeddingWithRotation: Orthogonally initialized rotary embeddings with dynamic base adjustment.

+ LearnedSinusoidalEmbeddings: Sinusoidal embeddings with optional checkpointing and L2 normalization.

+ MultiHeadAttention: Dynamic positional bias with rotary embeddings and optional caching.

+ ResidualAttentionBlock: Integrates self- and cross-attention with a GELU-activated MLP (see the wiring sketch after this list).

+ AudioEncoder: Convolutional layers with learned sinusoidal embeddings and rotary embeddings.

+ TextDecoder: Token embeddings with rotary embeddings and cross-attention.

+ DynamicConvAttention: Combines convolution and attention for enriched feature extraction.

+ HybridAttention: Merges local and global attention mechanisms using sliding-window and multi-head attention.
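
To show how these components might fit together, here is a hedged sketch of a residual block with self-attention, optional cross-attention, and a GELU MLP, using stock PyTorch layers in place of the custom ones listed above:

```python
from typing import Optional

import torch
import torch.nn as nn

class ResidualAttentionBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ln_2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ln_3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, xa: Optional[torch.Tensor] = None) -> torch.Tensor:
        h = self.ln_1(x)
        x = x + self.self_attn(h, h, h)[0]                    # self-attention with residual
        if xa is not None:                                    # cross-attend to encoder states
            x = x + self.cross_attn(self.ln_2(x), xa, xa)[0]
        return x + self.mlp(self.ln_3(x))                     # GELU MLP with residual
```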