Sin2pi committed on
Commit a29082e · verified · 1 Parent(s): 2e3a4de

Upload Echo

Files changed (4)
  1. README.md +21 -80
  2. config.json +61 -0
  3. generation_config.json +12 -0
  4. model.safetensors +3 -0
README.md CHANGED
@@ -1,38 +1,10 @@
- Untrained whisper-like model (Echo) with code and trainer (hf). ASR model for experimentation. Working. Initialized model is "medium":
-
- ""
- config = WhisperConfig(
-     n_mels=80,
-     n_audio_ctx=1500,
-     n_audio_state=1024,
-     n_audio_head=16,
-     n_audio_layer=24,
-     vocab_size=51865,
-     n_text_ctx=448,
-     n_text_state=1024,
-     n_text_head=16,
-     n_text_layer=20,
-     max_rel_dist=15,
-     cross_attention=True,
-     checkpointing=True,
-     base=10000,
-     bos_token_id=50257,
-     eos_token_id=50257,
-     pad_token_id=50257,
-     decoder_start_token_id=50258,
-     is_encoder_decoder=True,
-     init_std=0.02,
- )
-
- model = Echo(config).to(device)
- ""
-
-
-
  Dynamic Base Adjustment
  Self-Adjusting Parameters: The model dynamically adjusts the base parameter in response to training loss, optimizing positional embeddings in real-time. This adaptive mechanism enhances the model's ability to fine-tune itself during training, ensuring better performance and efficiency.
 
- Rotary embedding with ortho-rotation matrix + Givens rotation - Block
  Orthogonally Initialized Rotation Matrix: This component combines rotary embeddings with an orthogonally initialized rotation matrix, providing robust and stable positional embeddings. This novel approach enhances the model’s capacity to represent positional information effectively.
 
  LearnedSinusoidalEmbeddings
@@ -41,63 +13,32 @@ Learned Sinusoidal Embeddings with Checkpointing: This unique integration of sin
  MultiHeadAttention
  Dynamic Positional Bias: Supports rotary embeddings and includes relative positional bias, capturing dependencies effectively. The attention mechanism is finely tuned with a dynamically adjustable base parameter, providing flexibility and precision.
 
- HybridAttention - not included in initialized model
  Combining Local and Global Attention: This component leverages both local and global attention mechanisms, ensuring that the model captures both fine-grained and broad context. The sliding window approach for local attention enhances its ability to process long sequences efficiently.
 
- DynamicConvAttention - not included in initialized model
  Integrating Convolution and Attention: This component enriches feature representation by combining convolutional layers with attention mechanisms, enabling the model to extract local context while attending to global information simultaneously.
 
-
  LayerNorm: Custom normalization with gamma and beta parameters.
 
  Linear: Custom linear layer with batch normalization and various activation functions.
 
  Conv1d: Custom 1D convolution layer with Kaiming initialization.
 
- Part of rotation block: We initialize the model with:
- $$
- n_{\text{state}}, n_{\text{head}}, \text{num\_rotations}, \text{base}=10000, \text{checkpointing}=\text{False}
- $$
-
- The hidden dimension \( \text{h\_dim} \) is calculated as:
- $$
- \text{h\_dim} = \frac{n_{\text{state}}}{n_{\text{head}}}
- $$
-
- The parameters \texttt{thetas} and \texttt{rotation\_pairs} are initialized as:
- $$
- \texttt{thetas} = \mathbf{0}
- $$
- $$
- \texttt{rotation\_pairs} = \text{rand}(\text{num\_rotations}, 2) \times \text{h\_dim}
- $$
-
- The rotation matrix is an identity matrix:
- $$
- \texttt{rotation\_matrix} = \mathbf{I}_{\text{h\_dim}}
- $$
-
- The inverse frequency is computed as:
- $$
- \texttt{inv\_freq} = \frac{1.0}{\text{base}^{\frac{\text{torch.arange}(0, \text{h\_dim}, 2)}{\text{h\_dim}}}}
- $$
-
- The Givens rotation matrix \( G \) is defined as:
- $$
- G = \mathbf{I}_{n_{\text{state}}}
- $$
- $$
- G_{ii} = \cos(\theta), \quad G_{ij} = -\sin(\theta)
- $$
- $$
- G_{ji} = \sin(\theta), \quad G_{jj} = \cos(\theta)
- $$
-
- The rotary orthogonal matrix \( R \) used in the forward pass is computed as:
- $$
- R = \text{rotation\_matrix} \cdot G
- $$
-
- $$ \mathbf{x}_{\text{transformed}} = \mathbf{x} \cdot \left( \prod_{k=1}^{N} G_k \right) \cdot R $$
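
The removed formulas above describe how the rotation block is constructed. Below is a minimal PyTorch sketch, assuming an illustrative class name (GivensRotationBlock) and sizing the Givens matrix at h_dim x h_dim so that its product with rotation_matrix is defined; the repository's actual implementation may differ.

import torch
import torch.nn as nn

class GivensRotationBlock(nn.Module):
    # Illustrative sketch of the rotation block; not the repository's actual class.
    def __init__(self, n_state, n_head, num_rotations, base=10000):
        super().__init__()
        self.h_dim = n_state // n_head                           # per-head dimension
        self.thetas = nn.Parameter(torch.zeros(num_rotations))   # rotation angles, initialized to 0
        # random (i, j) index pairs drawn from the head dimension
        pairs = (torch.rand(num_rotations, 2) * self.h_dim).long()
        self.register_buffer("rotation_pairs", pairs)
        # orthogonal rotation matrix, initialized to the identity
        self.rotation_matrix = nn.Parameter(torch.eye(self.h_dim))
        # inv_freq = 1 / base^(arange(0, h_dim, 2) / h_dim), as in the formula above
        inv_freq = 1.0 / (base ** (torch.arange(0, self.h_dim, 2).float() / self.h_dim))
        self.register_buffer("inv_freq", inv_freq)

    def givens(self, i, j, theta):
        # Givens rotation: identity except G[i,i] = G[j,j] = cos, G[i,j] = -sin, G[j,i] = sin
        G = torch.eye(self.h_dim, dtype=theta.dtype, device=theta.device)
        G[i, i] = torch.cos(theta)
        G[j, j] = torch.cos(theta)
        G[i, j] = -torch.sin(theta)
        G[j, i] = torch.sin(theta)
        return G

    def forward(self, x):
        # x: (..., h_dim). Apply the product of Givens rotations, then the learned
        # orthogonal matrix (a simplified reading of x · (prod_k G_k) · R above).
        for k in range(self.thetas.shape[0]):
            i, j = self.rotation_pairs[k].tolist()
            x = x @ self.givens(i, j, self.thetas[k])
        return x @ self.rotation_matrix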
 
 
+ ---
+ {}
+ ---
  Dynamic Base Adjustment
  Self-Adjusting Parameters: The model dynamically adjusts the base parameter in response to training loss, optimizing positional embeddings in real-time. This adaptive mechanism enhances the model's ability to fine-tune itself during training, ensuring better performance and efficiency.
 
+ RotaryEmbeddingWithRotation
  Orthogonally Initialized Rotation Matrix: This component combines rotary embeddings with an orthogonally initialized rotation matrix, providing robust and stable positional embeddings. This novel approach enhances the model’s capacity to represent positional information effectively.
 
  LearnedSinusoidalEmbeddings
 
  MultiHeadAttention
  Dynamic Positional Bias: Supports rotary embeddings and includes relative positional bias, capturing dependencies effectively. The attention mechanism is finely tuned with a dynamically adjustable base parameter, providing flexibility and precision.
 
+ HybridAttention
  Combining Local and Global Attention: This component leverages both local and global attention mechanisms, ensuring that the model captures both fine-grained and broad context. The sliding window approach for local attention enhances its ability to process long sequences efficiently.
 
+ DynamicConvAttention
  Integrating Convolution and Attention: This component enriches feature representation by combining convolutional layers with attention mechanisms, enabling the model to extract local context while attending to global information simultaneously.
 
+ Model Components
  LayerNorm: Custom normalization with gamma and beta parameters.
 
  Linear: Custom linear layer with batch normalization and various activation functions.
 
  Conv1d: Custom 1D convolution layer with Kaiming initialization.
 
+ RotaryEmbeddingWithRotation: Orthogonally initialized rotary embeddings with dynamic base adjustment.
+
+ LearnedSinusoidalEmbeddings: Sinusoidal embeddings with optional checkpointing and L2 normalization.
+
+ MultiHeadAttention: Dynamic positional bias with rotary embeddings and optional caching.
+
+ ResidualAttentionBlock: Integrates self and cross-attention with GELU-activated MLP.
+
+ AudioEncoder: Convolutional layers with learned sinusoidal embeddings and rotary embeddings.
+
+ TextDecoder: Token embeddings with rotary embeddings and cross-attention.
+
+ DynamicConvAttention: Combines convolution and attention for enriched feature extraction.
 
+ HybridAttention: Merges local and global attention mechanisms using sliding window and multi-head attention.
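
As a rough illustration of the Dynamic Base Adjustment behaviour listed above, a loss-driven update of the rotary base might look like the sketch below. The function name, the update heuristic, and the attribute names (base, h_dim, inv_freq) are assumptions for illustration, not the repository's code.

import torch

def adjust_base(rotary, loss, prev_loss, factor=1.05):
    # Hypothetical heuristic: raise the rotary base when the loss improves,
    # lower it otherwise, then recompute inv_freq with the same formula the
    # rotation block uses (1 / base^(arange(0, h_dim, 2) / h_dim)).
    rotary.base = rotary.base * factor if loss < prev_loss else rotary.base / factor
    steps = torch.arange(0, rotary.h_dim, 2).float()
    rotary.inv_freq = 1.0 / (rotary.base ** (steps / rotary.h_dim))
    return rotary.base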
 
config.json ADDED
@@ -0,0 +1,61 @@
+ {
+ "activation_dropout": 0.0,
+ "activation_function": "gelu",
+ "apply_spec_augment": false,
+ "architectures": [
+ "Echo"
+ ],
+ "attention_dropout": 0.0,
+ "base": 10000,
+ "begin_suppress_tokens": [
+ 220,
+ 50256
+ ],
+ "bos_token_id": 50257,
+ "checkpointing": true,
+ "classifier_proj_size": 256,
+ "cross_attention": true,
+ "d_model": 384,
+ "decoder_attention_heads": 6,
+ "decoder_ffn_dim": 1536,
+ "decoder_layerdrop": 0.0,
+ "decoder_layers": 4,
+ "decoder_start_token_id": 50258,
+ "dropout": 0.0,
+ "encoder_attention_heads": 6,
+ "encoder_ffn_dim": 1536,
+ "encoder_layerdrop": 0.0,
+ "encoder_layers": 4,
+ "eos_token_id": 50257,
+ "init_std": 0.02,
+ "is_encoder_decoder": true,
+ "mask_feature_length": 10,
+ "mask_feature_min_masks": 0,
+ "mask_feature_prob": 0.0,
+ "mask_time_length": 10,
+ "mask_time_min_masks": 2,
+ "mask_time_prob": 0.05,
+ "max_rel_dist": 15,
+ "max_source_positions": 1500,
+ "max_target_positions": 448,
+ "median_filter_width": 7,
+ "model_type": "whisper",
+ "n_audio_ctx": 1500,
+ "n_audio_head": 16,
+ "n_audio_layer": 24,
+ "n_audio_state": 1024,
+ "n_mels": 80,
+ "n_text_ctx": 448,
+ "n_text_head": 16,
+ "n_text_layer": 20,
+ "n_text_state": 1024,
+ "num_hidden_layers": 4,
+ "num_mel_bins": 80,
+ "pad_token_id": 50257,
+ "scale_embedding": false,
+ "torch_dtype": "float32",
+ "transformers_version": "4.47.0",
+ "use_cache": true,
+ "use_weighted_layer_sum": false,
+ "vocab_size": 51865
+ }
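
Since config.json keeps model_type set to "whisper", it can be read back with the Hugging Face WhisperConfig class and handed to the repository's Echo class, mirroring the initialization shown in the old README. This is a hedged sketch only; the import path and the local path are placeholders, not part of this commit.

import torch
from transformers import WhisperConfig
# The Echo class lives in the repository's own code; this import path is a placeholder.
# from model import Echo

# "path/to/this/repo" is a placeholder for a local clone or the hub repo id.
config = WhisperConfig.from_pretrained("path/to/this/repo")
# Custom keys from config.json (n_audio_state, base, max_rel_dist, ...) ride along as attributes.
# model = Echo(config).to("cuda" if torch.cuda.is_available() else "cpu")
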
generation_config.json ADDED
@@ -0,0 +1,12 @@
+ {
+ "_from_model_config": true,
+ "begin_suppress_tokens": [
+ 220,
+ 50256
+ ],
+ "bos_token_id": 50257,
+ "decoder_start_token_id": 50258,
+ "eos_token_id": 50257,
+ "pad_token_id": 50257,
+ "transformers_version": "4.47.0"
+ }
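
For completeness, the generation settings above can be read back with transformers' GenerationConfig; a hedged example, with the path again a placeholder:

from transformers import GenerationConfig

# Reads back the generation_config.json added in this commit.
gen_config = GenerationConfig.from_pretrained("path/to/this/repo")
print(gen_config.decoder_start_token_id)  # 50258
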
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ead546823f5ff3feaccdc9e6d236f78a407f7954cbf4f383e164d84499b528d1
+ size 3204676144