Upload Echo

- README.md +21 -80
- config.json +61 -0
- generation_config.json +12 -0
- model.safetensors +3 -0

README.md
CHANGED
@@ -1,38 +1,10 @@
+---
+{}
+---
-
-
-
-config = WhisperConfig(
-    n_mels=80,
-    n_audio_ctx=1500,
-    n_audio_state=1024,
-    n_audio_head=16,
-    n_audio_layer=24,
-    vocab_size=51865,
-    n_text_ctx=448,
-    n_text_state=1024,
-    n_text_head=16,
-    n_text_layer=20,
-    max_rel_dist=15,
-    cross_attention=True,
-    checkpointing=True,
-    base=10000,
-    bos_token_id=50257,
-    eos_token_id=50257,
-    pad_token_id=50257,
-    decoder_start_token_id=50258,
-    is_encoder_decoder=True,
-    init_std=0.02,
-)
-
-model = Echo(config).to(device)
-""
-
-
-
Dynamic Base Adjustment
Self-Adjusting Parameters: The model dynamically adjusts the base parameter in response to training loss, optimizing positional embeddings in real time. This adaptive mechanism lets the model tune itself during training, improving performance and efficiency.

-
+RotaryEmbeddingWithRotation
Orthogonally Initialized Rotation Matrix: This component combines rotary embeddings with an orthogonally initialized rotation matrix, providing robust and stable positional embeddings. This approach enhances the model's capacity to represent positional information effectively.

LearnedSinusoidalEmbeddings
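The two components above, dynamic base adjustment and the orthogonally initialized rotation matrix, are described only in prose on this page, and the constructor math removed in the hunk below spells out the initialization. Here is a minimal PyTorch sketch of how they could fit together; the class name `RotaryEmbeddingWithRotation` comes from the README, but every method signature and the base-update rule are assumptions, since this commit ships no modeling code:

```python
import torch
import torch.nn as nn

class RotaryEmbeddingWithRotation(nn.Module):
    """Hypothetical sketch following the removed README formulas."""

    def __init__(self, n_state, n_head, num_rotations, base=10000):
        super().__init__()
        self.h_dim = n_state // n_head                              # h_dim = n_state / n_head
        self.base = base
        self.num_rotations = num_rotations
        self.thetas = nn.Parameter(torch.zeros(num_rotations))     # thetas = 0
        self.rotation_pairs = nn.Parameter(torch.rand(num_rotations, 2) * self.h_dim)
        self.register_buffer("rotation_matrix", torch.eye(self.h_dim))  # identity init

    def givens(self, i, j, theta):
        # Givens rotation: identity except G[i,i]=cos, G[i,j]=-sin, G[j,i]=sin, G[j,j]=cos.
        G = torch.eye(self.h_dim, device=theta.device)
        G[i, i] = torch.cos(theta); G[i, j] = -torch.sin(theta)
        G[j, i] = torch.sin(theta); G[j, j] = torch.cos(theta)
        return G

    def adjust_base(self, new_base):
        # "Dynamic base adjustment": the training loop is assumed to call this with a
        # base derived from the loss; the exact schedule is not published here.
        self.base = new_base

    def forward(self, x):
        # x: (..., h_dim). Accumulate the product of learned Givens rotations,
        # apply the orthogonal rotation matrix, and return rotary frequencies.
        R = torch.eye(self.h_dim, device=x.device)
        for k in range(self.num_rotations):
            i, j = (self.rotation_pairs[k].long() % self.h_dim).tolist()
            if i != j:
                R = R @ self.givens(i, j, self.thetas[k])
        x = x @ R @ self.rotation_matrix            # x · (∏ G_k) · rotation matrix
        inv_freq = 1.0 / self.base ** (
            torch.arange(0, self.h_dim, 2, device=x.device).float() / self.h_dim
        )
        return x, inv_freq
```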
@@ -41,63 +13,32 @@ Learned Sinusoidal Embeddings with Checkpointing: This unique integration of sin
MultiHeadAttention
Dynamic Positional Bias: Supports rotary embeddings and includes relative positional bias, capturing dependencies effectively. The attention mechanism is finely tuned with a dynamically adjustable base parameter, providing flexibility and precision.

HybridAttention
Combining Local and Global Attention: This component leverages both local and global attention mechanisms, ensuring that the model captures both fine-grained and broad context. The sliding window approach for local attention enhances its ability to process long sequences efficiently.

DynamicConvAttention
Integrating Convolution and Attention: This component enriches feature representation by combining convolutional layers with attention mechanisms, enabling the model to extract local context while attending to global information simultaneously.

-
+Model Components
LayerNorm: Custom normalization with gamma and beta parameters.

Linear: Custom linear layer with batch normalization and various activation functions.

Conv1d: Custom 1D convolution layer with Kaiming initialization.

-
-$$
-n_{\text{state}},\ n_{\text{head}},\ \text{num\_rotations},\ \text{base}=10000,\ \text{checkpointing}=\text{False}
-$$
-
-The hidden dimension \( h_{\text{dim}} \) is calculated as:
-$$
-h_{\text{dim}} = \frac{n_{\text{state}}}{n_{\text{head}}}
-$$
-
-The parameters \texttt{thetas} and \texttt{rotation\_pairs} are initialized as:
-$$
-\texttt{thetas} = \mathbf{0}
-$$
-$$
-\texttt{rotation\_pairs} = \text{rand}(\text{num\_rotations}, 2) \times h_{\text{dim}}
-$$
-
-The rotation matrix is an identity matrix:
-$$
-\texttt{rotation\_matrix} = \mathbf{I}_{h_{\text{dim}}}
-$$
-
-The inverse frequency is computed as:
-$$
-\texttt{inv\_freq}_k = \frac{1}{\text{base}^{2k / h_{\text{dim}}}}, \quad k = 0, 1, \ldots, \tfrac{h_{\text{dim}}}{2} - 1
-$$
-
-The Givens rotation matrix \( G \) is defined as:
-$$
-G = \mathbf{I}_{n_{\text{state}}}, \quad G_{ii} = \cos(\theta), \quad G_{ij} = -\sin(\theta), \quad G_{ji} = \sin(\theta), \quad G_{jj} = \cos(\theta)
-$$
-
-The rotary orthogonal matrix \( R \) used in the forward pass is computed as:
-$$
-R = \texttt{rotation\_matrix} \cdot G
-$$
-
-$$
-\mathbf{x}_{\text{transformed}} = \mathbf{x} \cdot \left( \prod_{k=1}^{N} G_k \right) \cdot R
-$$
+RotaryEmbeddingWithRotation: Orthogonally initialized rotary embeddings with dynamic base adjustment.
+
+LearnedSinusoidalEmbeddings: Sinusoidal embeddings with optional checkpointing and L2 normalization.
+
+MultiHeadAttention: Dynamic positional bias with rotary embeddings and optional caching.
+
+ResidualAttentionBlock: Integrates self and cross-attention with GELU-activated MLP.
+
+AudioEncoder: Convolutional layers with learned sinusoidal embeddings and rotary embeddings.
+
+TextDecoder: Token embeddings with rotary embeddings and cross-attention.
+
+DynamicConvAttention: Combines convolution and attention for enriched feature extraction.
+
+HybridAttention: Merges local and global attention mechanisms using sliding window and multi-head attention.
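HybridAttention is described in the diff above but, like the other components, not implemented in this commit. A short sketch of the idea, a sliding-window local pass merged with a full-sequence global pass; the window size and the equal-weight merge are assumptions:

```python
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    """Hypothetical sketch: local sliding-window attention plus global attention."""

    def __init__(self, n_state, n_head, window_size=128):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(n_state, n_head, batch_first=True)
        self.global_attn = nn.MultiheadAttention(n_state, n_head, batch_first=True)
        self.window_size = window_size

    def forward(self, x):
        # Local pass: each window attends only within itself (fine-grained context).
        B, T, C = x.shape
        local_out = torch.zeros_like(x)
        for start in range(0, T, self.window_size):
            end = min(start + self.window_size, T)
            win = x[:, start:end]
            local_out[:, start:end], _ = self.local_attn(win, win, win)
        # Global pass: full-sequence attention (broad context).
        global_out, _ = self.global_attn(x, x, x)
        # Equal-weight merge of the two views (the actual mixing rule is unknown).
        return 0.5 * (local_out + global_out)
```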
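DynamicConvAttention gets the same treatment: a sketch of a depthwise 1-D convolution for local context fused with multi-head attention for global context. The additive fusion, kernel size, and normalization are assumptions, not taken from the commit:

```python
import torch.nn as nn

class DynamicConvAttention(nn.Module):
    """Hypothetical sketch: convolution for local features, attention for global."""

    def __init__(self, n_state, n_head, kernel_size=3):
        super().__init__()
        # Depthwise convolution extracts local context per channel.
        self.conv = nn.Conv1d(n_state, n_state, kernel_size,
                              padding=kernel_size // 2, groups=n_state)
        self.attn = nn.MultiheadAttention(n_state, n_head, batch_first=True)
        self.norm = nn.LayerNorm(n_state)

    def forward(self, x):                                      # x: (batch, time, n_state)
        local = self.conv(x.transpose(1, 2)).transpose(1, 2)   # local features
        global_, _ = self.attn(x, x, x)                        # global features
        return self.norm(x + local + global_)                  # fused representation
```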
config.json
ADDED
@@ -0,0 +1,61 @@
+{
+  "activation_dropout": 0.0,
+  "activation_function": "gelu",
+  "apply_spec_augment": false,
+  "architectures": [
+    "Echo"
+  ],
+  "attention_dropout": 0.0,
+  "base": 10000,
+  "begin_suppress_tokens": [
+    220,
+    50256
+  ],
+  "bos_token_id": 50257,
+  "checkpointing": true,
+  "classifier_proj_size": 256,
+  "cross_attention": true,
+  "d_model": 384,
+  "decoder_attention_heads": 6,
+  "decoder_ffn_dim": 1536,
+  "decoder_layerdrop": 0.0,
+  "decoder_layers": 4,
+  "decoder_start_token_id": 50258,
+  "dropout": 0.0,
+  "encoder_attention_heads": 6,
+  "encoder_ffn_dim": 1536,
+  "encoder_layerdrop": 0.0,
+  "encoder_layers": 4,
+  "eos_token_id": 50257,
+  "init_std": 0.02,
+  "is_encoder_decoder": true,
+  "mask_feature_length": 10,
+  "mask_feature_min_masks": 0,
+  "mask_feature_prob": 0.0,
+  "mask_time_length": 10,
+  "mask_time_min_masks": 2,
+  "mask_time_prob": 0.05,
+  "max_rel_dist": 15,
+  "max_source_positions": 1500,
+  "max_target_positions": 448,
+  "median_filter_width": 7,
+  "model_type": "whisper",
+  "n_audio_ctx": 1500,
+  "n_audio_head": 16,
+  "n_audio_layer": 24,
+  "n_audio_state": 1024,
+  "n_mels": 80,
+  "n_text_ctx": 448,
+  "n_text_head": 16,
+  "n_text_layer": 20,
+  "n_text_state": 1024,
+  "num_hidden_layers": 4,
+  "num_mel_bins": 80,
+  "pad_token_id": 50257,
+  "scale_embedding": false,
+  "torch_dtype": "float32",
+  "transformers_version": "4.47.0",
+  "use_cache": true,
+  "use_weighted_layer_sum": false,
+  "vocab_size": 51865
+}
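Note that this config mixes the stock Hugging Face Whisper fields (d_model, encoder_layers, ...) with the custom Echo fields from the removed README snippet (n_audio_state, max_rel_dist, base, ...). A quick way to inspect it locally; nothing here assumes Echo is registered with transformers' Auto classes:

```python
import json

with open("config.json") as f:
    cfg = json.load(f)

print(cfg["model_type"])                           # "whisper"
print(cfg["n_audio_state"], cfg["n_audio_layer"])  # custom Echo fields: 1024, 24
print(cfg["d_model"], cfg["encoder_layers"])       # stock HF Whisper fields: 384, 4
```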
generation_config.json
ADDED
@@ -0,0 +1,12 @@
+{
+  "_from_model_config": true,
+  "begin_suppress_tokens": [
+    220,
+    50256
+  ],
+  "bos_token_id": 50257,
+  "decoder_start_token_id": 50258,
+  "eos_token_id": 50257,
+  "pad_token_id": 50257,
+  "transformers_version": "4.47.0"
+}
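For readers unfamiliar with begin_suppress_tokens: it masks the listed token ids at the first decoding step only, so generation cannot begin with them. A minimal illustration of that behavior (this is a sketch, not the transformers implementation):

```python
import torch

begin_suppress_tokens = [220, 50256]  # ids from generation_config.json

def suppress_at_begin(logits: torch.Tensor, step: int) -> torch.Tensor:
    # logits: (batch, vocab_size); only the first generated position is masked.
    if step == 0:
        logits[:, begin_suppress_tokens] = float("-inf")
    return logits
```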
model.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ead546823f5ff3feaccdc9e6d236f78a407f7954cbf4f383e164d84499b528d1
+size 3204676144
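The three lines above are a Git LFS pointer, not the weights themselves; the actual file is about 3.2 GB (3,204,676,144 bytes). Once pulled with `git lfs pull`, it can be inspected with the safetensors library; tensor names depend on the Echo modeling code, which this commit does not include:

```python
from safetensors.torch import load_file

state_dict = load_file("model.safetensors")  # dict[str, torch.Tensor]
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))         # peek at the first few tensors
```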