Zyphra/Zamba2-1.2B

transformers_zamba2


pglo committed on Aug 26, 2024

Commit eee85e1 · verified · 1 Parent(s): d9a6349

Update README.md

Files changed (1)

README.md +4 -2


@@ -12,11 +12,13 @@ Zamba2-1.2B is a hybrid model composed of state-space and transformer blocks. It
 3.) We utilize rotary position embeddings in the shared attention layer.
-Zamba2-1.2B differs from our [2.7B model](https://huggingface.co/Zyphra/Zamba2-2.7B) in two ways:
+Zamba2-1.2B differs from our [2.7B model](https://huggingface.co/Zyphra/Zamba2-2.7B) in three ways:
 1.) Rotary position embeddings
-2.) No alternating shared attention blocks
+2.) No alternating shared transformer blocks
+3.) Added LoRA projectors to attention layers
 We found that while hybrid SSM-transformer models are perfectly capable of performing well without position embeddings, adding rotary embeddings to the shared attention block slightly improved performance. Secondly, we utilize a single attention block instead of alternating because this enables a higher flop count for the model at a given parameter budget and at smaller scales this becomes more important than the slightly faster latency.
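
The parameter-budget argument in the paragraph above can be illustrated with a little arithmetic. The sketch below is not taken from the Zamba2 code: the function name, the layer width, the number of call sites, and the LoRA rank are all hypothetical, and it assumes each invocation of the shared block gets its own low-rank pair, which the README excerpt does not spell out. It only shows that a weight matrix reused at several depths is paid for once in parameters but every time in compute, and that per-call LoRA projectors add comparatively few parameters.

```python
def shared_linear_cost(d_in, d_out, num_calls, lora_rank=0):
    """Return (parameters, multiply-accumulates per token) for one weight
    matrix reused at `num_calls` depths, with an optional separate LoRA pair
    (A: d_in x r, B: r x d_out) at each call site.

    Illustrative accounting only; all names and sizes are assumptions.
    """
    shared_params = d_in * d_out                      # stored once
    lora_per_call = lora_rank * (d_in + d_out)        # stored per call site
    params = shared_params + num_calls * lora_per_call
    # Every call runs the full shared matmul plus its own low-rank path.
    macs = num_calls * (shared_params + lora_per_call)
    return params, macs


# Hypothetical sizes, chosen only to make the comparison concrete.
d, calls = 2048, 6

unshared = (calls * d * d, calls * d * d)             # a fresh matrix per call
shared = shared_linear_cost(d, d, calls)              # one matrix, reused
shared_lora = shared_linear_cost(d, d, calls, lora_rank=128)

for name, (p, m) in [("unshared", unshared),
                     ("shared", shared),
                     ("shared + LoRA", shared_lora)]:
    print(f"{name:14s} params={p:>10,d}  MACs/token={m:>10,d}")
```

With these made-up numbers, sharing keeps the per-token compute of six separate matrices while storing only one, and the per-call LoRA pairs recover some depth-specific capacity for well under the cost of unsharing.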