Update README.md
README.md CHANGED
@@ -45,6 +45,14 @@ outputs = model.generate(**input_ids, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```

+To load a different checkpoint, e.g. for iteration 2500, use:
+
+```python
+model = AutoModelForCausalLM.from_pretrained("Zyphra/Zamba-7B-v1", device_map="auto", torch_dtype=torch.bfloat16, revision="iter2500")
+```
+
+The default revision is the fully trained model, corresponding to iteration 25156. This is the number of training iterations performed starting from Zamba phase 1 ([Zyphra/Zamba-7B-v1-phase1](https://huggingface.co/Zyphra/Zamba-7B-v1-phase1)). See [arXiv:2405.16712](https://arxiv.org/abs/2405.16712) for more details on training.
+
## Model Details

Zamba utilizes a unique hybrid SSM architecture, consisting of a backbone of Mamba layers interspersed with a shared attention layer. This attention layer has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings to the input of this attention block improves performance, likely due to better maintenance of information across depth.
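Below is a minimal, illustrative PyTorch sketch of the shared-attention pattern described in the Model Details paragraph above: a single attention block whose weights are reused at several points along a Mamba-style backbone, with the original token embeddings concatenated to its input before a down-projection. This is not Zamba's actual implementation; the `MambaBlockStub`, `SharedAttentionBlock`, and `TinyZambaLikeBackbone` names, the layer counts, hidden size, and placement interval are all assumptions made for illustration.

```python
# Illustrative sketch only: a backbone of Mamba-style blocks with ONE attention
# block whose weights are shared every time it is applied, and whose input is
# the concatenation [hidden_states ; original_embeddings].
import torch
import torch.nn as nn


class MambaBlockStub(nn.Module):
    """Placeholder for a real Mamba (SSM) block; here just a gated MLP."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj_in = nn.Linear(d_model, 2 * d_model)
        self.proj_out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, gate = self.proj_in(self.norm(x)).chunk(2, dim=-1)
        return x + self.proj_out(h * torch.sigmoid(gate))


class SharedAttentionBlock(nn.Module):
    """One attention block reused at every application point (shared weights).

    Its input is the current hidden states concatenated with the original
    token embeddings, projected back down to the model width.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.concat_proj = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, hidden: torch.Tensor, embeds: torch.Tensor) -> torch.Tensor:
        x = self.concat_proj(torch.cat([hidden, embeds], dim=-1))
        x = self.norm(x)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        return hidden + attn_out


class TinyZambaLikeBackbone(nn.Module):
    """Toy backbone: Mamba-style layers with the shared attention block
    applied every `attn_every` layers (sizes are arbitrary)."""

    def __init__(self, vocab_size=256, d_model=64, n_heads=4,
                 n_mamba_layers=6, attn_every=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.mamba_layers = nn.ModuleList(
            [MambaBlockStub(d_model) for _ in range(n_mamba_layers)]
        )
        # A single module instance, so its parameters are shared across every
        # point in the stack where attention is applied.
        self.shared_attn = SharedAttentionBlock(d_model, n_heads)
        self.attn_every = attn_every

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        embeds = self.embed(input_ids)  # kept around for later concatenation
        hidden = embeds
        for i, layer in enumerate(self.mamba_layers):
            hidden = layer(hidden)
            if (i + 1) % self.attn_every == 0:
                hidden = self.shared_attn(hidden, embeds)
        return hidden


if __name__ == "__main__":
    model = TinyZambaLikeBackbone()
    out = model(torch.randint(0, 256, (1, 16)))
    print(out.shape)  # torch.Size([1, 16, 64])
```

Because `shared_attn` is instantiated once and called at multiple depths, each application adds no new attention parameters, while the concatenated embeddings give the block direct access to the original token information at every depth.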