Update README.md
README.md
@@ -426,12 +426,25 @@ datasets:
- allenai/MADLAD-400
---

This model has the safetensors weights for the [Madlad-400](https://github.com/google-research/google-research/tree/master/madlad_400) 8B param **language model**.

The Python code to run inference is not ready yet.

The model architecture is the same as [PaLM 8B](https://arxiv.org/pdf/2204.02311.pdf).

It's a decoder-only T5 with 32 layers, 16 query heads, 1 KV head, and an embedding size of 4096.

These are the main differences relative to the original T5 architecture:

- SwiGLU Activation
- Parallel Layers
- Multi-Query Attention
- RoPE Embeddings
- Shared Input-Output Embeddings
- No biases
- Bidirectional attention
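To make the multi-query attention point concrete, here is a NumPy sketch (scaled-down dimensions; the function and variable names are illustrative, not the model's actual code). With 16 query heads sharing a single K/V head, the K/V projections and the KV cache are 16x smaller than in standard multi-head attention:

```python
# Minimal multi-query attention sketch (not the model's actual implementation).
# All 16 query heads attend over one shared K/V head, so the KV cache holds a
# single head's worth of keys/values instead of 16.
import numpy as np

def multi_query_attention(x, wq, wk, wv, n_heads=16):
    seq, d_model = x.shape
    head_dim = d_model // n_heads

    q = (x @ wq).reshape(seq, n_heads, head_dim)  # (seq, n_heads, head_dim)
    k = x @ wk                                    # (seq, head_dim): one shared head
    v = x @ wv                                    # (seq, head_dim)

    # scores[h, s, t]: head h's query at position s against the shared key at t
    scores = np.einsum("shd,td->hst", q, k) / np.sqrt(head_dim)
    # causal mask: position s may only attend to positions t <= s
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    out = np.einsum("hst,td->shd", probs, v)      # every head reads the same V
    return out.reshape(seq, d_model)

rng = np.random.default_rng(0)
d_model, n_heads = 64, 16
head_dim = d_model // n_heads
x = rng.standard_normal((10, d_model))
wq = rng.standard_normal((d_model, d_model)) * 0.1
wk = rng.standard_normal((d_model, head_dim)) * 0.1  # 16x smaller than multi-head
wv = rng.standard_normal((d_model, head_dim)) * 0.1

y = multi_query_attention(x, wq, wk, wv, n_heads)
print(y.shape)  # (10, 64)
```

At inference time this is what makes the difference: the cache grows as `seq * head_dim` per layer rather than `seq * n_heads * head_dim`.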

If you are looking for the machine translation models, here are the available versions:
- [3B](https://huggingface.co/jbochi/madlad400-3b-mt)
- [7B](https://huggingface.co/jbochi/madlad400-7b-mt)
- [7B-BT](https://huggingface.co/jbochi/madlad400-7b-mt-bt)