---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- moe
- moah
- mod
datasets:
- Locutusque/UltraTextbooks
---
# Model Card for MoMv4-1.58bits
## Model Details
### Model Description
MoM: Mixture of Mixture

This model is a first test of combining the [Jamba](https://huggingface.co/ai21labs/Jamba-v0.1) architecture with mixture of attention heads (MoAH) and mixture of depths (MoD).

Only the attention layers are kept in bf16 precision; all other layers use 1.58-bit precision. That is 17M of the 1025M total parameters, or about 1.7%, in bf16.

The goal is to develop this kind of architecture and test whether it can deliver fast inference without too much quality loss.
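
The parameter split above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes that a 1.58-bit weight is a ternary value costing `log2(3)` bits under ideal packing, and that bf16 costs 16 bits. The constant and function names are made up for this example and do not come from the repository.

```python
import math

# Figures quoted in this model card (in millions of parameters).
BF16_PARAMS_M = 17
TOTAL_PARAMS_M = 1025

# Assumption: a "1.58-bit" weight is ternary, so it carries log2(3) bits.
TERNARY_BITS = math.log2(3)
BF16_BITS = 16


def bf16_share(bf16_params=BF16_PARAMS_M, total_params=TOTAL_PARAMS_M):
    """Fraction of parameters stored in bf16."""
    return bf16_params / total_params


def average_bits_per_param(bf16_params=BF16_PARAMS_M, total_params=TOTAL_PARAMS_M):
    """Average storage cost per parameter, assuming ideal ternary packing."""
    ternary_params = total_params - bf16_params
    total_bits = bf16_params * BF16_BITS + ternary_params * TERNARY_BITS
    return total_bits / total_params


if __name__ == "__main__":
    print(f"bf16 share: {bf16_share():.2%}")
    print(f"average bits per parameter: {average_bits_per_param():.2f}")
```

In practice, storage formats rarely reach the ideal `log2(3)` bits per weight, so the real average is likely somewhat higher.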
- **Model type:** Mixture of attention heads, mixture of depths, and mixture of experts, with 1.58-bit linear layers except for **attention**
- **License:** Apache License 2.0
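
For readers unfamiliar with 1.58-bit linear layers, here is a minimal sketch of ternary weight quantization. It assumes the absmean scheme described for BitNet b1.58, where each weight is scaled by the mean absolute value and rounded to -1, 0, or 1. The code in the repository may work differently, and `ternary_quantize` and `dequantize` are hypothetical helper names written in plain Python for readability.

```python
def ternary_quantize(weights, eps=1e-8):
    """Quantize a flat list of floats to {-1, 0, 1} with one shared scale.

    Assumed scheme (absmean): scale = mean(|w|), q = clip(round(w / scale), -1, 1).
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    scale = max(scale, eps)  # avoid division by zero for all-zero weights
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map ternary values back to approximate float weights."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    w = [0.9, -0.05, 0.4, -1.2]
    q, s = ternary_quantize(w)
    print(q, s)
    print(dequantize(q, s))
```

Three possible values per weight carry `log2(3)`, roughly 1.58, bits of information, which is where the name comes from.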
### Model Sources
- **Repository:** https://github.com/ostix360/optimized-LLM
## How to Get Started with the Model
To test this model, see the repository at this [commit](https://github.com/ostix360/optimized-LLM/tree/796cfe43cf16461b92102cf0f41e8960cd91340b).
## Training Details
- **wandb:** [training details](https://wandb.ai/ostix360/Mixture%20of%20mixture%20(mod,%20moah%20moe)/runs/0ayclh2i)
### Training Data
We use the first ~0.5B tokens of Locutusque/UltraTextbooks to train this model.
### Training Procedure
We use 8-bit Adam with the default beta and epsilon values.
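
As a reminder of what the beta and epsilon values control, here is a sketch of a single Adam update on one scalar parameter. It is plain full-precision Adam, not the 8-bit variant used for training, and it assumes the defaults are the common ones (`betas=(0.9, 0.999)`, `eps=1e-8`). The learning rate and the function name `adam_step` are illustrative only.

```python
import math


def adam_step(param, grad, m, v, step, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One Adam update for a scalar parameter.

    m and v are the running first and second moment estimates;
    step is the 1-based update count used for bias correction.
    """
    beta1, beta2 = betas
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** step)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** step)  # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


if __name__ == "__main__":
    p, m, v = 1.0, 0.0, 0.0
    p, m, v = adam_step(p, grad=0.5, m=m, v=v, step=1)
    print(p, m, v)
```

An 8-bit optimizer keeps the same update rule but stores the moment estimates `m` and `v` in a quantized form to save memory.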
#### Preprocessing
The training samples fit the model's maximum length, i.e. 512 tokens.
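
One simple way to make tokenized text fit a 512-token limit is to split it into consecutive chunks. The sketch below assumes this chunking approach. The preprocessing in the repository may instead truncate or pack samples, and `chunk_token_ids` is a hypothetical helper, not a function from the repository.

```python
MAX_LENGTH = 512  # model max length stated in this card


def chunk_token_ids(token_ids, max_length=MAX_LENGTH):
    """Split a list of token ids into consecutive chunks of at most max_length."""
    return [
        token_ids[start:start + max_length]
        for start in range(0, len(token_ids), max_length)
    ]


if __name__ == "__main__":
    fake_ids = list(range(1300))  # stand-in for real tokenizer output
    chunks = chunk_token_ids(fake_ids)
    print([len(c) for c in chunks])
```

The last chunk is usually shorter than 512 tokens and would need padding or dropping before batching.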
#### Training Hyperparameters
See the wandb metadata or `train.py` in the repository for the hyperparameters.
## Technical Specifications
### Compute Infrastructure
#### Hardware
- One NVIDIA RTX 4070 Ti GPU
#### Software
- PyTorch, Transformers, etc.