Jamba 4xMoE (Slerp Merge)

This model was merged from Jamba, a 52B-parameter model with 16 experts. An accumulative SLERP (spherical linear interpolation) was used to reduce the number of experts from 16 to 4.
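The exact merge script is not given here, so the following is only a minimal sketch of what an accumulative SLERP over expert weights could look like. It assumes that contiguous groups of four experts are folded into one, and that each newly folded expert gets weight 1/(i+1) so that all experts in a group contribute roughly equally. The function names (`slerp`, `accumulate_slerp`, `merge_experts`), the grouping, and the interpolation schedule are all illustrative assumptions, and random arrays stand in for real expert weights.

```python
import numpy as np

def slerp(w_a, w_b, t, eps=1e-8):
    """Spherical linear interpolation between two weight tensors of equal shape."""
    a = w_a.ravel().astype(np.float64)
    b = w_b.ravel().astype(np.float64)
    # Angle between the two flattened weight vectors.
    cos_omega = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.sin(omega) < eps:
        # Nearly (anti-)parallel vectors: fall back to linear interpolation.
        merged = (1.0 - t) * a + t * b
    else:
        merged = (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)
    return merged.reshape(w_a.shape)

def accumulate_slerp(experts):
    """Fold a group of experts into one running merge (assumed schedule: t = 1/(i+1))."""
    merged = experts[0]
    for i, expert in enumerate(experts[1:], start=1):
        merged = slerp(merged, expert, t=1.0 / (i + 1))
    return merged

def merge_experts(experts, n_out=4):
    """Merge len(experts) experts into n_out by folding contiguous groups (assumed grouping)."""
    group = len(experts) // n_out
    return [accumulate_slerp(experts[g * group:(g + 1) * group]) for g in range(n_out)]

# Toy stand-ins for the 16 expert weight matrices of one MoE layer.
rng = np.random.default_rng(0)
experts = [rng.standard_normal((8, 16)) for _ in range(16)]
merged = merge_experts(experts, n_out=4)
print(len(merged), merged[0].shape)
```

In a real merge, the same procedure would have to be applied to every expert parameter tensor in every MoE layer, and the router would need to be adjusted to the smaller expert count; neither step is shown here.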

4-Bit Inference Code

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "isemmanuelolowe/Jamba-4xMoE_slerp"

tokenizer = AutoTokenizer.from_pretrained(model_id)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    # load_in_8bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    llm_int8_skip_modules=["mamba"],
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    quantization_config=quantization_config
)

input_ids = tokenizer("Hi, how are you?", return_tensors="pt")["input_ids"].to("cuda")

# Greedy decoding with a repetition penalty.
out = model.generate(input_ids, max_new_tokens=256, do_sample=False, repetition_penalty=1.2)
print(tokenizer.batch_decode(out, skip_special_tokens=True))

OUTPUT:

["Hi, how are you?\n\nHello. I am a 20-year old female and in my prime of life. And the other day I was told that I have been on this site for over than three years now. That is why I can be here to help others with their issues or concerns about themselves as well. It's not just me who has done it all these days without any reason whatsoever! So least say something good about yourself too: Because there exists no point at which anyone would want anything else from us except our own self esteem being restored again soon enough so we could get back into things properly once more before starting up another new chapter somewhere far away where nobody knows what happens next after each passing second until finally coming full force against reality itself whence already having taken place long ago but only because one person had gone ahead first thing along making sure everything went according due diligence beforehand rather then letting someone else do his/her job instead later down line if he didn't know much better himself yet still doing nothing right way around anyway since always trying hard enough even though never actually knowing exactly whether whomsoever did indeed come across him during course workday routine checkups every single hour throughout weeklong duration period wise enough considering carefully enough times between both sides equally balanced consideration given proper"]

A chat LoRA adapter is available for this model here.

Model size: 17.7B params (Safetensors, tensor type F32)
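The parameter count gives a rough sense of how much memory the weights alone need at different precisions. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes every parameter is stored at the stated width and ignores quantization constants, activations, and any cache. Because the inference code above leaves the "mamba" modules unquantized, the 4-bit figure should be read as a lower bound.

```python
N_PARAMS = 17.7e9  # parameter count reported for this model

# Storage width per parameter, in bytes, for each precision.
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "nf4 (4-bit)": 0.5}

def weight_gb(n_params, bytes_per_param):
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

estimates = {name: weight_gb(N_PARAMS, b) for name, b in BYTES_PER_PARAM.items()}
for name, gb in estimates.items():
    print(f"{name}: ~{gb:.2f} GB")
```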