dewabrata/cerita_seru_70B_quantized

Model Description

This is a quantized version of the LLaMA 70B model fine-tuned for generating creative stories. The model has been quantized to 8-bit precision with bitsandbytes, significantly reducing memory requirements while preserving most of the original model's performance.


Key Features

  • Base Model: LLaMA 70B
  • Quantization: 8-bit (INT8) with bitsandbytes
  • Task: Text generation
  • Memory Efficiency: Roughly halves weight memory relative to FP16, enabling inference on high-memory GPUs (e.g., a single NVIDIA A100 80GB) or multi-GPU setups.

Usage

You can use this model for text generation tasks with the Hugging Face Transformers library or with vLLM for efficient inference.

Example Code with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load the model in 8-bit and the tokenizer; device_map="auto" spreads weights across available GPUs
model_name = "dewabrata/cerita_seru_70B_quantized"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True)
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate text
# Prompt (Indonesian): "Tell a story about Widya, a woman in a hijab who lives her life with
# enthusiasm and has an extraordinary talent for painting."
prompt = "Ceritakan tentang Widya, seorang wanita berhijab yang bersemangat menjalani hidupnya dan memiliki bakat luar biasa dalam seni lukis."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
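
For interactive use you can stream tokens as they are generated instead of waiting for the full completion. A minimal sketch using transformers' TextStreamer, reusing the model, tokenizer, and prompt loaded above:

from transformers import TextStreamer

# Print decoded tokens to stdout as they are generated, skipping the prompt echo
streamer = TextStreamer(tokenizer, skip_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
model.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=500,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)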

Example Code with vLLM

from vllm import LLM, SamplingParams

# Load model with vLLM; the bitsandbytes quantization backend handles the 8-bit weights
# (quantization="bitsandbytes" requires a vLLM version with bitsandbytes support)
model_name = "dewabrata/cerita_seru_70B_quantized"
llm = LLM(model_name, quantization="bitsandbytes")

# Generate text
prompt = "Ceritakan tentang Widya, seorang wanita berhijab yang bersemangat menjalani hidupnya dan memiliki bakat luar biasa dalam seni lukis."
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=500
)

outputs = llm.generate([prompt], sampling_params)
# llm.generate returns one RequestOutput per prompt; each holds a list of completions
print(outputs[0].outputs[0].text)
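
vLLM schedules requests in batches, so several prompts can be generated in a single call. A minimal sketch reusing llm and sampling_params from above (the extra prompts are illustrative only):

# Batched generation: vLLM processes all prompts together for higher throughput
prompts = [
    "Ceritakan tentang Widya, seorang pelukis muda.",       # "Tell about Widya, a young painter."
    "Ceritakan tentang sebuah desa kecil di tepi danau.",   # "Tell about a small village by a lake."
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    # Each RequestOutput carries its prompt and a list of completions
    print(output.prompt)
    print(output.outputs[0].text)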

Performance

8-bit quantization roughly halves weight memory relative to FP16 and quarters it relative to FP32, making inference of the 70B model practical on far less hardware, typically with only a small loss in generation quality.

Resource Requirements

  • Memory: Roughly 70GB of VRAM for the 8-bit weights alone (about 1 byte per parameter), plus headroom for activations and the KV cache; see the back-of-envelope estimate below.
  • Inference Speed: bitsandbytes INT8 is primarily a memory optimization; per-token latency is typically comparable to, and sometimes slower than, FP16 due to dequantization overhead.
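
As a back-of-envelope check on the memory figure above, weight memory scales with bytes per parameter (activations, KV cache, and framework overhead come on top):

# Rough weight-only memory estimate for a 70.6B-parameter model
PARAMS = 70.6e9

for precision, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    gb = PARAMS * bytes_per_param / 1e9
    print(f"{precision}: ~{gb:.0f} GB of weights")  # FP32 ~282 GB, FP16 ~141 GB, INT8 ~71 GB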

Limitations

  • Precision: While 8-bit quantization maintains most of the model's performance, there may be minor degradation in accuracy compared to FP16.
  • Large Models: The quantized version still requires substantial GPU memory.

Training Details

  • Base Model: LLaMA 70B
  • Fine-tuning Dataset: Custom dataset for storytelling tasks.
  • Quantization Method: INT8 quantization with bitsandbytes (load_in_8bit=True / BitsAndBytesConfig(load_in_8bit=True)); see the sketch after this list.
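
For reference, a checkpoint like this can be reproduced by loading the fine-tuned FP16 weights with an 8-bit config and re-saving them. A sketch under the assumption that the fine-tuned model sits at a hypothetical local path and that the installed transformers/bitsandbytes versions support 8-bit serialization:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

source = "./llama-70b-cerita-finetuned"  # hypothetical path to the FP16 fine-tuned model
model = AutoModelForCausalLM.from_pretrained(
    source,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True)
)
tokenizer = AutoTokenizer.from_pretrained(source)

# Recent transformers releases can serialize bitsandbytes 8-bit weights directly
model.save_pretrained("cerita_seru_70B_quantized")
tokenizer.save_pretrained("cerita_seru_70B_quantized")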

How to Deploy

You can deploy this model on Hugging Face Spaces or run it locally for inference; a minimal Gradio sketch is shown below. For best performance, use high-memory GPUs such as the NVIDIA A100 80GB.
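
A minimal Gradio app of the kind that runs on a GPU-backed Hugging Face Space, assuming the same loading code as in the Transformers example above:

import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "dewabrata/cerita_seru_70B_quantized"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True)
)

def generate_story(prompt: str) -> str:
    # Tokenize, sample a continuation, and return the decoded story
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.7, top_p=0.9)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

gr.Interface(fn=generate_story, inputs="text", outputs="text", title="cerita_seru_70B_quantized").launch()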


Citation

If you use this model, please cite:

@misc{dewabrata2024,
  author = {Dewabrata},
  title = {Ceritakan Tentang Widya - Quantized LLaMA 70B},
  year = {2024},
  howpublished = {\url{https://huggingface.co/dewabrata/cerita_seru_70B_quantized}},
}

License

The model inherits the license of the base LLaMA 70B model. Please ensure compliance with its terms before using this model.
