# dewabrata/cerita_seru_70B_quantized
## Model Description
This is a quantized version of the LLaMA 70B model fine-tuned for generating creative stories. The model has been quantized to 8-bit precision using BitsAndBytes, significantly reducing memory requirements while maintaining most of the model's original performance.
## Key Features
- Base Model: LLaMA 70B
- Quantization: 8-bit (INT8) with BitsAndBytes
- Task: Text generation
- Memory Efficiency: Roughly halves VRAM requirements compared to FP16, making single-GPU inference feasible on an 80 GB card such as the NVIDIA A100.
## Usage
You can use this model for text generation tasks with the Hugging Face Transformers library or with vLLM for efficient inference.
### Example Code with Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 8-bit model and its tokenizer
model_name = "dewabrata/cerita_seru_70B_quantized"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_8bit=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate text. The Indonesian prompt asks for a story about Widya, a woman in
# hijab who lives her life with passion and has an extraordinary talent for painting.
prompt = "Ceritakan tentang Widya, seorang wanita berhijab yang bersemangat menjalani hidupnya dan memiliki bakat luar biasa dalam seni lukis."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
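
On newer versions of transformers, the 8-bit flag is usually passed through a `BitsAndBytesConfig` rather than directly to `from_pretrained`; a minimal sketch of the equivalent loading code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "dewabrata/cerita_seru_70B_quantized"

# Equivalent 8-bit loading via an explicit quantization config
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```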
### Example Code with vLLM
```python
from vllm import LLM, SamplingParams

# Load the model with vLLM
model_name = "dewabrata/cerita_seru_70B_quantized"
llm = LLM(model_name)

# Generate text (same Indonesian story prompt as in the Transformers example)
prompt = "Ceritakan tentang Widya, seorang wanita berhijab yang bersemangat menjalani hidupnya dan memiliki bakat luar biasa dalam seni lukis."
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=500
)
outputs = llm.generate([prompt], sampling_params)
# Each RequestOutput stores its completions under .outputs
print(outputs[0].outputs[0].text)
```
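
Since vLLM batches requests, several prompts can be passed in a single call; a minimal sketch with illustrative prompts (not taken from the training data):

```python
from vllm import LLM, SamplingParams

llm = LLM("dewabrata/cerita_seru_70B_quantized")
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=500)

prompts = [
    # "Tell a story about a young painter."
    "Ceritakan sebuah kisah tentang seorang pelukis muda.",
    # "Tell a story about a fisherman in a small village."
    "Ceritakan sebuah kisah tentang seorang nelayan di desa kecil.",
]

# generate() returns one RequestOutput per prompt, in order
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
```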
## Performance
Quantization reduces the model size and memory usage, enabling efficient inference without a significant loss in accuracy.
## Resource Requirements
- Memory: roughly 70 GB of VRAM for the 8-bit weights of LLaMA 70B (see the back-of-the-envelope estimate below), plus headroom for the KV cache and activations.
- Inference Speed: the main benefit is the reduced memory footprint; 8-bit kernels are typically on par with, or somewhat slower than, FP16, depending on hardware and batch size.
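
As a rough back-of-the-envelope check on the memory figures (weights only; the KV cache and activations come on top):

```python
# Rough weight-only memory estimate for a 70B-parameter model
params = 70e9
fp16_gb = params * 2 / 1e9  # 2 bytes per parameter -> ~140 GB
int8_gb = params * 1 / 1e9  # 1 byte per parameter  -> ~70 GB

print(f"FP16 weights: ~{fp16_gb:.0f} GB")
print(f"INT8 weights: ~{int8_gb:.0f} GB")
```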
## Limitations
- Precision: While 8-bit quantization maintains most of the model's performance, there may be minor degradation in accuracy compared to FP16.
- Model Size: Even after quantization, a 70B model still requires substantial GPU memory.
## Training Details
- Base Model: LLaMA 70B
- Fine-tuning Dataset: Custom dataset for storytelling tasks.
- Quantization Method: INT8 quantization using `load_in_8bit=True` from bitsandbytes (see the sketch below).
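
For reference, a sketch of how an 8-bit checkpoint like this can be produced from a fine-tuned FP16 model, assuming a transformers/bitsandbytes version recent enough to serialize INT8 weights (the source path is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder path to the fine-tuned FP16 checkpoint (not the actual source)
source_model = "path/to/finetuned-llama-70b"

# Load the FP16 weights directly into 8-bit with bitsandbytes
model = AutoModelForCausalLM.from_pretrained(
    source_model,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
tokenizer = AutoTokenizer.from_pretrained(source_model)

# Serialize the quantized weights and tokenizer for upload
model.save_pretrained("cerita_seru_70B_quantized")
tokenizer.save_pretrained("cerita_seru_70B_quantized")
```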
## How to Deploy
You can deploy this model on Hugging Face Spaces or use it locally for inference. For best performance, use GPUs like NVIDIA A100 or similar.
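
For a Space, the usual pattern is a small Gradio app wrapping the loading code above; a minimal sketch (the interface details are assumptions, not part of this repository):

```python
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "dewabrata/cerita_seru_70B_quantized"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tell_story(prompt: str) -> str:
    # Tokenize the prompt and sample a continuation
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs, max_new_tokens=500, do_sample=True, temperature=0.7, top_p=0.9
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

demo = gr.Interface(fn=tell_story, inputs="text", outputs="text", title="Cerita Seru 70B")
demo.launch()
```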
## Citation
If you use this model, please cite:
```bibtex
@misc{dewabrata2024,
  author       = {Dewabrata},
  title        = {Ceritakan Tentang Widya - Quantized LLaMA 70B},
  year         = {2024},
  howpublished = {\url{https://huggingface.co/dewabrata/cerita_seru_70B_quantized}},
}
```
## License
The model inherits the license of the base LLaMA 70B model. Please ensure compliance with its terms before using this model.