dewabrata/cerita_seru_70B_quantized

Model Description

This is a quantized version of the LLaMA 70B model fine-tuned for generating creative stories. The model has been quantized to 8-bit precision with bitsandbytes, significantly reducing memory requirements while preserving most of the original model's performance.


Key Features

  • Base Model: LLaMA 70B
  • Quantization: 8-bit (INT8) with bitsandbytes
  • Task: Text generation
  • Memory Efficiency: Roughly halves weight memory relative to FP16, enabling inference on high-memory GPUs (e.g., a single NVIDIA A100 80GB) or multi-GPU setups.

Usage

You can use this model for text generation tasks with the Hugging Face Transformers library or with vLLM for efficient inference.

Example Code with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load the model in 8-bit and the tokenizer; device_map="auto" spreads weights across available GPUs
model_name = "dewabrata/cerita_seru_70B_quantized"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True)
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate text
# Prompt (Indonesian): "Tell a story about Widya, a woman in a hijab who lives her life with
# enthusiasm and has an extraordinary talent for painting."
prompt = "Ceritakan tentang Widya, seorang wanita berhijab yang bersemangat menjalani hidupnya dan memiliki bakat luar biasa dalam seni lukis."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
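
For interactive use you can stream tokens as they are generated instead of waiting for the full completion. A minimal sketch using transformers' TextStreamer, reusing the model, tokenizer, and prompt loaded above:

from transformers import TextStreamer

# Print decoded tokens to stdout as they are generated, skipping the prompt echo
streamer = TextStreamer(tokenizer, skip_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
model.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=500,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)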

Example Code with vLLM

from vllm import LLM, SamplingParams

# Load model with vLLM; the bitsandbytes quantization backend handles the 8-bit weights
# (quantization="bitsandbytes" requires a vLLM version with bitsandbytes support)
model_name = "dewabrata/cerita_seru_70B_quantized"
llm = LLM(model_name, quantization="bitsandbytes")

# Generate text
prompt = "Ceritakan tentang Widya, seorang wanita berhijab yang bersemangat menjalani hidupnya dan memiliki bakat luar biasa dalam seni lukis."
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=500
)

outputs = llm.generate([prompt], sampling_params)
# llm.generate returns one RequestOutput per prompt; each holds a list of completions
print(outputs[0].outputs[0].text)
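
vLLM schedules requests in batches, so several prompts can be generated in a single call. A minimal sketch reusing llm and sampling_params from above (the extra prompts are illustrative only):

# Batched generation: vLLM processes all prompts together for higher throughput
prompts = [
    "Ceritakan tentang Widya, seorang pelukis muda.",       # "Tell about Widya, a young painter."
    "Ceritakan tentang sebuah desa kecil di tepi danau.",   # "Tell about a small village by a lake."
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    # Each RequestOutput carries its prompt and a list of completions
    print(output.prompt)
    print(output.outputs[0].text)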

Performance

8-bit quantization roughly halves weight memory relative to FP16 and quarters it relative to FP32, making inference of the 70B model practical on far less hardware, typically with only a small loss in generation quality.

Resource Requirements

  • Memory: Roughly 70GB of VRAM for the 8-bit weights alone (about 1 byte per parameter), plus headroom for activations and the KV cache; see the back-of-envelope estimate below.
  • Inference Speed: bitsandbytes INT8 is primarily a memory optimization; per-token latency is typically comparable to, and sometimes slower than, FP16 due to dequantization overhead.
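
As a back-of-envelope check on the memory figure above, weight memory scales with bytes per parameter (activations, KV cache, and framework overhead come on top):

# Rough weight-only memory estimate for a 70.6B-parameter model
PARAMS = 70.6e9

for precision, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    gb = PARAMS * bytes_per_param / 1e9
    print(f"{precision}: ~{gb:.0f} GB of weights")  # FP32 ~282 GB, FP16 ~141 GB, INT8 ~71 GB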

Limitations

  • Precision: While 8-bit quantization maintains most of the model's performance, there may be minor degradation in accuracy compared to FP16.
  • Large Models: The quantized version still requires substantial GPU memory.

Training Details

  • Base Model: LLaMA 70B
  • Fine-tuning Dataset: Custom dataset for storytelling tasks.
  • Quantization Method: INT8 quantization with bitsandbytes (load_in_8bit=True / BitsAndBytesConfig(load_in_8bit=True)); see the sketch after this list.
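
For reference, a checkpoint like this can be reproduced by loading the fine-tuned FP16 weights with an 8-bit config and re-saving them. A sketch under the assumption that the fine-tuned model sits at a hypothetical local path and that the installed transformers/bitsandbytes versions support 8-bit serialization:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

source = "./llama-70b-cerita-finetuned"  # hypothetical path to the FP16 fine-tuned model
model = AutoModelForCausalLM.from_pretrained(
    source,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True)
)
tokenizer = AutoTokenizer.from_pretrained(source)

# Recent transformers releases can serialize bitsandbytes 8-bit weights directly
model.save_pretrained("cerita_seru_70B_quantized")
tokenizer.save_pretrained("cerita_seru_70B_quantized")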

How to Deploy

You can deploy this model on Hugging Face Spaces or run it locally for inference; a minimal Gradio sketch is shown below. For best performance, use high-memory GPUs such as the NVIDIA A100 80GB.
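
A minimal Gradio app of the kind that runs on a GPU-backed Hugging Face Space, assuming the same loading code as in the Transformers example above:

import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "dewabrata/cerita_seru_70B_quantized"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True)
)

def generate_story(prompt: str) -> str:
    # Tokenize, sample a continuation, and return the decoded story
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.7, top_p=0.9)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

gr.Interface(fn=generate_story, inputs="text", outputs="text", title="cerita_seru_70B_quantized").launch()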


Citation

If you use this model, please cite:

@misc{dewabrata2024,
  author = {Dewabrata},
  title = {Ceritakan Tentang Widya - Quantized LLaMA 70B},
  year = {2024},
  howpublished = {\url{https://huggingface.co/dewabrata/cerita_seru_70B_quantized}},
}

License

The model inherits the license of the base LLaMA 70B model. Please ensure compliance with its terms before using this model.
