dewabrata/cerita_seru_70B
Model Description
This is the full-precision release of a LLaMA 70B model fine-tuned for generating creative stories. The weights are kept in FP16 (no quantization), prioritizing output quality over memory footprint.
Key Features
- Base Model: LLaMA 70B
- Precision: FP16 (unquantized)
- Task: Creative text generation
- Performance: Prioritizes output quality; inference requires substantial GPU memory.
Usage
You can use this model for text generation tasks with the Hugging Face Transformers library or with vLLM for efficient inference.
Example Code with Transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "dewabrata/cerita_seru_70B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Indonesian prompt: "Tell a story about Rina, a woman in hijab who lives her
# life with enthusiasm and has an extraordinary talent for painting."
prompt = "Ceritakan tentang Rina, seorang wanita berhijab yang bersemangat menjalani hidupnya dan memiliki bakat luar biasa dalam seni lukis."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=500,
    do_sample=True,  # required for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Example Code with vLLM
from vllm import LLM, SamplingParams

# Load model with vLLM
model_name = "dewabrata/cerita_seru_70B"
llm = LLM(model=model_name)

# Same Indonesian prompt as in the Transformers example above
prompt = "Ceritakan tentang Rina, seorang wanita berhijab yang bersemangat menjalani hidupnya dan memiliki bakat luar biasa dalam seni lukis."
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=500,
)
outputs = llm.generate([prompt], sampling_params)
# Each RequestOutput holds one or more completions; take the first.
print(outputs[0].outputs[0].text)
Performance
Relative to quantized variants, the full-precision model delivers the best text quality, but it requires significantly more computational resources for inference.
Resource Requirements
- Memory: ~140GB of VRAM for the FP16 weights alone (70B parameters × 2 bytes; see the estimate below), plus headroom for activations and the KV cache.
- Inference Speed: Slower than quantized variants, which move fewer bytes per token during memory-bound decoding.
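For a rough sense of where the weight-memory figure comes from, here is a back-of-the-envelope estimate. This is illustrative arithmetic only; real deployments need extra headroom for activations and the KV cache.

# Back-of-the-envelope VRAM estimate for the FP16 weights (illustrative only)
params = 70e9          # approximate parameter count of LLaMA 70B
bytes_per_param = 2    # FP16 stores each parameter in 2 bytes
weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")  # ~140 GB before runtime overhead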
Limitations
- Hardware Requirements: The FP16 weights exceed a single 80GB GPU, so plan on a multi-GPU setup (e.g., 2× NVIDIA A100 80GB) or CPU offloading.
- Latency: Higher latency than quantized models, since full-precision weights mean more data moved per generated token. If memory is the constraint, loading with on-the-fly quantization is one workaround; a hedged sketch follows this list.
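The repository ships FP16 weights, but Transformers can quantize them at load time. The following is a minimal sketch, assuming the optional bitsandbytes package is installed; 4-bit loading trades some output quality for a much smaller memory footprint.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Quantize the FP16 checkpoint to 4-bit on the fly (assumes bitsandbytes is installed)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # matmuls still run in FP16
)
model = AutoModelForCausalLM.from_pretrained(
    "dewabrata/cerita_seru_70B",
    device_map="auto",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained("dewabrata/cerita_seru_70B")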
Training Details
- Base Model: LLaMA 70B
- Fine-tuning Dataset: Custom dataset for storytelling tasks.
- Precision: FP16, preserving the full quality of the fine-tuned weights.
How to Deploy
You can deploy this model on Hugging Face Spaces or run it locally for inference. For reasonable throughput, use data-center GPUs such as the NVIDIA A100; because a single 80GB card cannot hold the FP16 weights, plan for a multi-GPU launch, as sketched below.
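A minimal multi-GPU sketch with vLLM, assuming two A100 80GB GPUs are visible; tensor_parallel_size must match the number of GPUs actually available.

from vllm import LLM, SamplingParams

# Shard the FP16 weights across two GPUs with tensor parallelism
llm = LLM(model="dewabrata/cerita_seru_70B", tensor_parallel_size=2)

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=500)
# Short Indonesian prompt: "Tell a story about Rina."
outputs = llm.generate(["Ceritakan tentang Rina."], sampling_params)
print(outputs[0].outputs[0].text)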
Citation
If you use this model, please cite:
@misc{dewabrata2024,
  author       = {Dewabrata},
  title        = {Cerita Panas Generator - LLaMA 70B},
  year         = {2024},
  howpublished = {\url{https://huggingface.co/dewabrata/cerita_seru_70B}},
}
License
The model inherits the license of the base LLaMA 70B model. Please ensure compliance with its terms before using this model.