---
license: apache-2.0
language:
  - id
base_model:
  - meta-llama/Llama-3.1-70B
pipeline_tag: text-generation
library_name: vllm
tags:
  - cerita
  - quantized
  - vllm
inference: true
---

# `dewabrata/cerita_seru_70B`

## Model Description

This is the original version of the Llama 3.1 70B model, fine-tuned for generating creative stories in Indonesian. The model retains full FP16 precision and is intended for high-quality text generation tasks.

---

## Key Features

- **Base Model**: [meta-llama/Llama-3.1-70B](https://huggingface.co/meta-llama/Llama-3.1-70B)
- **Precision**: FP16 for high output quality
- **Task**: Text generation (Indonesian storytelling)
- **Performance**: Designed for high-quality text generation; requires substantial GPU memory.

---

## Usage

You can use this model for text generation with the Hugging Face Transformers library or with [vLLM](https://github.com/vllm-project/vllm) for efficient inference.

### Example Code with Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "dewabrata/cerita_seru_70B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate text (the prompt asks, in Indonesian, for a story about Rina,
# a woman in hijab with an extraordinary talent for painting)
prompt = "Ceritakan tentang Rina, seorang wanita berhijab yang bersemangat menjalani hidupnya dan memiliki bakat luar biasa dalam seni lukis."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Example Code with vLLM

```python
from vllm import LLM, SamplingParams

# Load the model with vLLM
# (for multi-GPU setups, also pass tensor_parallel_size=<number of GPUs>)
model_name = "dewabrata/cerita_seru_70B"
llm = LLM(model=model_name)

# Generate text
prompt = "Ceritakan tentang Rina, seorang wanita berhijab yang bersemangat menjalani hidupnya dan memiliki bakat luar biasa dalam seni lukis."
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=500,
)
outputs = llm.generate([prompt], sampling_params)
# Each RequestOutput holds one or more completions; print the first one
print(outputs[0].outputs[0].text)
```

---

## Performance

The full-precision model delivers the highest accuracy and text quality, but it requires significant computational resources for inference.

### Resource Requirements

- **Memory**: roughly 140 GB of VRAM for the FP16 weights alone (70B parameters × 2 bytes), plus additional memory for the KV cache; in practice this means a multi-GPU setup (e.g., 2× 80 GB A100/H100).
- **Inference Speed**: Slower than quantized versions due to the larger memory footprint and higher computational cost.

---

## Limitations

- **Hardware Requirements**: The FP16 weights do not fit on a single 80 GB GPU, so a distributed multi-GPU setup is typically required.
- **Latency**: Higher latency than quantized models due to full-precision computation.

---

## Training Details

- **Base Model**: Llama 3.1 70B
- **Fine-tuning Dataset**: Custom dataset for storytelling tasks.
- **Precision**: FP16.

---

## How to Deploy

You can deploy this model on Hugging Face Spaces or run it locally for inference. For best performance, use GPUs such as the NVIDIA A100 (or similar) with sufficient combined VRAM; a minimal multi-GPU sketch with vLLM is included in the appendix at the end of this card.

---

## Citation

If you use this model, please cite:

```bibtex
@misc{dewabrata2024,
  author       = {Dewabrata},
  title        = {Cerita Panas Generator - LLaMA 70B},
  year         = {2024},
  howpublished = {\url{https://huggingface.co/dewabrata/cerita_seru_70B}},
}
```

---

## License

The model inherits the license terms of the base Llama 3.1 70B model. Please ensure compliance with those terms before using this model.
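
---

## Appendix: Multi-GPU Inference Sketch (vLLM)

As noted under "How to Deploy", the FP16 weights typically need to be sharded across several GPUs. The snippet below is a minimal sketch rather than a tested configuration: it assumes two 80 GB-class GPUs and uses vLLM's `tensor_parallel_size` parameter to split the model across them; adjust the value to the number of GPUs actually available.

```python
from vllm import LLM, SamplingParams

# Minimal sketch (assumption: two 80 GB-class GPUs are available).
# tensor_parallel_size shards the FP16 weights across that many GPUs.
llm = LLM(
    model="dewabrata/cerita_seru_70B",
    dtype="float16",
    tensor_parallel_size=2,  # adjust to your GPU count
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=500)
prompt = "Ceritakan tentang Rina, seorang wanita berhijab yang bersemangat menjalani hidupnya dan memiliki bakat luar biasa dalam seni lukis."

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```

Tensor parallelism keeps every GPU busy on each token and is usually the simplest way to fit a 70B FP16 model; the trade-off is that the participating GPUs should sit in the same node with a fast interconnect.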