---
license: apache-2.0
language:
  - id
base_model:
  - meta-llama/Llama-3.1-70B
pipeline_tag: text-generation
library_name: vllm
tags:
  - cerita
  - quantized
  - vllm
inference: true
---

# `dewabrata/cerita_seru_70B`

## Model Description

This is the original version of the Llama 3.1 70B model, fine-tuned for generating creative stories in Indonesian. The model retains full FP16 precision and is intended for high-quality text generation tasks.

---

## Key Features

- **Base Model**: [meta-llama/Llama-3.1-70B](https://huggingface.co/meta-llama/Llama-3.1-70B)
- **Precision**: FP16 for high output quality
- **Task**: Text generation (Indonesian storytelling)
- **Performance**: Designed for high-quality text generation; requires substantial GPU memory.

---

## Usage

You can use this model for text generation with the Hugging Face Transformers library or with [vLLM](https://github.com/vllm-project/vllm) for efficient inference.

### Example Code with Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "dewabrata/cerita_seru_70B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate text (the prompt asks, in Indonesian, for a story about Rina,
# a woman in hijab with an extraordinary talent for painting)
prompt = "Ceritakan tentang Rina, seorang wanita berhijab yang bersemangat menjalani hidupnya dan memiliki bakat luar biasa dalam seni lukis."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Example Code with vLLM

```python
from vllm import LLM, SamplingParams

# Load the model with vLLM
# (for multi-GPU setups, also pass tensor_parallel_size=<number of GPUs>)
model_name = "dewabrata/cerita_seru_70B"
llm = LLM(model=model_name)

# Generate text
prompt = "Ceritakan tentang Rina, seorang wanita berhijab yang bersemangat menjalani hidupnya dan memiliki bakat luar biasa dalam seni lukis."
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=500,
)
outputs = llm.generate([prompt], sampling_params)
# Each RequestOutput holds one or more completions; print the first one
print(outputs[0].outputs[0].text)
```

---

## Performance

The full-precision model delivers the highest accuracy and text quality, but it requires significant computational resources for inference.

### Resource Requirements

- **Memory**: roughly 140 GB of VRAM for the FP16 weights alone (70B parameters × 2 bytes), plus additional memory for the KV cache; in practice this means a multi-GPU setup (e.g., 2× 80 GB A100/H100).
- **Inference Speed**: Slower than quantized versions due to the larger memory footprint and higher computational cost.

---

## Limitations

- **Hardware Requirements**: The FP16 weights do not fit on a single 80 GB GPU, so a distributed multi-GPU setup is typically required.
- **Latency**: Higher latency than quantized models due to full-precision computation.

---

## Training Details

- **Base Model**: Llama 3.1 70B
- **Fine-tuning Dataset**: Custom dataset for storytelling tasks.
- **Precision**: FP16.

---

## How to Deploy

You can deploy this model on Hugging Face Spaces or run it locally for inference. For best performance, use GPUs such as the NVIDIA A100 (or similar) with sufficient combined VRAM; a minimal multi-GPU sketch with vLLM is included in the appendix at the end of this card.

---

## Citation

If you use this model, please cite:

```bibtex
@misc{dewabrata2024,
  author       = {Dewabrata},
  title        = {Cerita Panas Generator - LLaMA 70B},
  year         = {2024},
  howpublished = {\url{https://huggingface.co/dewabrata/cerita_seru_70B}},
}
```

---

## License

The model inherits the license terms of the base Llama 3.1 70B model. Please ensure compliance with those terms before using this model.
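
---

## Appendix: Multi-GPU Inference Sketch (vLLM)

As noted under "How to Deploy", the FP16 weights typically need to be sharded across several GPUs. The snippet below is a minimal sketch rather than a tested configuration: it assumes two 80 GB-class GPUs and uses vLLM's `tensor_parallel_size` parameter to split the model across them; adjust the value to the number of GPUs actually available.

```python
from vllm import LLM, SamplingParams

# Minimal sketch (assumption: two 80 GB-class GPUs are available).
# tensor_parallel_size shards the FP16 weights across that many GPUs.
llm = LLM(
    model="dewabrata/cerita_seru_70B",
    dtype="float16",
    tensor_parallel_size=2,  # adjust to your GPU count
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=500)
prompt = "Ceritakan tentang Rina, seorang wanita berhijab yang bersemangat menjalani hidupnya dan memiliki bakat luar biasa dalam seni lukis."

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```

Tensor parallelism keeps every GPU busy on each token and is usually the simplest way to fit a 70B FP16 model; the trade-off is that the participating GPUs should sit in the same node with a fast interconnect.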