Failed to create inference endpoint
Issue:
I cannot start the inference endpoint. The log says:
2023/12/07 10:53:21 ~ Error: ShardCannotStart
2023/12/07 10:53:21 ~ {"timestamp":"2023-12-07T01:53:21.369939Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
2023/12/07 10:53:21 ~ {"timestamp":"2023-12-07T01:53:21.369962Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
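For anyone reading a longer endpoint log, here is a minimal sketch that pulls out only the ERROR entries. It assumes every line has the shape seen above (a timestamp, then ` ~ `, then either plain text or a JSON object with `level` and `fields.message`). The helper names `parse_launcher_line` and `error_messages` are made up for illustration and are not part of any Hugging Face tool.

```python
import json

# Sample lines copied from the log above.
LOG = [
    '2023/12/07 10:53:21 ~ Error: ShardCannotStart',
    '2023/12/07 10:53:21 ~ {"timestamp":"2023-12-07T01:53:21.369939Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}',
    '2023/12/07 10:53:21 ~ {"timestamp":"2023-12-07T01:53:21.369962Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}',
]


def parse_launcher_line(line):
    """Split 'time ~ payload' and decode the payload if it is a JSON object."""
    prefix, sep, rest = line.partition(" ~ ")
    if not sep:
        return None  # line does not follow the assumed format
    rest = rest.strip()
    try:
        payload = json.loads(rest)
    except json.JSONDecodeError:
        payload = None
    if not isinstance(payload, dict):
        # Plain-text line such as "Error: ShardCannotStart"
        return {"time": prefix, "level": None, "message": rest}
    return {
        "time": prefix,
        "level": payload.get("level"),
        "message": payload.get("fields", {}).get("message"),
    }


def error_messages(lines):
    """Return the messages of all structured entries logged at ERROR level."""
    parsed = (parse_launcher_line(line) for line in lines)
    return [p["message"] for p in parsed if p and p["level"] == "ERROR"]


print(error_messages(LOG))
```

Here the only structured error is the shard failing to start, which says nothing about the cause, so the earlier lines of the full log are usually the place to look.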
Steps to reproduce: Deploy > Inference Endpoint > Select A10G AWS instance
Is there a way to use an inference endpoint with this LoRA model?
Thanks in advance!
Hi @brekk
I am not sure Inference Endpoints supports LoRA adapters. You should consider using the merged model (which I believe is https://huggingface.co/alignment-handbook/zephyr-7b-sft-full, right @lewtun?). If not, you can merge the model yourself; please have a look at https://huggingface.co/docs/peft/v0.7.0/en/package_reference/lora#peft.LoraModel.merge_and_unload. To merge the LoRA model you can just:
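from peft import AutoPeftModelForCausalLM
# LoRA adapter repo discussed in this thread, and the repo to push the merged model to
peft_model_id = "alignment-handbook/zephyr-7b-sft-lora"
merged_model_id = "YOUR_NEW_MODEL_ID"
model = AutoPeftModelForCausalLM.from_pretrained(peft_model_id)
# Fold the LoRA weights into the base model and drop the adapter wrappers
merged_model = model.merge_and_unload()
merged_model.push_to_hub(merged_model_id)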
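To tell whether a repo needs merging at all, one rough check is whether it contains `adapter_config.json`, the file PEFT writes when it saves an adapter. The sketch below works on a plain list of file names; the helper name `looks_like_adapter_repo` is hypothetical, and the real list can be fetched with `huggingface_hub.list_repo_files(repo_id)`.

```python
ADAPTER_CONFIG = "adapter_config.json"  # written by PEFT when saving an adapter


def looks_like_adapter_repo(filenames):
    """Return True if the file list contains a PEFT adapter config.

    `filenames` is a list of repo-relative paths, for example the result of
    huggingface_hub.list_repo_files(repo_id) (not called here, so this runs offline).
    """
    base_names = [name.rsplit("/", 1)[-1] for name in filenames]
    return ADAPTER_CONFIG in base_names


# Illustrative file lists, not copied from the actual repos.
adapter_files = ["README.md", "adapter_config.json", "adapter_model.safetensors"]
full_model_files = ["README.md", "config.json", "model-00001-of-00002.safetensors"]

print(looks_like_adapter_repo(adapter_files))
print(looks_like_adapter_repo(full_model_files))
```

If the check comes back true, the repo holds adapter weights (unless it also ships full weights), and a merge like the one above is likely needed before deploying.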
It runs on a Colab T4:
!pip install transformers
!pip install peft
!pip install accelerate
!pip install bitsandbytes
import os
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
# Create a folder for temporary storage
!mkdir -p /tmp/model_cache
# Load the base model with memory-saving settings
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    device_map="auto",
    load_in_8bit=True,  # reduce memory usage
    torch_dtype=torch.float16,
    offload_folder="/tmp/model_cache"  # temporary storage path
)
# Load the fine-tuned adapter (LoRA)
peft_model_id = "alignment-handbook/zephyr-7b-sft-lora"
model = PeftModel.from_pretrained(
    base_model,
    peft_model_id,
    offload_folder="/tmp/model_cache"
)
# Merge the adapter
model.merge_adapter()
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token
# Prepare the question (Arabic for "Who is Napoleon Bonaparte?")
prompt = "من هو نابليون بونابرت؟"
# Tokenize the input
inputs = tokenizer(prompt, return_tensors="pt").to("cuda" if torch.cuda.is_available() else "cpu")
# Generate the answer
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        max_length=150,  # cap the answer length
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id
    )
# Decode and print the answer
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
# Free memory
del model
del base_model
torch.cuda.empty_cache()
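As a rough illustration of why `load_in_8bit=True` matters on a T4 (16 GB of memory), here is a back-of-the-envelope estimate of the weight memory alone. The parameter count of about 7.24 billion for Mistral-7B is an approximation, and the estimate ignores activations, the KV cache, and quantization overhead, so real usage will be higher.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


N_PARAMS = 7.24e9  # approximate parameter count for Mistral-7B (assumption)

fp16_gib = weight_memory_gib(N_PARAMS, 16)  # about 13.5 GiB
int8_gib = weight_memory_gib(N_PARAMS, 8)   # about 6.7 GiB

print(f"fp16 weights: {fp16_gib:.1f} GiB")
print(f"int8 weights: {int8_gib:.1f} GiB")
```

In fp16 the weights alone nearly fill the card, while 8-bit loading leaves room for generation.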