error loading model
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'mllama'
llama_load_model_from_file: failed to load model
llama_model_loader: loaded meta data with 26 key-value pairs and 396 tensors from /home/haoze/.cache/huggingface/hub/models--leafspark--Llama-3.2-11B-Vision-Instruct-GGUF/snapshots/b40710bc55c0137565d23c06f37352082b17937d/./Llama-3.2-11B-Vision-Instruct.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 7
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 17: tokenizer.ggml.pre str = smaug-bpe
llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,128257] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 20: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 21: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 24: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q8_0: 282 tensors
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'mllama'
llama_load_model_from_file: failed to load model
ValueError Traceback (most recent call last)
Cell In[9], line 3
1 from llama_cpp import Llama
----> 3 llm = Llama.from_pretrained(
4 repo_id="leafspark/Llama-3.2-11B-Vision-Instruct-GGUF",
5 filename="Llama-3.2-11B-Vision-Instruct.Q8_0.gguf",
6 )
8 llm.create_chat_completion(
9 messages = [
10 {
(...)
14 ]
15 )
File ~/.local/lib/python3.10/site-packages/llama_cpp/llama.py:2354, in Llama.from_pretrained(cls, repo_id, filename, additional_files, local_dir, local_dir_use_symlinks, cache_dir, **kwargs)
2351 model_path = os.path.join(local_dir, filename)
2353 # loading the first file of a sharded GGUF loads all remaining shard files in the subfolder
-> 2354 return cls(
2355 model_path=model_path,
2356 **kwargs,
2357 )
File ~/.local/lib/python3.10/site-packages/llama_cpp/llama.py:369, in Llama.init(self, model_path, n_gpu_layers, split_mode, main_gpu, tensor_split, rpc_servers, vocab_only, use_mmap, use_mlock, kv_overrides, seed, n_ctx, n_batch, n_ubatch, n_threads, n_threads_batch, rope_scaling_type, pooling_type, rope_freq_base, rope_freq_scale, yarn_ext_factor, yarn_attn_factor, yarn_beta_fast, yarn_beta_slow, yarn_orig_ctx, logits_all, embedding, offload_kqv, flash_attn, last_n_tokens_size, lora_base, lora_scale, lora_path, numa, chat_format, chat_handler, draft_model, tokenizer, type_k, type_v, spm_infill, verbose, **kwargs)
364 if not os.path.exists(model_path):
365 raise ValueError(f"Model path does not exist: {model_path}")
367 self._model = self.stack.enter_context(
368 contextlib.closing(
--> 369 internals.LlamaModel(
370 path_model=self.model_path,
371 params=self.model_params,
372 verbose=self.verbose,
373 )
374 )
375 )
377 # Override tokenizer
378 self.tokenizer = tokenizer or LlamaTokenizer(self)
File ~/.local/lib/python3.10/site-packages/llama_cpp/_internals.py:56, in LlamaModel.init(self, path_model, params, verbose)
51 model = llama_cpp.llama_load_model_from_file(
52 self.path_model.encode("utf-8"), self.params
53 )
55 if model is None:
---> 56 raise ValueError(f"Failed to load model from file: {path_model}")
58 self.model = model
60 def free_model():
ValueError: Failed to load model from file: /home/haoze/.cache/huggingface/hub/models--leafspark--Llama-3.2-11B-Vision-Instruct-GGUF/snapshots/b40710bc55c0137565d23c06f37352082b17937d/./Llama-3.2-11B-Vision-Instruct.Q8_0.gguf
This needs to be loaded via Ollama: https://huggingface.co/leafspark/Llama-3.2-11B-Vision-Instruct-GGUF/discussions/2
Hi,
I get the same error as above when trying to load with llama-cpp-python. There is no way to make this work with llama-cpp? Llama-cpp is faster than Ollama.
Thx!
Hi
@songhieng
,
Not with llama-cpp, it seems the needed modification are not happening for the moment.
Will use ollama for the vision part.