Issue running the model on Ollama.
I am trying to run the granite-3b-code-instruct-GGUF model via Ollama, and I get an error during execution.
Error: llama runner process has terminated: signal: abort trap error:done_getting_tensors: wrong number of tensors; expected 514, got 418
In server.log, I can also see a weird line that appears to contain invalid characters:
tokenizer.ggml.merges arr[str,48891] = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
Any ideas? Could it be linked to how the model was generated in GGUF format?
Thanks
server.log content:
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBR9B03F/VMtH3VWyPUFB62BLM4TflaZi/IeFPFb9Lpt
2024/05/31 12:11:36 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-05-31T12:11:36.215+02:00 level=INFO source=images.go:729 msg="total blobs: 0"
time=2024-05-31T12:11:36.216+02:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0"
time=2024-05-31T12:11:36.218+02:00 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.1.39)"
time=2024-05-31T12:11:36.220+02:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/var/folders/s2/7qnwtxp15mngkms4lmj0v0qc0000gn/T/ollama2163464816/runners
time=2024-05-31T12:11:36.316+02:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
time=2024-05-31T12:11:36.316+02:00 level=INFO source=types.go:71 msg="inference compute" id="" library=cpu compute="" driver=0.0 name="" total="32.0 GiB" available="0 B"
[GIN] 2024/05/31 - 12:12:03 | 200 | 746.844µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/05/31 - 12:12:14 | 201 | 7.380824655s | 127.0.0.1 | POST "/api/blobs/sha256:5bd783ab3925f425f17764fd34c1f7119fb64a023ccf9dd48654c3c3f252a8ff"
[GIN] 2024/05/31 - 12:12:22 | 200 | 7.957783124s | 127.0.0.1 | POST "/api/create"
[GIN] 2024/05/31 - 12:12:43 | 200 | 38.846µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/05/31 - 12:12:43 | 200 | 939.012µs | 127.0.0.1 | POST "/api/show"
[GIN] 2024/05/31 - 12:12:43 | 200 | 378.844µs | 127.0.0.1 | POST "/api/show"
time=2024-05-31T12:12:44.277+02:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=0 memory.available="0 B" memory.required.full="2.8 GiB" memory.required.partial="268.9 MiB" memory.required.kv="640.0 MiB" memory.weights.total="2.0 GiB" memory.weights.repeating="1.9 GiB" memory.weights.nonrepeating="98.4 MiB" memory.graph.full="152.0 MiB" memory.graph.partial="204.4 MiB"
time=2024-05-31T12:12:44.282+02:00 level=INFO source=server.go:338 msg="starting llama server" cmd="/var/folders/s2/7qnwtxp15mngkms4lmj0v0qc0000gn/T/ollama2163464816/runners/cpu_avx2/ollama_llama_server --model /Users/vperrin/.ollama/models/blobs/sha256-5bd783ab3925f425f17764fd34c1f7119fb64a023ccf9dd48654c3c3f252a8ff --ctx-size 2048 --batch-size 512 --embedding --log-disable --parallel 1 --port 53541"
time=2024-05-31T12:12:44.295+02:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-05-31T12:12:44.295+02:00 level=INFO source=server.go:526 msg="waiting for llama runner to start responding"
time=2024-05-31T12:12:44.296+02:00 level=INFO source=server.go:564 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=2986 commit="74f33adf" tid="0x7ff84b8c3100" timestamp=1717150364
INFO [main] system info | n_threads=4 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="0x7ff84b8c3100" timestamp=1717150364 total_threads=8
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="7" port="53541" tid="0x7ff84b8c3100" timestamp=1717150364
time=2024-05-31T12:12:44.799+02:00 level=INFO source=server.go:564 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 26 key-value pairs and 514 tensors from /Users/vperrin/.ollama/models/blobs/sha256-5bd783ab3925f425f17764fd34c1f7119fb64a023ccf9dd48654c3c3f252a8ff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = granite-3b-code-instruct
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 2048
llama_model_loader: - kv 4: llama.embedding_length u32 = 2560
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 10240
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: llama.vocab_size u32 = 49152
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 80
llama_model_loader: - kv 13: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 14: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = refact
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,49152] = ["<|endoftext|>", "", "<f...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,49152] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,48891] = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 0
llama_model_loader: - kv 22: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 24: tokenizer.chat_template str = {% for message in messages %}\n{% if m...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 289 tensors
llama_model_loader: - type q4_K: 192 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 19/49152 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 49152
llm_load_print_meta: n_merges = 48891
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2560
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 80
llm_load_print_meta: n_embd_head_k = 80
llm_load_print_meta: n_embd_head_v = 80
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2560
llm_load_print_meta: n_embd_v_gqa = 2560
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 10240
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 3.48 B
llm_load_print_meta: model size = 1.98 GiB (4.89 BPW)
llm_load_print_meta: general.name = granite-3b-code-instruct
llm_load_print_meta: BOS token = 0 '<|endoftext|>'
llm_load_print_meta: EOS token = 0 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 '<|endoftext|>'
llm_load_print_meta: PAD token = 0 '<|endoftext|>'
llm_load_print_meta: LF token = 145 'Ä'
llm_load_print_meta: EOT token = 0 '<|endoftext|>'
llm_load_tensors: ggml ctx size = 0.23 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 514, got 418
llama_load_model_from_file: exception loading model
libc++abi: terminating due to uncaught exception of type std::runtime_error: done_getting_tensors: wrong number of tensors; expected 514, got 418
time=2024-05-31T12:12:45.049+02:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: abort trap error:done_getting_tensors: wrong number of tensors; expected 514, got 418"
[GIN] 2024/05/31 - 12:12:45 | 500 | 1.327839226s | 127.0.0.1 | POST "/api/chat"
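The numbers in the log can be cross-checked with a short script. This is only a sketch: the function name `tensor_summary` is made up for illustration, and the excerpt contains just the four relevant lines copied from the log above. It sums the per-type tensor counts and compares them with the "expected" and "got" values from the error message.

```python
import re

# Four lines copied from the server.log above; everything else is omitted.
LOG_EXCERPT = """\
llama_model_loader: - type f32: 289 tensors
llama_model_loader: - type q4_K: 192 tensors
llama_model_loader: - type q6_K: 33 tensors
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 514, got 418
"""


def tensor_summary(log_text):
    """Hypothetical helper: collect tensor counts from a llama.cpp load log."""
    # Per-type counts, e.g. "- type q4_K: 192 tensors" -> {"q4_K": 192}
    per_type = {
        name: int(count)
        for name, count in re.findall(r"- type\s+(\w+):\s+(\d+) tensors", log_text)
    }
    # Counts reported by the loader error.
    match = re.search(r"expected (\d+), got (\d+)", log_text)
    if match is None:
        raise ValueError("no tensor count error found in log text")
    expected, got = (int(value) for value in match.groups())
    return {
        "per_type": per_type,
        "listed": sum(per_type.values()),
        "expected": expected,
        "got": got,
        "missing": expected - got,
    }


summary = tensor_summary(LOG_EXCERPT)
print(summary)
```

For this log, the per-type counts add up to the 514 tensors the file declares, while the loader only accepted 418, so 96 tensors are unaccounted for. The script does not say why they were rejected.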
The 3B and 8B (instruct) models are not yet supported in Ollama. You have to wait until the next release or try this Llamafile I created: https://huggingface.co/sroecker/granite-3b-code-instruct-llamafile/tree/main
Just make it executable (chmod +x) and run it.
Granite-3b on Ollama => https://ollama.com/library/granite-code:3b
Yeah, currently this model only works with llama.cpp.
It's not working with LM Studio or Ollama.
Maybe it will be fixed when they update to a new llama.cpp release? (not sure)
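On the question from the original post about the GGUF file itself: the tensor count a file declares can be read straight from its header, without any runner. The sketch below assumes the GGUF v2/v3 little-endian header layout (4-byte magic `GGUF`, `uint32` version, `uint64` tensor count, `uint64` metadata key-value count). The function name `read_gguf_header` and the path in the usage comment are placeholders, not part of any tool.

```python
import struct


def read_gguf_header(path):
    """Hypothetical helper: read the fixed-size GGUF header fields.

    Assumes the GGUF v2/v3 little-endian layout:
    magic (4 bytes), version (uint32), tensor_count (uint64),
    metadata_kv_count (uint64).
    """
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24:
        raise ValueError("file too short to contain a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack("<4sIQQ", raw)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Usage (placeholder path, replace with the real .gguf file):
# print(read_gguf_header("path/to/model.gguf"))
```

If the header reports the same values as the log (version 3, 514 tensors, 26 key-value pairs), the file is at least self-consistent, which would fit the replies saying the runner does not yet support the model rather than the file being corrupted. This only checks the declared count and does not validate the tensors themselves.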