llama.cpp inference too slow?
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX 6000 Ada Generation, compute capability 8.9, VMM: yes
build: 4397 (a813badb) with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
I ran it with the params -m QVQ-72B-Preview-Q4_K_S.gguf --mmproj mmproj-QVQ-72B-Preview-f16.gguf.
The GPU VRAM is not being filled and the token speed is very slow, and I don't know why.
Below is the log:
clip_model_load: CLIP using CPU backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 0
clip_model_load: minicpmv_projector: 0
clip_model_load: model size: 1334.96 MB
clip_model_load: metadata size: 0.18 MB
clip_model_load: params backend buffer size = 1334.96 MB (521 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 198.93 MB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 80
llama_kv_cache_init: CPU KV buffer size = 1280.00 MiB
llama_new_context_with_model: KV self size = 1280.00 MiB, K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1287.53 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 24.01 MiB
llama_new_context_with_model: graph nodes = 2806
llama_new_context_with_model: graph splits = 1124 (with bs=512), 1 (with bs=1)
You should offload layers to the GPU with -ngl X, where X is the number of layers to offload. I'm not sure how many you can fit at Q4_K_S, maybe all of them. Start with -ngl 80 (the log shows n_layer = 80), and if that crashes from running out of VRAM, lower the number by 5 until it works. See the example command below.
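For example, a sketch reusing the params from your post (replace <your-llama.cpp-binary> with whatever executable you are currently running, and keep your other options as they are):

<your-llama.cpp-binary> -m QVQ-72B-Preview-Q4_K_S.gguf --mmproj mmproj-QVQ-72B-Preview-f16.gguf -ngl 80

-ngl (--n-gpu-layers) controls how many model layers are placed on the GPU. The log lines "llama_kv_cache_init: CPU KV buffer size = 1280.00 MiB" and "CLIP using CPU backend" suggest that right now the model weights and KV cache are sitting in system RAM, which is why generation is so slow despite the CUDA device being detected.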