Advice on running llama-server with Q2_K_L quant
My primary high-memory workstation is a 44-core, 256 GB machine, but I get malloc/OOM errors when trying to start llama-server:
./llama-server -m /models/DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_L/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf -c 0 --cache-type-k q5_0
ggml_aligned_malloc: insufficient memory (attempted to allocate 473360.00 MB)
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 496353935360
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
common_init_from_params: failed to create context with model '/media/vmajor/AI2/models/DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_L/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf'
srv load_model: failed to load model, '/models/DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_L/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf'
main: exiting due to model loading error
Does anyone have any advice on how to overcome this? I can of course increase swap, but that will cause a significant performance regression. I could try offloading a few layers to the GPU, but this system only has an RTX 3060 with 12 GB VRAM, so that will not help much.
Do you know if this is an old llama.cpp version?
Another possibility: maybe you disabled mmapping? See https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md#no-memory-mapping
I think llama.cpp has an environment variable to turn it on or off, but I forget where.
I'm confused about why your machine is running out of memory. It has 256 GB, which is enough to preload everything, so it should work fine.
-c 0 tells llama-server to take the context size from the model, so it tries to allocate memory for the maximum context size. Experiment with reasonable values like -c 8192
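To see why the context size matters so much here, a rough back-of-the-envelope estimate of the KV cache can help. The sketch below is only an approximation under stated assumptions: the DeepSeek-V3 hyperparameters (61 layers, 128 KV heads, K head size 192, V head size 128, 163840-token maximum context) are not given in this thread and are assumed, as is a plain per-head K/V cache layout. The K cache is q5_0 as in the command above (22 bytes per block of 32 values), and the V cache is assumed to stay at the default f16. Under those assumptions the result happens to match the failed allocation in the log.

```python
# Rough KV cache size estimate. All model numbers below are ASSUMED
# DeepSeek-V3 hyperparameters; they are not stated in the thread.
N_LAYER = 61          # assumed number of transformer layers
N_HEAD_KV = 128       # assumed number of KV heads
K_HEAD_DIM = 192      # assumed per-head K size
V_HEAD_DIM = 128      # assumed per-head V size

Q5_0_BLOCK_ELEMS = 32  # q5_0 stores values in blocks of 32
Q5_0_BLOCK_BYTES = 22  # 2-byte scale + 4 bytes of high bits + 16 bytes of low nibbles
F16_BYTES = 2          # V cache assumed to stay at the default f16


def kv_cache_bytes(n_ctx: int) -> int:
    """Estimate KV cache size in bytes for a given context length."""
    k_elems = N_HEAD_KV * K_HEAD_DIM   # K values per token per layer
    v_elems = N_HEAD_KV * V_HEAD_DIM   # V values per token per layer
    k_bytes = k_elems // Q5_0_BLOCK_ELEMS * Q5_0_BLOCK_BYTES
    v_bytes = v_elems * F16_BYTES
    return n_ctx * N_LAYER * (k_bytes + v_bytes)


for n_ctx in (163840, 8192):  # assumed maximum context vs. the suggested -c 8192
    size = kv_cache_bytes(n_ctx)
    print(f"n_ctx={n_ctx}: {size} bytes ({size / 2**20:.2f} MiB)")
# n_ctx=163840: 496353935360 bytes (473360.00 MiB)
# n_ctx=8192: 24817696768 bytes (23668.00 MiB)
```

With -c 8192 the estimate drops to about 23 GiB, which, unlike the full-context cache, does not by itself exceed 256 GB of RAM.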
Thank you for this... and of course, that has a direct implication for the KV cache that was causing the OOM. I really just did not think that far during my quick test... it works now.