I loaded DeepSeek-V3-Q5_K_M on my 10-year-old Tesla M40 (Dell C4130)

#8
by gng2info - opened

I loaded DeepSeek-V3-Q5_K_M on my 10-year-old Tesla M40 (Dell C4130) and got the result below. Unfortunately, I do not understand all of the output, and I did not tweak any settings. I believe a Dell R740 would be a bit faster, as its processor and RAM are newer and faster.

XXXXXXXXXXXXXXX:/llamaccp/llama.cpp/build/bin$ ./llama-cli -m /home/username/mymodels/DeepSeek-V3-GGUF/DeepSeek-V3-Q5_K_M/DeepSeek-V3-Q5_K_M-00001-of-00010.gguf -p "I believe the meaning of life is" -n 128
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
Device 0: Tesla M40 24GB, compute capability 5.2, VMM: yes
Device 1: Tesla M40 24GB, compute capability 5.2, VMM: yes
Device 2: Tesla M40 24GB, compute capability 5.2, VMM: yes
Device 3: Tesla M40 24GB, compute capability 5.2, VMM: yes
build: 4455 (1204f972) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file: using device CUDA0 (Tesla M40 24GB) - 22827 MiB free
llama_model_load_from_file: using device CUDA1 (Tesla M40 24GB) - 22827 MiB free
llama_model_load_from_file: using device CUDA2 (Tesla M40 24GB) - 22827 MiB free
llama_model_load_from_file: using device CUDA3 (Tesla M40 24GB) - 22827 MiB free
llama_model_loader: additional 9 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 46 key-value pairs and 1025 tensors from /home/garfieldd/mymodels/DeepSeek-V3-GGUF/DeepSeek-V3-Q5_K_M/DeepSeek-V3-Q5_K_M-00001-of-00010.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek V3 BF16
llama_model_loader: - kv 3: general.size_label str = 256x20B
llama_model_loader: - kv 4: deepseek2.block_count u32 = 61
llama_model_loader: - kv 5: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 6: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 7: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 8: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 9: deepseek2.attention.head_count_kv u32 = 128
llama_model_loader: - kv 10: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 12: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 13: general.file_type u32 = 17
llama_model_loader: - kv 14: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 15: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 16: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 17: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 18: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 19: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 20: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 21: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 22: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 23: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 24: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 25: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 26: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 27: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 28: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 29: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 30: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 31: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 32: tokenizer.ggml.pre str = deepseek-v3
llama_model_loader: - kv 33: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<▒...
llama_model_loader: - kv 34: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 35: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv 36: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 37: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 38: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 39: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 40: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 41: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: split.no u16 = 0
llama_model_loader: - kv 44: split.count u16 = 10
llama_model_loader: - kv 45: split.tensors.count i32 = 1025
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q5_K: 606 tensors
llama_model_loader: - type q6_K: 58 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = deepseek2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 129280
llm_load_print_meta: n_merges = 127741
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 163840
llm_load_print_meta: n_embd = 7168
llm_load_print_meta: n_layer = 61
llm_load_print_meta: n_head = 128
llm_load_print_meta: n_head_kv = 128
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 192
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 24576
llm_load_print_meta: n_embd_v_gqa = 16384
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18432
llm_load_print_meta: n_expert = 256
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = yarn
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 671B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 671.03 B
llm_load_print_meta: model size = 442.74 GiB (5.67 BPW)
llm_load_print_meta: general.name = DeepSeek V3 BF16
llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: EOT token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token = 131 'Ä'
llm_load_print_meta: FIM PRE token = 128801 '<|fim▁begin|>'
llm_load_print_meta: FIM SUF token = 128800 '<|fim▁hole|>'
llm_load_print_meta: FIM MID token = 128802 '<|fim▁end|>'
llm_load_print_meta: EOG token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead = 3
llm_load_print_meta: n_lora_q = 1536
llm_load_print_meta: n_lora_kv = 512
llm_load_print_meta: n_ff_exp = 2048
llm_load_print_meta: n_expert_shared = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm = 1
llm_load_print_meta: expert_gating_func = sigmoid
llm_load_print_meta: rope_yarn_log_mul = 0.1000
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/62 layers to GPU
llm_load_tensors: CPU_Mapped model buffer size = 47272.10 MiB
llm_load_tensors: CPU_Mapped model buffer size = 46259.40 MiB
llm_load_tensors: CPU_Mapped model buffer size = 46259.40 MiB
llm_load_tensors: CPU_Mapped model buffer size = 46259.40 MiB
llm_load_tensors: CPU_Mapped model buffer size = 46259.40 MiB
llm_load_tensors: CPU_Mapped model buffer size = 46259.40 MiB
llm_load_tensors: CPU_Mapped model buffer size = 46259.40 MiB
llm_load_tensors: CPU_Mapped model buffer size = 46259.40 MiB
llm_load_tensors: CPU_Mapped model buffer size = 47683.33 MiB
llm_load_tensors: CPU_Mapped model buffer size = 34597.17 MiB
....................................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init: CPU KV buffer size = 19520.00 MiB
llama_new_context_with_model: KV self size = 19520.00 MiB, K (f16): 11712.00 MiB, V (f16): 7808.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 5478.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 88.01 MiB
llama_new_context_with_model: graph nodes = 5025
llama_new_context_with_model: graph splits = 1148 (with bs=512), 1 (with bs=1)
common_init_from_params: KV cache shifting is not supported for this model, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 36

system_info: n_threads = 36 (n_threads_batch = 36) / 72 | CUDA : ARCHS = 520,610,700,750 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

sampler seed: 4015541455
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 128, n_keep = 1

I believe the meaning of life is to create meaning.

What does this mean?

It means that the meaning of life is not something that is given to us, but something that we create for ourselves. It is up to each individual to decide what is important to them and to pursue those things with passion and purpose.

There are many different ways to create meaning in life. Some people find meaning in their relationships with others, while others find meaning in their work or their hobbies. Some people find meaning in their spiritual beliefs, while others find meaning in their commitment to making the world a better place.

The important thing is to find what is meaningful to you and to live your life

llama_perf_sampler_print: sampling time = 16.54 ms / 136 runs ( 0.12 ms per token, 8223.98 tokens per second)
llama_perf_context_print: load time = 54731.27 ms
llama_perf_context_print: prompt eval time = 2669.95 ms / 8 tokens ( 333.74 ms per token, 3.00 tokens per second)
llama_perf_context_print: eval time = 63980.47 ms / 127 runs ( 503.78 ms per token, 1.98 tokens per second)
llama_perf_context_print: total time = 66731.19 ms / 135 tokens

I increased my CPU threads and got a small jump in T/s. Is there a setting I can use to have the GPU do some of this work?

llama_perf_sampler_print: sampling time = 21.67 ms / 136 runs ( 0.16 ms per token, 6276.25 tokens per second)
llama_perf_context_print: load time = 32998.98 ms
llama_perf_context_print: prompt eval time = 2104.28 ms / 8 tokens ( 263.04 ms per token, 3.80 tokens per second)
llama_perf_context_print: eval time = 60578.38 ms / 127 runs ( 477.00 ms per token, 2.10 tokens per second)
llama_perf_context_print: total time = 62775.08 ms / 135 tokens
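
To put a number on the "small jump", here is a minimal sketch that recomputes tokens per second from the two `eval time` lines above and compares them. The regular expression and the helper name `tokens_per_second` are my own illustration, not part of llama.cpp; the two log lines are copied from the runs above.

```python
import re

# Matches llama_perf timing lines such as:
#   "... eval time = 63980.47 ms / 127 runs ( 503.78 ms per token, ...)"
PERF_RE = re.compile(
    r"(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) (?:tokens|runs)"
)

def tokens_per_second(line: str) -> float:
    """Return throughput in tokens/s for one llama_perf_context_print line."""
    m = PERF_RE.search(line)
    if m is None:
        raise ValueError(f"not a timing line: {line!r}")
    elapsed_ms, count = float(m.group(2)), int(m.group(3))
    return count / (elapsed_ms / 1000.0)

# Generation ("eval") lines from the first and second run.
before = ("llama_perf_context_print: eval time = 63980.47 ms / 127 runs "
          "( 503.78 ms per token, 1.98 tokens per second)")
after = ("llama_perf_context_print: eval time = 60578.38 ms / 127 runs "
         "( 477.00 ms per token, 2.10 tokens per second)")

tps_before = tokens_per_second(before)
tps_after = tokens_per_second(after)
speedup_pct = (tps_after / tps_before - 1.0) * 100.0

print(f"{tps_before:.2f} -> {tps_after:.2f} tokens/s ({speedup_pct:+.1f}%)")
```

The `eval time` line measures token generation, which is the number most people mean by T/s; `prompt eval time` covers processing the 8 prompt tokens only.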

Unsloth AI org

You can try offloading layers to the GPU, quantizing the K cache to Q4_0, and increasing the thread count:

For 1x 24GB GPU:

./llama.cpp/llama-cli \
    --model unsloth/DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_XS/DeepSeek-V3-Q2_K_XS-00001-of-00005.gguf \
    --cache-type-k q4_0 \
    --threads 32 \
    --prompt '<|User|>What is 1+1?<|Assistant|>' \
    --n-gpu-layers 5

For 4x 24GB GPUs, I think you can offload 14 to 20 layers (more testing is needed):

./llama.cpp/llama-cli \
    --model unsloth/DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_XS/DeepSeek-V3-Q2_K_XS-00001-of-00005.gguf \
    --cache-type-k q4_0 \
    --threads 32 \
    --prompt '<|User|>What is 1+1?<|Assistant|>' \
    --n-gpu-layers 14
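
As a rough illustration of what `--cache-type-k q4_0` saves, the sketch below recomputes the KV cache sizes from values printed in the log of the original post (`n_layer`, `n_ctx`, `n_embd_k_gqa`, `n_embd_v_gqa`). It assumes the cache is simply layers x context x width x bytes per element, and that q4_0 stores 32 values in an 18-byte block; the function name `cache_mib` is hypothetical, and the real allocation in llama.cpp may differ slightly.

```python
# Values copied from the log in the original post.
N_LAYER = 61
N_CTX = 4096
N_EMBD_K_GQA = 24576  # K width per token per layer
N_EMBD_V_GQA = 16384  # V width per token per layer

# Average bytes per stored element.
# f16: 2 bytes. q4_0: blocks of 32 values in 18 bytes
# (one f16 scale plus 32 four-bit values), an assumption about the block layout.
BYTES_PER_ELEM = {"f16": 2.0, "q4_0": 18 / 32}

def cache_mib(width: int, cache_type: str) -> float:
    """Estimated size in MiB of one half (K or V) of the cache."""
    return N_LAYER * N_CTX * width * BYTES_PER_ELEM[cache_type] / 2**20

k_f16 = cache_mib(N_EMBD_K_GQA, "f16")
v_f16 = cache_mib(N_EMBD_V_GQA, "f16")
k_q4 = cache_mib(N_EMBD_K_GQA, "q4_0")

print(f"K f16 : {k_f16:.0f} MiB")
print(f"V f16 : {v_f16:.0f} MiB")
print(f"K q4_0: {k_q4:.0f} MiB (saves {k_f16 - k_q4:.0f} MiB)")
```

With f16 for both halves the estimate reproduces the 11712 MiB and 7808 MiB reported in the log, which suggests the simple formula is adequate here.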

You can also use -nkvo to compute attention on the CPU and keep the KV cache in RAM, which frees VRAM for more layers on the GPU. This helps generation T/s for me.
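
For a first guess at `--n-gpu-layers` with the Q5_K_M file from the original post, here is a back-of-the-envelope sketch. It assumes every layer is the same size (not true for this model: the three leading dense layers, the embeddings, and the output layer differ) and it ignores compute buffers unless you pass a reserve, so treat the result as a starting point for testing rather than a limit. The helper name `layers_that_fit` and the `reserve_mib_per_gpu` parameter are hypothetical.

```python
import math

# Values copied from the log in the original post.
MODEL_SIZE_MIB = 442.74 * 1024  # "model size = 442.74 GiB"
N_LAYERS = 61                   # repeating layers
FREE_MIB_PER_GPU = 22827        # "22827 MiB free" on each Tesla M40
N_GPUS = 4

def layers_that_fit(reserve_mib_per_gpu: float = 0.0) -> int:
    """Estimate how many whole layers fit across all GPUs.

    Assumes uniform layer size; reserve_mib_per_gpu is VRAM held back
    per GPU for compute buffers and other overhead.
    """
    per_layer_mib = MODEL_SIZE_MIB / N_LAYERS
    usable_mib = FREE_MIB_PER_GPU - reserve_mib_per_gpu
    return N_GPUS * max(0, math.floor(usable_mib / per_layer_mib))

print("no reserve    :", layers_that_fit())
print("1 GiB reserve :", layers_that_fit(1024))
```

At roughly 7.3 GiB per layer, Q5_K_M leaves little slack on a 24GB card, which is why even a small reserve changes the estimate. The 14 to 20 layers suggested above are for the smaller Q2_K_XS quant.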
