Performance

#1
by urtuuuu - opened

Does it work fine, or does it need fixes? I'm asking because of all the discussions on GitHub (claims that it's worse than previous models).

I see that it works better after the corrections... but it's still probably not what it should be. In my tests, it performs chaotically (sometimes it solves puzzles that the previous L3 8b couldn't solve, and then it fails at something obvious that models generally don't have problems with).

It's weird... I just ran llama-server -v -ngl 0 -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -c 8192 . But this time I DIDN'T even choose the Llama 3 prompt style; I only changed the temperature to 0. And suddenly it started to answer hard questions correctly... It always gives me the correct answer to this question:
I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?

PS: The Snake game in Python works too :) It also follows instructions to modify the game!
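For reference, the apple count in the puzzle above can be checked with a small ledger. This is only a sketch: the function name is made up for illustration, and it assumes "share them equally with my 2 teammates" means a three-way split (me plus two teammates), which is the reading the correct answers in this thread use. The "where is the river" part is a riddle and is not modeled here.

```python
def solve_apple_puzzle() -> int:
    """Replay the puzzle sentence by sentence and return the final apple count."""
    apples = 10          # "I have 10 apples."
    coins = 0

    coins += 3           # find 3 gold coins in the river
    apples -= 4          # lose 4 apples ...
    coins += 1           # ... but gain a gold coin
    apples += 3 * 6      # three birds drop 6 apples each
    coins += 6 // 3      # assumption: 6 coins split among 3 players (me + 2 teammates)

    price_per_apple = 0.5
    apples += int(coins / price_per_apple)  # spend every coin on apples
    return apples


print(solve_apple_puzzle())  # 36
```

Any answer other than 36 apples means the model dropped or miscounted a step (for example, forgetting the extra coin or the coins found in the river).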

q8
command
build\bin\llama-cli.exe --model models/new3/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 8196 --interactive -ngl 99 --simple-io -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0 --chat-template llama3

NOT EVEN CLOSE

> I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river? /
You start with 10 apples.

You find 3 gold coins in a river, so you now have 3 gold coins and still 10 apples.

You lose 4 apples but gain a gold coin, so you now have 11 gold coins and 6 apples.

Three birds drop 6 apples each, which is 18 apples. You now have 24 apples and 11 gold coins.

You play an online game and win 6 gold coins, but you have to share them equally with your teammates. This means that you get 2 of the 6 gold coins (since there are two teammates). So you now have 13 gold coins and still 24 apples.

The price of an apple is 0.5 coins, so you can buy 26 apples with all the coins you have (13 * 2 = 26).

You start with 24 apples and buy 26 more, which means that you now have 50 apples.

As for where the river is, it's not specified in the problem.

Weird, even q4_k_m gets this one right for me

I also tested q4, q6, and q8; the answers are always bad. I used the built-in template and tried my own as well.

Here is q4_k_m:

\llama-cli.exe --model models/new3/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 8196 --interactive -ngl 99 --simple-io --in-prefix "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant." -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0
Log start
main: build = 3452 (96952e71)
main: built with MSVC 19.29.30154.0 for x64
main: seed  = 1721866321
llama_model_loader: loaded meta data with 33 key-value pairs and 291 tensors from models/new3/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 15
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - kv  29:                      quantize.imatrix.file str              = /models_out/Meta-Llama-3.1-8B-Instruc...
llama_model_loader: - kv  30:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  31:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  32:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.58 GiB (4.89 BPW)
llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   281.81 MiB
llm_load_tensors:      CUDA0 buffer size =  4403.49 MiB
........................................................................................
llama_new_context_with_model: n_ctx      = 8224
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1028.00 MiB
llama_new_context_with_model: KV self size  = 1028.00 MiB, K (f16):  514.00 MiB, V (f16):  514.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   562.07 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.07 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
main: in-suffix/prefix is specified, chat template will be disabled

system_info: n_threads = 30 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
llm_tokenizer_bpe::check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
main: interactive mode on.
Input prefix: '<|eot_id|><|start_header_id|>user<|end_header_id|>

'
Input suffix: '<|eot_id|><|start_header_id|>assistant<|end_header_id|>

'
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8224, n_batch = 2048, n_predict = -1, n_keep = 20


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - To return control to the AI, end your input with '\'.
 - To return control without starting a new line, end your input with '/'.


> I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river? /
What a delightful adventure you've had!

Let's break down the events step by step:

1. You start with 10 apples.
2. You find 3 gold coins in the river, so now you have 0 apples and 3 gold coins.
3. Then, you lose 4 apples, leaving you with 6 apples and 3 gold coins.
4. Three birds drop 6 apples each, adding a total of 18 apples to your collection. Now you have 24 apples and 3 gold coins.
5. You play an online game and win 6 gold coins, but you share them equally with your 2 teammates, so you get 2 gold coins (6 ÷ 3 = 2). Now you have 25 apples and 5 gold coins.
6. With the 5 gold coins, you buy apples at a price of 0.5 coins per apple. You can buy 10 apples with your 5 gold coins (5 ÷ 0.5 = 10).

Now, let's calculate how many apples you have:

You started with 24 apples and bought 10 more, so you now have:
24 + 10 = 34 apples

As for the river, it runs near a big city that has something to do with what you can spend the coins on. Since you used your gold coins to buy apples, I'm guessing the city is likely an orchard or a market where fruits are sold.

>

llama_print_timings:        load time =     871.47 ms
llama_print_timings:      sample time =     149.70 ms /   300 runs   (    0.50 ms per token,  2003.95 tokens per second)
llama_print_timings: prompt eval time =     980.85 ms /   155 tokens (    6.33 ms per token,   158.03 tokens per second)
llama_print_timings:        eval time =    2300.40 ms /   299 runs   (    7.69 ms per token,   129.98 tokens per second)
llama_print_timings:       total time =   12989.14 ms /   454 tokens
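Note that the log above contains a check_double_bos_eos warning: the -p string already starts with <|begin_of_text|>, and the tokenizer adds another BOS on its own, so the prompt begins with two BOS tokens. Whether that explains the wrong answer is not established in this thread, but it is easy to avoid. Below is a minimal sketch of building the Llama 3 style prompt without a literal BOS. The helper names are hypothetical; only the special-token strings are taken from the commands and logs in this thread.

```python
BOS = "<|begin_of_text|>"


def format_turn(role: str, content: str) -> str:
    """One chat turn in the header/eot layout used by the commands in this thread."""
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"


def build_prompt(system: str, user: str, include_bos: bool = False) -> str:
    """Build a single-turn prompt that ends with an open assistant header."""
    body = format_turn("system", system) + format_turn("user", user)
    body += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return BOS + body if include_bos else body


def count_leading_bos(prompt: str, tokenizer_adds_bos: bool) -> int:
    """Count BOS tokens at the start, assuming literal BOS text is parsed as a token."""
    count = 1 if tokenizer_adds_bos else 0
    while prompt.startswith(BOS):
        count += 1
        prompt = prompt[len(BOS):]
    return count


system = "You are a helpful, smart, kind, and efficient AI assistant."
with_bos = build_prompt(system, "Hello", include_bos=True)
without_bos = build_prompt(system, "Hello")
print(count_leading_bos(with_bos, tokenizer_adds_bos=True))     # 2, as in the warning
print(count_leading_bos(without_bos, tokenizer_adds_bos=True))  # 1
```

In other words, when the tokenizer adds BOS by itself (as this log says it does), the prompt text should start directly at the first <|start_header_id|>.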

q8

llama-cli.exe --model models/new3/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 8196 --interactive -ngl 99 --simple-io --in-prefix "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant." -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0
Log start
main: build = 3452 (96952e71)
main: built with MSVC 19.29.30154.0 for x64
main: seed  = 1721866739
llama_model_loader: loaded meta data with 33 key-value pairs and 291 tensors from models/new3/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 7
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - kv  29:                      quantize.imatrix.file str              = /models_out/Meta-Llama-3.1-8B-Instruc...
llama_model_loader: - kv  30:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  31:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  32:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   532.31 MiB
llm_load_tensors:      CUDA0 buffer size =  7605.33 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 8224
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1028.00 MiB
llama_new_context_with_model: KV self size  = 1028.00 MiB, K (f16):  514.00 MiB, V (f16):  514.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   562.07 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.07 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
main: in-suffix/prefix is specified, chat template will be disabled

system_info: n_threads = 30 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
llm_tokenizer_bpe::check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
main: interactive mode on.
Input prefix: '<|eot_id|><|start_header_id|>user<|end_header_id|>

'
Input suffix: '<|eot_id|><|start_header_id|>assistant<|end_header_id|>

'
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8224, n_batch = 2048, n_predict = -1, n_keep = 20


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - To return control to the AI, end your input with '\'.
 - To return control without starting a new line, end your input with '/'.


> I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river? /
What a delightful series of events!

Let's break it down step by step:

1. You start with 10 apples.
2. You find 3 gold coins in the river, but this doesn't affect your apple count directly.
3. You lose 4 apples, so you now have: 10 - 4 = 6 apples.
4. The birds drop 6 apples each, and there are three of them, so that's a total of 18 new apples! Now you have: 6 + 18 = 24 apples.
5. You win 6 gold coins online, but you share them equally with your 2 teammates, which means you get 2 gold coins (6 ÷ 3 = 2). You now have a total of 3 gold coins (3 from the river + 2 from the game).
6. With all your coins, you buy apples at a price of 0.5 coins per apple. Since you have 3 gold coins, you can buy: 3 ÷ 0.5 = 6 apples.

Now, let's add these new apples to your total count: 24 (from the birds) + 6 (bought with coins) = 30 apples.

As for where the river is, I'm afraid that information isn't provided in the problem statement. The city near the river could be anywhere!
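A side note on the two logs above: both commands pass --ctx-size 8196, yet the logs report n_ctx = 8224 and a KV cache of 1028 MiB. The sketch below reproduces those numbers from the values printed in the log (n_layer = 32, n_embd_k_gqa = n_embd_v_gqa = 1024, f16 cache). The rounding to a multiple of 32 is an assumption inferred from the log output, not something confirmed in this thread, and the function names are made up for illustration.

```python
def pad_to_multiple(n: int, multiple: int = 32) -> int:
    """Round n up to the next multiple (assumed padding rule, inferred from the log)."""
    return ((n + multiple - 1) // multiple) * multiple


def kv_cache_mib(n_ctx: int, n_layer: int, n_embd_kv: int, bytes_per_elem: int = 2) -> float:
    """Size of K plus V caches in MiB; bytes_per_elem = 2 corresponds to f16."""
    one_cache_bytes = n_ctx * n_layer * n_embd_kv * bytes_per_elem  # K alone (V is equal)
    return 2 * one_cache_bytes / (1024 * 1024)


n_ctx = pad_to_multiple(8196)
print(n_ctx)                           # 8224
print(kv_cache_mib(n_ctx, 32, 1024))   # 1028.0 (514 MiB for K + 514 MiB for V)
```

Requesting 8192 (already a multiple of 32) would give exactly 1024 MiB under the same assumptions, so 8196 looks like a typo for 8192, though it should not affect answer quality.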

this is what I get with Q4_K_M:

system

You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.user

I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?assistant

Let's break down the changes in your apple and gold coin counts step by step:

1. You start with 10 apples and find 3 gold coins.
2. You lose 4 apples, so you have 10 - 4 = 6 apples left.
3. You gain a gold coin, so you have 3 + 1 = 4 gold coins.
4. Three birds each drop 6 apples, so you get 3 x 6 = 18 apples. Now you have 6 + 18 = 24 apples.
5. You win 6 gold coins, but you have to share them equally with your 2 teammates. That means you get 6 / 3 = 2 gold coins. Now you have 4 + 2 = 6 gold coins.
6. You spend all your gold coins on apples. The price of an apple is 0.5 coins, so you can buy 6 / 0.5 = 12 apples.

So, after buying apples with all your gold coins, you have 24 (initial apples) + 12 (bought apples) = 36 apples.

As for the location of the river, it runs near a big city. Unfortunately, I don't have any information about a specific city that you mentioned, so I couldn't pinpoint the exact location of the river. [end of text]

Weird, even q4_k_m gets this one right for me

I have already noticed that weirdness in several models. Often, smaller quantized models perform better than q8, for example. I have never been able to explain that behavior. 😮

For me, q4_k_m behaves even worse.

Why do you have "system" at the beginning and "[end of text]" at the end?

I just used ./llama-cli:

./llama-cli -m /models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -p "<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -c 500 -ngl 50

build\bin\llama-cli.exe --model models/new3/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 8196 --interactive -ngl 99 --simple-io -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0 --chat-template llama3

NOT EVEN CLOSE

Did you read my comment? Probably not. I said I did NOT choose the llama3 template. I ran llama-server -v -ngl 0 -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -c 8192. Maybe you should try the llama-server interface in the browser.
Everything is at default settings. At temperature 0, I always get the correct answer. Compared with LMSYS Chatbot Arena, it seems to perform equally well...
This also worked for me:
llama-cli -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --temp 0 -p "<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -c 500 -ngl 50 -c 8192 --conversation

I tested it myself with the built-in template and my own template using the commands above.

Here it is tested with the suggested command:

llama-cli.exe --model models/new3/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -p "<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -c 500 -ngl 50 -c 8192 --conversation
Log start
main: build = 3452 (96952e71)
main: built with MSVC 19.29.30154.0 for x64
main: seed  = 1721894706
llama_model_loader: loaded meta data with 33 key-value pairs and 291 tensors from models/new3/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 7
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - kv  29:                      quantize.imatrix.file str              = /models_out/Meta-Llama-3.1-8B-Instruc...
llama_model_loader: - kv  30:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  31:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  32:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   532.31 MiB
llm_load_tensors:      CUDA0 buffer size =  7605.33 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
main: chat template example: <|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>

How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>



system_info: n_threads = 16 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
main: interactive mode on.
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 1


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

system

system

You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.user

I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?assistant


>


Let's break down the events step by step:

1. You start with 10 apples.
2. You find 3 gold coins in the river.
3. You lose 4 apples, but gain a gold coin.
4. Three birds drop 6 apples each, so you gain 3 x 6 = 18 apples.
5. You play an online game and win 6 gold coins, but you have to share them equally with your 2 teammates, so you gain 3 gold coins (6 / 2).
6. Now, you have a total of:
   3 (from the river) + 3 (gained in step 3) + 3 (from the game) = 9 gold coins.
7. You buy apples for all the coins you have. The price of an apple is 0.5 coins, so you can buy:
   9 / 0.5 = 18 apples.

Now, let's calculate the total number of apples you have:

You started with 10 apples, then gained 18 apples from the birds, and finally bought 18 apples. So, the total number of apples is:
10 + 18 + 18 = 46 apples.

As for the river, the problem doesn't provide any specific information about its location. You only know that it runs near a big city, but there's no mention of the city's name or any other details that could help you identify its location.
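For comparison with the output above, the puzzle's arithmetic can be written out as a short Python sketch. It assumes that "share them equally with my 2 teammates" means a three-way split (me plus two teammates), which is the reading that leads to 36, the answer treated as correct in this thread. `solve_apple_puzzle` is just an illustrative name.

```python
from fractions import Fraction


def solve_apple_puzzle() -> int:
    """Walk through the puzzle step by step and return the final apple count."""
    apples = 10
    coins = Fraction(0)

    coins += 3               # find 3 gold coins in the river
    apples -= 4              # lose 4 apples ...
    coins += 1               # ... but gain a gold coin
    apples += 3 * 6          # three birds drop 6 apples each
    coins += Fraction(6, 3)  # 6 coins split three ways (assumed reading)

    price = Fraction(1, 2)   # 0.5 coins per apple
    bought = coins / price   # spend every coin on apples
    return int(apples + bought)


if __name__ == "__main__":
    print(solve_apple_puzzle())  # 36
```

Against this, the run above never subtracts the 4 lost apples and counts 9 coins instead of 6, which is how it ends up at 46.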

Results are similar to mine, no big difference.

Correct answers in around 3-4 of 10 attempts.
Using groq.com: 10/10.

Did you try at least 10 times in a row?
What are the results then?
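To make "N correct out of 10" comparable between setups, the counting can be scripted. This is only a sketch: `generate` stands for whatever hypothetical callable runs the model once and returns its text, and a reply counts as correct if it mentions "36 apples" anywhere. That check is crude and can misjudge a reply that mentions 36 apples only in an intermediate step.

```python
import re
from typing import Callable


def count_correct(generate: Callable[[], str], attempts: int = 10,
                  expected: int = 36) -> int:
    """Call generate() `attempts` times and count replies with the expected total."""
    pattern = re.compile(rf"\b{expected}\s+apples\b")
    return sum(1 for _ in range(attempts) if pattern.search(generate()))


if __name__ == "__main__":
    # Stand-in replies instead of a real model call.
    replies = iter([
        "You now have 36 apples.",
        "10 + 18 + 18 = 46 apples.",
        "So you have 36 apples in total.",
    ])
    print(count_correct(lambda: next(replies), attempts=3))  # 2
```

At temperature 0 all attempts should come out the same, so repeated trials mostly matter for runs with the default sampling settings.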

LM-Studio @ Q5_K_M - Temp 0

Let's break down the events step by step:

1. You start with 10 apples.
2. You find 3 gold coins in the river, but this doesn't affect your apple count.
3. You lose 4 apples, so you now have 6 apples.
4. The birds drop 6 apples each, which means they dropped a total of 18 apples. Now you have 24 apples (6 + 18).
5. You play the online game and win 12 gold coins (6 x 2 teammates). This doesn't affect your apple count directly.
6. With all the coins you have now (3 from the river, 1 gained after losing 4 apples, and 12 won in the game), you have a total of 16 gold coins.
7. You buy apples for all the coins you have. Since each apple costs 0.5 coins, you can buy 32 apples (16 x 2).

So, you now have 56 apples (24 + 32). As for where the river is, it's near a big city that has something to do with what you can spend the coins on. Unfortunately, this information isn't specific enough to pinpoint an exact location. However, based on your ability to play online games and buy apples, I'm going to take a guess that the city might be somewhere in the world where internet access is widespread and there are markets or stores where you can buy apples. If you'd like to provide more context or clarify what you meant by "something to do with what you can spend the coins on," I'd be happy to try and help further!

Thanks Bartowski, this updated build of L3.1 8b performed notably better for me than the others.

However, if anybody is experiencing performance issues, I recommend adjusting the system prompt. I'm using "You're a helpful assistant.", which notably increased its test score over no system prompt (my usual setting during testing).

My theory is that, since L3.1 uses CoT fine-tuning, which negatively impacts multi-shot evaluation scores, not setting the system prompt makes L3.1 behave more like a base model than an instruct model (e.g. it doesn't follow instructions nearly as accurately, so it can be led by the pattern set by the multi-shot examples).

Nope... those GGUFs are obsolete.
We have new, corrected ones.

@mirek190 Thanks. I'll try out the corrected ones. Is there a gguf you recommend?

@mirek190 Thanks for the link. Did anyone verify that the perplexity wasn't negatively impacted by the rope changes?

https://huggingface.co/qwp4w3hyb/Meta-Llama-3.1-8B-Instruct-iMat-GGUF/tree/main

Edit: I noticed you wrote

"perplexity even with --repeat-penalty 1.0 is still 78.00
before rope changes and gguf update was 80.50"

I will wait for this PR https://github.com/ggerganov/llama.cpp/pull/8676 to be finalized before regenerating. As of now, any GGUFs made with that PR may match the final implementation, but they also may not.

The current work doesn't seem fully done yet. It is far better now, but it still makes more errors than the version on groq.com.

"The current work doesn't seem fully done yet. It is far better now, but it still makes more errors than the version on groq.com."

Seeing the same. It hallucinates less in response to the same prompts on LMsys.

I tested it myself with the built-in template and my own template using the commands above.

Here it is tested with the suggested command:

llama-cli.exe --model models/new3/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -p "<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -c 500 -ngl 50 -c 8192 --conversation

UPDATE: WHERE is --temp 0? You did not copy my settings.
Very strange! With these settings I don't just get the right answer sometimes, I get the right answer (36) 100% of the time. All the other hard questions I use are usually answered correctly too. Maybe I have to tell you all the steps I'm doing :) Like downloading the latest llama.cpp release (llama-b3463-bin-win-vulkan-x64.zip) and using the Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf that I downloaded from here. What else? I don't know, it just works.

\llama-b3463-bin-win-vulkan-x64>llama-cli -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --temp 0 -p "<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -c 500 -ngl 3 -c 8192 --conversation --multiline-input --color
Log start
main: build = 3463 (4226a8d1)
main: built with MSVC 19.29.30154.0 for x64
main: seed = 1721931751
llama_model_loader: loaded meta data with 33 key-value pairs and 291 tensors from Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: general.license str = llama3.1
llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 9: llama.block_count u32 = 32
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: general.file_type u32 = 15
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - kv 29: quantize.imatrix.file str = /models_out/Meta-Llama-3.1-8B-Instruc...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 224
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 125
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.58 GiB (4.89 BPW)
llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon(TM) Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 3 repeating layers to GPU
llm_load_tensors: offloaded 3/33 layers to GPU
llm_load_tensors: AMD Radeon(TM) Graphics buffer size = 397.50 MiB
llm_load_tensors: CPU buffer size = 4685.30 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: AMD Radeon(TM) Graphics KV buffer size = 96.00 MiB
llama_kv_cache_init: Vulkan_Host KV buffer size = 928.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: Vulkan_Host output buffer size = 0.49 MiB
llama_new_context_with_model: AMD Radeon(TM) Graphics compute buffer size = 669.48 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 24.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 323
main: chat template example: <|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>

How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • To return control to the AI, end your input with '\'.
  • To return control without starting a new line, end your input with '/'.

system

system

You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.user

assistant

I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?
Let's break down the events step by step:

  1. You start with 10 apples.
  2. You find 3 gold coins in the river.
  3. You lose 4 apples, but gain 1 gold coin. So, you now have 6 apples and 4 gold coins.
  4. Three birds drop 6 apples each, so you gain 18 apples. You now have 24 apples and 4 gold coins.
  5. You play an online game and win 6 gold coins, but you have to share them equally with your 2 teammates. This means you get 2 gold coins. You now have 24 apples and 6 gold coins.
  6. You buy apples for all the coins you have. Since the price of an apple is 0.5 coins, you can buy 12 apples with 6 gold coins (6 x 0.5 = 3, but you have 6 gold coins, so you can buy 12 apples). You now have 36 apples.

As for the river, the problem doesn't specify its location, so it could be anywhere.

I tested it myself with the built-in template and my own template using the commands above.

Here it is tested with the suggested command:

llama-cli.exe --model models/new3/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -p "<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -c 500 -ngl 50 -c 8192 --conversation

UPDATE: WHERE is --temp 0? You did not copy my settings.
Very strange! With these settings I don't just get the right answer sometimes, I get the right answer (36) 100% of the time. All the other hard questions I use are usually answered correctly too. Maybe I have to tell you all the steps I'm doing :) Like downloading the latest llama.cpp release (llama-b3463-bin-win-vulkan-x64.zip) and using the Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf that I downloaded from here. What else? I don't know, it just works.

\llama-b3463-bin-win-vulkan-x64>llama-cli -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --temp 0 -p "<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -c 500 -ngl 3 -c 8192 --conversation --multiline-input --color
Log start
main: build = 3463 (4226a8d1)
main: built with MSVC 19.29.30154.0 for x64
main: seed = 1721931751
llama_model_loader: loaded meta data with 33 key-value pairs and 291 tensors from Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: general.license str = llama3.1
llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 9: llama.block_count u32 = 32
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: general.file_type u32 = 15
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - kv 29: quantize.imatrix.file str = /models_out/Meta-Llama-3.1-8B-Instruc...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 224
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 125
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.58 GiB (4.89 BPW)
llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon(TM) Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 3 repeating layers to GPU
llm_load_tensors: offloaded 3/33 layers to GPU
llm_load_tensors: AMD Radeon(TM) Graphics buffer size = 397.50 MiB
llm_load_tensors: CPU buffer size = 4685.30 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: AMD Radeon(TM) Graphics KV buffer size = 96.00 MiB
llama_kv_cache_init: Vulkan_Host KV buffer size = 928.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: Vulkan_Host output buffer size = 0.49 MiB
llama_new_context_with_model: AMD Radeon(TM) Graphics compute buffer size = 669.48 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 24.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 323
main: chat template example: <|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>

How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • To return control to the AI, end your input with '\'.
  • To return control without starting a new line, end your input with '/'.

system

system

You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.user

assistant

I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?
Let's break down the events step by step:

  1. You start with 10 apples.
  2. You find 3 gold coins in the river.
  3. You lose 4 apples, but gain 1 gold coin. So, you now have 6 apples and 4 gold coins.
  4. Three birds drop 6 apples each, so you gain 18 apples. You now have 24 apples and 4 gold coins.
  5. You play an online game and win 6 gold coins, but you have to share them equally with your 2 teammates. This means you get 2 gold coins. You now have 24 apples and 6 gold coins.
  6. You buy apples for all the coins you have. Since the price of an apple is 0.5 coins, you can buy 12 apples with 6 gold coins (6 x 0.5 = 3, but you have 6 gold coins, so you can buy 12 apples). You now have 36 apples.

As for the river, the problem doesn't specify its location, so it could be anywhere.
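For anyone who wants to check the apple count by hand, here is a minimal sketch of the arithmetic. It assumes "share them equally with my 2 teammates" means a three-way split (me plus 2 teammates); the variable names are just for illustration. Under that reading the total is 36, which matches the model's final number even though its parenthetical in step 6 is muddled.

```python
from fractions import Fraction

apples = 10
coins = Fraction(0)

coins += 3                 # find 3 gold coins in the river
apples -= 4                # lose 4 apples
coins += 1                 # gain a gold coin
apples += 3 * 6            # three birds drop 6 apples each
coins += Fraction(6, 3)    # assumption: 6 coins split three ways (me + 2 teammates)

price = Fraction(1, 2)     # 0.5 coins per apple
bought = coins / price     # apples bought with all the coins
total = apples + bought

print(coins, bought, total)  # 6 12 36
```

The code says nothing about where the river is; that part of the question is a riddle, not arithmetic.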

That comment you are referring to is outdated.

model q8
Second, after the rope update and a new GGUF, answers at temp 0 are correct and the same as on groq.com, BUT at temp 0.6 the answers are a bit worse than what I get on groq.com, where the answers to

"If my BMI is 20.5 and my height is 172cm, how much would I weigh if I gained 5% of my current weight?"

"I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?"

"Making one candle requires 125 grams of wax and 1 wick. How many candles can I make with 500 grams of wax and 3 wicks? Be concise."

are ALWAYS correct, 10/10 (groq.com has no temp 0),
but locally only around 8-9 out of 10 are correct.
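A rough illustration of why temp 0 and temp 0.6 can behave differently: at temp 0 the top-scoring token is always picked, while at temp 0.6 lower-scoring tokens still get drawn some of the time. This is a generic sketch with made-up logits, not llama.cpp's actual sampler (which, per the log above, also applies top_k, top_p, min_p and so on before temperature).

```python
import math
import random

def sample(logits, temperature, rng):
    """Pick a token index: greedy at temperature 0, softmax sampling otherwise."""
    if temperature <= 0:
        # Greedy decoding: always the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.0]   # hypothetical scores; index 0 is the "right" token
rng = random.Random(0)

greedy = [sample(logits, 0.0, rng) for _ in range(10)]
warm = [sample(logits, 0.6, rng) for _ in range(1000)]

print(greedy)                        # the same index every time
print(warm.count(0) / len(warm))     # fraction of draws that picked the top token
```

With these toy numbers the top token wins most draws at 0.6 but not all of them, so an occasional wrong turn at one step is expected. Whether that fully explains a local-vs-hosted gap is a separate question.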

"Making one candle requires 125 grams of wax and 1 wick. How many candles can I make with 500 grams of wax and 3 wicks? Be concise."

Meta-Llama-3.1-8B-Instruct-Q4_K_M says "The limiting factor is the wax, so you can make 3 candles."
GPT-4o says "You can make 4 candles with 500 grams of wax and 3 wicks." (OK, maybe not always, but it sometimes gives a wrong answer. If the best model sometimes fails, why are we even judging llama 8b :))
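For reference, a quick check of the candle question. The names are only for illustration. The answer is 3, and the limit comes from the wicks, not the wax, so the Llama reply above gets the number right but names the wrong limiting factor, while 4 ignores the wicks entirely.

```python
wax_grams, wicks = 500, 3
wax_per_candle, wicks_per_candle = 125, 1

by_wax = wax_grams // wax_per_candle    # candles the wax alone allows
by_wicks = wicks // wicks_per_candle    # candles the wicks alone allow
candles = min(by_wax, by_wicks)         # the scarcer resource decides

print(by_wax, by_wicks, candles)  # 4 3 3
```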

But on groq.com, llama 3 8b always answers that question correctly ;)

Is llama 3.1 8b working for everyone after the llama.cpp rope fix?

I'm noticing a major issue. Namely, all llama 3.1 8b ggufs regularly ignore very obvious prompt directives.

This issue doesn't appear to be native to llama 3.1 8b, since it doesn't occur on LMsys despite numerous attempts, but it regularly occurs with all Llama 3.1 8b GGUFs (pre and post rope fix), regardless of uploader or quantization (even Q8_0), at temp 0 or higher.

Anyway, I use scripts to combine text segments into unique learning prompts. Each prompt asks the model to define a related term first (to help randomize the output) and to end with an interesting related fact. About half the time it doesn't define the related term first, and periodically it doesn't end with an interesting related fact.

Example (didn't start with a related term): "Define denigrate and gi specialist, but only after first defining another related college-level term. Then end by sharing an interesting fact."

Example (didn't end with interesting fact): "Define chromosphere and feasibility study, but only after first defining another related college-level term. Then end by sharing an interesting related fact."

Does anybody have any idea what may be causing this? The other 8b class LLMs, including Mistral, Qwen2, Internlm, and Gemma 2 almost always get this right. The prompts are so short and simple, plus "but only after first defining..." and "Then end by sharing..." seem like very obvious directives.

I'm starting to think there's a compatibility issue between llama.cpp and llama 3.1 (and to a lesser degree llama 3), since other LLMs, including Qwen2 and Gemma 2, appear to behave comparably when using GGUFs vs providers, such as LMsys.

Plus others are reporting significant quality differences in Llama 3 8b Instruct implementations hosted by various providers, such as the example linked below, which could be because they're using quantized versions.

https://x.com/ankrgyl/status/1818036782543634439

Are you using the llama.cpp CLI?
If yes, show the command.
Which template do you have?

For instance, I am using this command:

llama-cli.exe --model models/new3/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf --color --threads 30 --keep -1 --n-predict -1 --ctx-size 100000 --interactive -ngl 99 --simple-io -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0.6 --chat-template llama3

@mirek190 Thanks, that was very helpful!

I was using GPT4All and Koboldcpp, but when I tried the cli with your recommended settings I only got one error after 8 runs (actually 16 since each was run at temp 0 and 0.6), which is in line with what's expected. I'll have to figure out how to make GPT4ALL work as well, but at least now I know it's not the GGUF.

In addition, this fixed another issue (information retrieval). For example, "Which two actresses played the two ex-wives of Alan Harper in Two and a Half Men?". It always got it wrong before, but after two tries with your settings (temp 0.6) using the cli it got it right.

first wrong

  1. Constance Zimmer (as Missi)
  2. Melanie Lynskey (as Rose)

then right

  1. Judith Melnick (played by Marin Hinkle)
  2. Kandi (played by April Bowlby)

Even when using LMsys or the cli with your settings, Llama 3.1 8b Instruct regularly goes on a correcting and second-guessing rant when asked about weakly held info. The following was the LMsys response to the aforementioned prompt. I wonder if this is caused by CoT fine-tuning.

"...The two actresses who played the ex-wives of Alan Harper are:

Susan Sarandon's character was not the ex-wife of Alan Harper.
Megan Fox was not an ex-wife of Alan Harper.
Judy (Maggie), played by Melanie Lynskey, was not one of the ex-wives of Alan Harper
Charlotte, played by Emmanuelle Vaugier, was not one of the ex-wives of Alan Harper
Kandi, played by Sarah Chalke, was not one of the ex-wives of Alan Harper
Walden's mother, played by Constance Zimmer, was not the ex-wife of Alan Harper

However, the two actresses who played the ex-wives of Alan Harper are:

Berta's nemesis, Rose, played by Melanie Lynskey was not an ex-wife
Walden's mother was not an ex-wife
Judy was Walden's mom
Kandi was not an ex-wife
The two women who were Alan's ex-wives are Judith and Kandi.

The two actresses who played the ex-wives of Alan Harper are:

Judith, played by Marin Hinkle
Kandi, played by Judy

However, another ex-wife of Alan Harper is Jacqueline who was played by Judy Greer"

You're welcome .

I still don't understand... You just showed your llama-cli command with "--chat-template llama3" at the end, but I'm using this prompt template by bartowski:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024
{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>
{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
And when I ask the same questions with your parameters, it doesn't give correct answers... Everything is the same, except that instead of "--chat-template llama3" I have -p "<|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\n\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
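To compare a hand-written -p string against what the template is supposed to produce, it can help to render the prompt in a few lines of Python. This is only a sketch: the layout (each header followed by a blank line, each turn closed with <|eot_id|>) is copied from the "chat template example" llama-cli prints in the log earlier in this thread, and build_llama3_prompt is a hypothetical helper name. <|begin_of_text|> is left out on the assumption that the runtime adds the BOS token itself; if yours does not, prepend it.

```python
def build_llama3_prompt(system_prompt: str, user_prompt: str) -> str:
    """Render a single-turn prompt in the Llama 3 header/eot layout."""
    return (
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_prompt}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

system = "Cutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a helpful assistant"
prompt = build_llama3_prompt(system, "How many candles can I make?")

# repr() makes the newlines visible so the string can be diffed against -p
print(repr(prompt))
```

Printing the repr makes it easy to spot a missing newline, a doubled <|eot_id|>, or an empty user slot, which are the usual differences between a manual prompt and the built-in template.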

I don't understand your question...

I'm asking the model my usual reasoning questions, and with your settings it fails very often.

Give examples...
