FM-1976 committed on
Commit c4bc77a · verified · 1 Parent(s): 22f5917

Update README.md

Files changed (1):
  1. README.md +172 -9
README.md CHANGED
@@ -25,6 +25,54 @@ Gemma is a family of lightweight, state-of-the-art open models from Google, buil
  ## Model Details
  context window = 8192
  SYSTEM MESSAGE NOT SUPPORTED

  ### Prompt Format
  ```python
@@ -55,7 +103,9 @@ wget https://huggingface.co/FM-1976/gemma-2-2b-it-Q5_K_M-GGUF/resolve/main/gemma

  ```

- Open your Python REPL
  ```python
  from llama_cpp import Llama
  nCTX = 8192
@@ -78,17 +128,130 @@ response = llm.create_chat_completion(
  repeat_penalty= 1.178,
  stop=sTOPS,
  max_tokens=500)
- print(response)
  ```

- ### CLI:
- ```bash
- llama-cli --hf-repo FM-1976/gemma-2-2b-it-Q5_K_M-GGUF --hf-file gemma-2-2b-it-q5_k_m.gguf -p "The meaning to life and the universe is"
  ```

- ### Server:
- ```bash
- llama-server --hf-repo FM-1976/gemma-2-2b-it-Q5_K_M-GGUF --hf-file gemma-2-2b-it-q5_k_m.gguf -c 2048
  ```

- add llama-cpp-server...

  ## Model Details
  context window = 8192
  SYSTEM MESSAGE NOT SUPPORTED
+ ```bash
+ llama_model_loader: - kv 0: general.architecture str = gemma2
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Gemma 2 2b It
+ llama_model_loader: - kv 3: general.finetune str = it
+ llama_model_loader: - kv 4: general.basename str = gemma-2
+ llama_model_loader: - kv 5: general.size_label str = 2B
+ llama_model_loader: - kv 6: general.license str = gemma
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Gemma 2 2b
+ llama_model_loader: - kv 9: general.base_model.0.organization str = Google
+ llm_load_print_meta: format = GGUF V3 (latest)
+ llm_load_print_meta: arch = gemma2
+ llm_load_print_meta: vocab type = SPM
+ llm_load_print_meta: n_vocab = 256000
+ llm_load_print_meta: n_merges = 0
+ llm_load_print_meta: vocab_only = 0
+ llm_load_print_meta: n_ctx_train = 8192
+ llm_load_print_meta: n_embd = 2304
+ llm_load_print_meta: n_layer = 26
+ llm_load_print_meta: n_head = 8
+ llm_load_print_meta: n_head_kv = 4
+ llm_load_print_meta: model type = 2B
+ llm_load_print_meta: model ftype = Q5_K - Medium
+ llm_load_print_meta: model params = 2.61 B
+ llm_load_print_meta: model size = 1.79 GiB (5.87 BPW)
+ llm_load_print_meta: general.name = Gemma 2 2b It
+ llm_load_print_meta: BOS token = 2 '<bos>'
+ llm_load_print_meta: EOS token = 1 '<eos>'
+ llm_load_print_meta: UNK token = 3 '<unk>'
+ llm_load_print_meta: PAD token = 0 '<pad>'
+ llm_load_print_meta: LF token = 227 '<0x0A>'
+ llm_load_print_meta: EOT token = 107 '<end_of_turn>'
+ llm_load_print_meta: EOG token = 1 '<eos>'
+ llm_load_print_meta: EOG token = 107 '<end_of_turn>'
+
+ >>> System role not supported
+ Available chat formats from metadata: chat_template.default
+ Using gguf chat template: {{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '
+ ' + message['content'] | trim + '<end_of_turn>
+ ' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model
+ '}}{% endif %}
+ Using chat eos_token: <eos>
+ Using chat bos_token: <bos>
+
+ ```
+
+
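Because the chat template above raises `System role not supported`, any system-style instruction has to travel inside the first user turn. Below is a minimal sketch of that workaround; the helper name and the example instruction are illustrative, not part of this repository.

```python
# Fold a leading "system" message into the first user turn,
# since the Gemma 2 chat template rejects the system role outright.
def merge_system_into_user(messages):
    if not messages or messages[0]['role'] != 'system':
        return messages
    system_text = messages[0]['content']
    merged = [dict(m) for m in messages[1:]]
    for m in merged:
        if m['role'] == 'user':
            m['content'] = f"{system_text}\n\n{m['content']}"
            break
    return merged

messages = [
    {'role': 'system', 'content': 'You are a concise assistant.'},
    {'role': 'user', 'content': 'What is science?'},
]
print(merge_system_into_user(messages))
```

The merged list can then be passed to `create_chat_completion()` like a normal user-only conversation.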
 
  ### Prompt Format
  ```python
 

  ```
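If you prefer Python over `wget`, the same GGUF file can also be fetched with `huggingface_hub` (a sketch; the file lands in the local Hugging Face cache and `hf_hub_download` returns its path):

```python
# Download the quantized model file and print the local path to pass to Llama().
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="FM-1976/gemma-2-2b-it-Q5_K_M-GGUF",
    filename="gemma-2-2b-it-q5_k_m.gguf",
)
print(model_path)
```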
 
+ ### Open your Python REPL
+
+ #### Using chat_template
  ```python
  from llama_cpp import Llama
  nCTX = 8192
 
  repeat_penalty= 1.178,
  stop=sTOPS,
  max_tokens=500)
+ print(response['choices'][0]['message']['content'])
  ```
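For reference, here is a self-contained sketch of the full `create_chat_completion()` call; the user message is illustrative and the sampling values mirror the ones shown above.

```python
from llama_cpp import Llama

nCTX = 8192
sTOPS = ['<eos>']
llm = Llama(
    model_path='gemma-2-2b-it-q5_k_m.gguf',
    n_ctx=nCTX,
    verbose=False,
)
# One user turn; the Gemma 2 chat template is applied automatically by llama-cpp-python.
response = llm.create_chat_completion(
    messages=[{'role': 'user', 'content': 'Explain Science in one sentence.'}],
    temperature=0.15,
    repeat_penalty=1.178,
    stop=sTOPS,
    max_tokens=500,
)
print(response['choices'][0]['message']['content'])
```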
 
+ #### Using create_completion
+ ```python
+ from llama_cpp import Llama
+ nCTX = 8192
+ sTOPS = ['<eos>']
+ llm = Llama(
+     model_path='gemma-2-2b-it-q5_k_m.gguf',
+     temperature=0.24,
+     n_ctx=nCTX,
+     max_tokens=600,
+     repeat_penalty=1.176,
+     stop=sTOPS,
+     verbose=False,
+ )
+ prompt = 'Explain Science in one sentence.'
+ # Gemma prompt format: open the model turn and let the model write the reply
+ template = f'''<bos><start_of_turn>user
+ {prompt}<end_of_turn>
+ <start_of_turn>model
+ '''
+ res = llm.create_completion(template, temperature=0.15, max_tokens=500, repeat_penalty=1.178, stop=['<eos>'])
+ print(res['choices'][0]['text'])
  ```

+
+ ### Streaming text
+ llama-cpp-python also lets you stream text during inference.<br>
+ Tokens are decoded and printed as soon as they are generated, so you don't have to wait until the entire inference is done.
+ <br><br>
+ You can use both the `create_chat_completion()` and `create_completion()` methods.
+ <br>
+
+ #### Streaming with `create_chat_completion()` method
+ ```python
+ import datetime
+ from llama_cpp import Llama
+ nCTX = 8192
+ sTOPS = ['<eos>']
+ llm = Llama(
+     model_path='gemma-2-2b-it-q5_k_m.gguf',
+     temperature=0.24,
+     n_ctx=nCTX,
+     max_tokens=600,
+     repeat_penalty=1.176,
+     stop=sTOPS,
+     verbose=False,
+ )
+ first_round = 0
+ full_response = ''
+ message = [{'role':'user','content':'what is science?'}]
+ start = datetime.datetime.now()
+ for chunk in llm.create_chat_completion(
+     messages=message,
+     temperature=0.15,
+     repeat_penalty=1.31,
+     stop=['<eos>'],
+     max_tokens=500,
+     stream=True,):
+     try:
+         if chunk["choices"][0]["delta"]["content"]:
+             if first_round == 0:
+                 # first decoded token: record the time to first token
+                 print(chunk["choices"][0]["delta"]["content"], end="", flush=True)
+                 full_response += chunk["choices"][0]["delta"]["content"]
+                 ttftoken = datetime.datetime.now() - start
+                 first_round = 1
+             else:
+                 print(chunk["choices"][0]["delta"]["content"], end="", flush=True)
+                 full_response += chunk["choices"][0]["delta"]["content"]
+     except KeyError:
+         # some chunks (e.g. the initial role delta) carry no "content" key
+         pass
+ first_token_time = ttftoken.total_seconds()
+ print(f'Time to first token: {first_token_time:.2f} seconds')
  ```

+ #### Streaming with `create_completion()` method
+
+ ```python
+ import datetime
+ from llama_cpp import Llama
+ nCTX = 8192
+ sTOPS = ['<eos>']
+ llm = Llama(
+     model_path='gemma-2-2b-it-q5_k_m.gguf',
+     temperature=0.24,
+     n_ctx=nCTX,
+     max_tokens=600,
+     repeat_penalty=1.176,
+     stop=sTOPS,
+     verbose=False,
+ )
+ first_round = 0
+ full_response = ''
+ prompt = 'Explain Science in one sentence.'
+ # Gemma prompt format: open the model turn and let the model write the reply
+ template = f'''<bos><start_of_turn>user
+ {prompt}<end_of_turn>
+ <start_of_turn>model
+ '''
+ start = datetime.datetime.now()
+ for chunk in llm.create_completion(
+     template,
+     temperature=0.15,
+     repeat_penalty=1.178,
+     stop=['<eos>'],
+     max_tokens=500,
+     stream=True,):
+     if first_round == 0:
+         # first decoded token: record the time to first token
+         print(chunk["choices"][0]["text"], end="", flush=True)
+         full_response += chunk["choices"][0]["text"]
+         ttftoken = datetime.datetime.now() - start
+         first_round = 1
+     else:
+         print(chunk["choices"][0]["text"], end="", flush=True)
+         full_response += chunk["choices"][0]["text"]
+
+ first_token_time = ttftoken.total_seconds()
+ print(f'Time to first token: {first_token_time:.2f} seconds')
+ ```
+
+ ### Further exploration
+ You can also serve the model behind an OpenAI-compatible API server.<br>
+ This can be done with both `llama-cpp-python[server]` and `llamafile`, as sketched below.
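For example, with `llama-cpp-python[server]` you can expose the model at an OpenAI-style endpoint and query it with the standard `openai` client. A sketch, assuming the server's default address `http://localhost:8000/v1`:

```python
# Server side (shell), assumed invocation:
#   python -m llama_cpp.server --model gemma-2-2b-it-q5_k_m.gguf --n_ctx 8192
# Client side: point the standard OpenAI client at the local endpoint.
from openai import OpenAI

client = OpenAI(base_url='http://localhost:8000/v1', api_key='not-needed')
response = client.chat.completions.create(
    model='gemma-2-2b-it-q5_k_m.gguf',  # model id is illustrative; use the id the server reports
    messages=[{'role': 'user', 'content': 'What is science?'}],
    temperature=0.15,
    max_tokens=500,
)
print(response.choices[0].message.content)
```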