---
base_model: google/gemma-2-2b-it
library_name: transformers
license: gemma
pipeline_tag: text-generation
tags:
- conversational
- llama-cpp
- gguf-my-repo
extra_gated_heading: Access Gemma on Hugging Face
extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and
  agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging
  Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
---

<img src='https://github.com/fabiomatricardi/Gemma2-2b-it-chatbot/raw/main/images/gemma2-2b-myGGUF.png' width=900>
<br><br><br>

# FM-1976/gemma-2-2b-it-Q5_K_M-GGUF
This model was converted to GGUF format from [`google/gemma-2-2b-it`](https://huggingface.co/google/gemma-2-2b-it) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.
Refer to the [original model card](https://huggingface.co/google/gemma-2-2b-it) for more details on the model.


## Description
Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are text-to-text, decoder-only large language models, available in English, with open weights for both pre-trained variants and instruction-tuned variants. Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop, or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.

## Model Details
- Context window: 8192 tokens
- System messages are **not** supported by the Gemma chat template
```text
architecture str = gemma2
        type str = model
        name str = Gemma 2 2b It
    finetune str = it
    basename str = gemma-2
  size_label str = 2B
     license str = gemma
       count u32 = 1
model.0.name str = Gemma 2 2b
organization str = Google
format           = GGUF V3 (latest)
arch             = gemma2
vocab type       = SPM
n_vocab          = 256000
n_merges         = 0
vocab_only       = 0
n_ctx_train      = 8192
n_embd           = 2304
n_layer          = 26
n_head           = 8
n_head_kv        = 4
model type       = 2B
model ftype      = Q5_K - Medium
model params     = 2.61 B
model size       = 1.79 GiB (5.87 BPW)
general.name     = Gemma 2 2b It
BOS token        = 2 '<bos>'
EOS token        = 1 '<eos>'
UNK token        = 3 '<unk>'
PAD token        = 0 '<pad>'
LF token         = 227 '<0x0A>'
EOT token        = 107 '<end_of_turn>'
EOG token        = 1 '<eos>'
EOG token        = 107 '<end_of_turn>'

>>> System role not supported
Available chat formats from metadata: chat_template.default
Using gguf chat template: {{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '
' + message['content'] | trim + '<end_of_turn>
' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model
'}}{% endif %}
Using chat eos_token: <eos>
Using chat bos_token: <bos>

```
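
Because the chat template raises an error for a `system` role (see the log above), a common workaround is to fold system-style instructions into the first user turn before calling the model. A minimal sketch; the helper `merge_system_into_user` is hypothetical, not part of llama-cpp-python:

```python
def merge_system_into_user(messages):
    """Fold a leading 'system' message into the first 'user' turn,
    since Gemma's chat template rejects the system role."""
    if messages and messages[0]["role"] == "system":
        system_text = messages[0]["content"]
        rest = messages[1:]
        if rest and rest[0]["role"] == "user":
            rest[0] = {
                "role": "user",
                "content": system_text + "\n\n" + rest[0]["content"],
            }
            return rest
        return [{"role": "user", "content": system_text}] + rest
    return messages

messages = merge_system_into_user([
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is science?"},
])
```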



### Prompt Format
```
<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
```

## Chat Template
The instruction-tuned model uses a chat template that must be followed for conversational use. The easiest way to apply it is with the tokenizer's built-in chat template: messages are passed as a list of role/content dictionaries, as in the snippet below.

```python
messages = [
    {"role": "user", "content": "Write me a poem about Machine Learning."},
]
```
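
If you work with the original Hugging Face tokenizer rather than the GGUF file, a minimal sketch of rendering those messages with `apply_chat_template` (this assumes `transformers` is installed and that you have accepted the Gemma license, since `google/gemma-2-2b-it` is gated):

```python
from transformers import AutoTokenizer

# requires access to the gated google/gemma-2-2b-it repository
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

messages = [
    {"role": "user", "content": "Write me a poem about Machine Learning."},
]

# Render the conversation into the Gemma prompt format shown above
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```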
## Use with llama-cpp-python
Install the `llama-cpp-python` bindings with pip (works on macOS, Linux and Windows):

```bash
pip install llama-cpp-python
```
### Download the GGUF file locally
```bash
wget https://huggingface.co/FM-1976/gemma-2-2b-it-Q5_K_M-GGUF/resolve/main/gemma-2-2b-it-q5_k_m.gguf -O gemma-2-2b-it-q5_k_m.gguf
```
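
Alternatively, a minimal sketch using the `huggingface_hub` client (assuming the package is installed) to download the same file:

```python
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# Download the quantized GGUF file into the current directory
model_path = hf_hub_download(
    repo_id="FM-1976/gemma-2-2b-it-Q5_K_M-GGUF",
    filename="gemma-2-2b-it-q5_k_m.gguf",
    local_dir=".",
)
print(model_path)
```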

### Open your Python REPL

#### Using chat_template
```python
from llama_cpp import Llama
nCTX = 8192
sTOPS = ['<eos>']
llm = Llama(
            model_path='gemma-2-2b-it-q5_k_m.gguf',
            temperature=0.24,
            n_ctx=nCTX,
            max_tokens=600,
            repeat_penalty=1.176,
            stop=sTOPS,
            verbose=False,
            )
messages = [
    {"role": "user", "content": "Write me a poem about Machine Learning."},
]
response = llm.create_chat_completion(
                messages=messages,
                temperature=0.15,
                repeat_penalty= 1.178,
                stop=sTOPS,
                max_tokens=500)
print(response['choices'][0]['message']['content'])
```

#### Using create_completion
```python
from llama_cpp import Llama
nCTX = 8192
sTOPS = ['<eos>']
llm = Llama(
            model_path='gemma-2-2b-it-q5_k_m.gguf',
            temperature=0.24,
            n_ctx=nCTX,
            max_tokens=600,
            repeat_penalty=1.176,
            stop=sTOPS,
            verbose=False,
            )
prompt = 'Explain Science in one sentence.'
# Wrap the raw prompt in the Gemma chat format before completion
template = f'''<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
'''
res = llm.create_completion(template, temperature=0.15, max_tokens=500, repeat_penalty=1.178, stop=['<eos>'])
print(res['choices'][0]['text'])
```


### Streaming text
llama-cpp-python also lets you stream text during inference.<br>
Tokens are decoded and printed as soon as they are generated, so you don't have to wait for the entire completion to finish.
<br><br>
Streaming works with both the `create_chat_completion()` and `create_completion()` methods.
<br>

#### Streaming with `create_chat_completion()` method
```python
import datetime
from llama_cpp import Llama
nCTX = 8192
sTOPS = ['<eos>']
llm = Llama(
            model_path='gemma-2-2b-it-q5_k_m.gguf',
            temperature=0.24,
            n_ctx=nCTX,
            max_tokens=600,
            repeat_penalty=1.176,
            stop=sTOPS,
            verbose=False,
            )
first_round = 0
full_response = ''
message = [{'role': 'user', 'content': 'what is science?'}]
start = datetime.datetime.now()
for chunk in llm.create_chat_completion(
    messages=message,
    temperature=0.15,
    repeat_penalty=1.31,
    stop=['<eos>'],
    max_tokens=500,
    stream=True,):
    try:
        if chunk["choices"][0]["delta"]["content"]:
            if first_round == 0:
                # first streamed token: record the time to first token
                print(chunk["choices"][0]["delta"]["content"], end="", flush=True)
                full_response += chunk["choices"][0]["delta"]["content"]
                ttftoken = datetime.datetime.now() - start
                first_round = 1
            else:
                print(chunk["choices"][0]["delta"]["content"], end="", flush=True)
                full_response += chunk["choices"][0]["delta"]["content"]
    except KeyError:
        # role-only and empty delta chunks carry no "content" key
        pass
first_token_time = ttftoken.total_seconds()
print(f'Time to first token: {first_token_time:.2f} seconds')
```

#### Streaming with `create_completion()` method

```python
import datetime
from llama_cpp import Llama
nCTX = 8192
sTOPS = ['<eos>']
llm = Llama(
            model_path='gemma-2-2b-it-q5_k_m.gguf',
            temperature=0.24,
            n_ctx=nCTX,
            max_tokens=600,
            repeat_penalty=1.176,
            stop=sTOPS,
            verbose=False,
            )
first_round = 0
full_response = ''
prompt = 'Explain Science in one sentence.'
# Wrap the raw prompt in the Gemma chat format before completion
template = f'''<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
'''
start = datetime.datetime.now()
for chunk in llm.create_completion(
    template,
    temperature=0.15,
    repeat_penalty=1.78,
    stop=['<eos>'],
    max_tokens=500,
    stream=True,):
    if first_round == 0:
        # first streamed token: record the time to first token
        print(chunk["choices"][0]["text"], end="", flush=True)
        full_response += chunk["choices"][0]["text"]
        ttftoken = datetime.datetime.now() - start
        first_round = 1
    else:
        print(chunk["choices"][0]["text"], end="", flush=True)
        full_response += chunk["choices"][0]["text"]

first_token_time = ttftoken.total_seconds()
print(f'Time to first token: {first_token_time:.2f} seconds')
```

### Further exploration
You can also serve the model behind an OpenAI-compatible API server.<br>
This can be done either with `llama-cpp-python[server]` or with `llamafile`.
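
As a sketch of the first option (flag names taken from llama-cpp-python's bundled server settings; defaults may vary between versions):

```bash
# install the server extra
pip install 'llama-cpp-python[server]'

# start an OpenAI-compatible endpoint (by default on http://localhost:8000/v1)
python -m llama_cpp.server \
  --model gemma-2-2b-it-q5_k_m.gguf \
  --n_ctx 8192
```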