---
base_model: google/gemma-2-2b-it
library_name: transformers
license: gemma
pipeline_tag: text-generation
tags:
- conversational
- llama-cpp
- gguf-my-repo
extra_gated_heading: Access Gemma on Hugging Face
extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and
agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging
Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
---
<img src='https://github.com/fabiomatricardi/Gemma2-2b-it-chatbot/raw/main/images/gemma2-2b-myGGUF.png' width=900>
<br><br><br>
# FM-1976/gemma-2-2b-it-Q5_K_M-GGUF
This model was converted to GGUF format from [`google/gemma-2-2b-it`](https://huggingface.co/google/gemma-2-2b-it) using llama.cpp via the ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.
Refer to the [original model card](https://huggingface.co/google/gemma-2-2b-it) for more details on the model.
## Description
Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are text-to-text, decoder-only large language models, available in English, with open weights for both pre-trained variants and instruction-tuned variants. Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop or your own cloud infrastructure, democratizing access to state of the art AI models and helping foster innovation for everyone.
## Model Details
- Context window: 8192 tokens
- System messages are not supported: the chat template raises an exception if a `system` role is passed (see the workaround sketch after the metadata dump below).
```text
architecture str = gemma2
type str = model
name str = Gemma 2 2b It
finetune str = it
basename str = gemma-2
size_label str = 2B
license str = gemma
count u32 = 1
model.0.name str = Gemma 2 2b
organization str = Google
format = GGUF V3 (latest)
arch = gemma2
vocab type = SPM
n_vocab = 256000
n_merges = 0
vocab_only = 0
n_ctx_train = 8192
n_embd = 2304
n_layer = 26
n_head = 8
n_head_kv = 4
model type = 2B
model ftype = Q5_K - Medium
model params = 2.61 B
model size = 1.79 GiB (5.87 BPW)
general.name = Gemma 2 2b It
BOS token = 2 '<bos>'
EOS token = 1 '<eos>'
UNK token = 3 '<unk>'
PAD token = 0 '<pad>'
LF token = 227 '<0x0A>'
EOT token = 107 '<end_of_turn>'
EOG token = 1 '<eos>'
EOG token = 107 '<end_of_turn>'
>>> System role not supported
Available chat formats from metadata: chat_template.default
Using gguf chat template: {{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '
' + message['content'] | trim + '<end_of_turn>
' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model
'}}{% endif %}
Using chat eos_token: <eos>
Using chat bos_token: <bos>
```
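Because the template rejects a `system` role, a common workaround is to fold system-style instructions into the first user message. A minimal sketch, assuming the GGUF file has already been downloaded locally (the instruction text is only an example):
```python
from llama_cpp import Llama

llm = Llama(model_path='gemma-2-2b-it-q5_k_m.gguf', n_ctx=8192, verbose=False)

# Hypothetical system-style instructions, prepended to the first user turn
system_instructions = 'You are a concise assistant. Always answer in one short paragraph.'
user_question = 'What is a GGUF file?'

messages = [
    {"role": "user", "content": f"{system_instructions}\n\n{user_question}"},
]
response = llm.create_chat_completion(messages=messages, max_tokens=300, stop=['<eos>'])
print(response['choices'][0]['message']['content'])
```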
### Prompt Format
```
<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
```
## Chat Template
The instruction-tuned models use a chat template that must be adhered to for conversational use. The easiest way to apply it is to pass a plain `messages` list to llama-cpp-python's `create_chat_completion()`, which reads the chat template embedded in the GGUF metadata:
```python
messages = [
{"role": "user", "content": "Write me a poem about Machine Learning."},
]
```
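If you prefer to render the template yourself (for example to feed the raw string to `create_completion()`), the `transformers` tokenizer of the base model can apply it. This is a sketch and assumes you have been granted access to the gated `google/gemma-2-2b-it` repository:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
messages = [
    {"role": "user", "content": "Write me a poem about Machine Learning."},
]
# add_generation_prompt appends the trailing '<start_of_turn>model' line,
# signalling that the model should produce the next turn
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```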
## Use with llama-cpp-python
Install the `llama-cpp-python` bindings with pip:
```bash
pip install llama-cpp-python
```
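A quick way to confirm the bindings are installed (the printed version will vary):
```python
# Sanity check: import the bindings and print their version
import llama_cpp
print(llama_cpp.__version__)
```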
### Download the GGUF file locally
```bash
wget https://huggingface.co/FM-1976/gemma-2-2b-it-Q5_K_M-GGUF/resolve/main/gemma-2-2b-it-q5_k_m.gguf -O gemma-2-2b-it-q5_k_m.gguf
```
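If `wget` is not available (for example on Windows), the `huggingface_hub` library can fetch the same file; a minimal sketch:
```python
from huggingface_hub import hf_hub_download

# Download the quantized file into the current directory
hf_hub_download(
    repo_id="FM-1976/gemma-2-2b-it-Q5_K_M-GGUF",
    filename="gemma-2-2b-it-q5_k_m.gguf",
    local_dir=".",
)
```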
### Open your Python REPL
#### Using chat_template
```python
from llama_cpp import Llama
nCTX = 8192
sTOPS = ['<eos>']
llm = Llama(
    model_path='gemma-2-2b-it-q5_k_m.gguf',
    n_ctx=nCTX,
    verbose=False,
)
# Sampling settings (temperature, repeat_penalty, stop, max_tokens) are passed per call
messages = [
    {"role": "user", "content": "Write me a poem about Machine Learning."},
]
response = llm.create_chat_completion(
    messages=messages,
    temperature=0.15,
    repeat_penalty=1.178,
    stop=sTOPS,
    max_tokens=500)
print(response['choices'][0]['message']['content'])
```
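`create_chat_completion()` is stateless, so multi-turn chat works by appending every exchange to the message list and resending it. A minimal sketch that reuses the `llm` object and `sTOPS` list defined above (the questions are only examples):
```python
history = []
for question in ["What is Machine Learning?", "Summarize that in one sentence."]:
    history.append({"role": "user", "content": question})
    reply = llm.create_chat_completion(
        messages=history,
        temperature=0.15,
        max_tokens=300,
        stop=sTOPS,
    )
    answer = reply['choices'][0]['message']['content']
    # Keep the assistant turn in the history so roles keep alternating
    history.append({"role": "assistant", "content": answer})
    print(f"USER: {question}\nMODEL: {answer}\n")
```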
#### Using create_completion
```python
from llama_cpp import Llama
nCTX = 8192
sTOPS = ['<eos>']
llm = Llama(
    model_path='gemma-2-2b-it-q5_k_m.gguf',
    n_ctx=nCTX,
    verbose=False,
)
prompt = 'Explain Science in one sentence.'
# Wrap the question in the Gemma prompt format; generation starts after '<start_of_turn>model'
template = f'''<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
'''
res = llm.create_completion(template, temperature=0.15, max_tokens=500, repeat_penalty=1.178, stop=['<eos>', '<end_of_turn>'])
print(res['choices'][0]['text'])
```
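The completion response also carries token counts in its OpenAI-style `usage` field, which is handy for a rough throughput estimate. A sketch that reuses the `llm` and `template` objects above (the timing code is illustrative, not a benchmark):
```python
import datetime

start = datetime.datetime.now()
res = llm.create_completion(template, temperature=0.15, max_tokens=500, stop=['<eos>', '<end_of_turn>'])
elapsed = (datetime.datetime.now() - start).total_seconds()
generated = res['usage']['completion_tokens']
print(res['choices'][0]['text'])
print(f"{generated} tokens in {elapsed:.2f} s -> {generated / elapsed:.1f} tokens/s")
```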
### Streaming text
llama-cpp-python also lets you stream text during inference.<br>
Tokens are decoded and printed as soon as they are generated, so you don't have to wait for the entire inference to finish.
<br><br>
You can use both `create_chat_completion()` and `create_completion()` methods.
<br>
#### Streaming with `create_chat_completion()` method
```python
import datetime
from llama_cpp import Llama
nCTX = 8192
sTOPS = ['<eos>']
llm = Llama(
    model_path='gemma-2-2b-it-q5_k_m.gguf',
    n_ctx=nCTX,
    verbose=False,
)
first_round = 0
full_response = ''
message = [{'role':'user','content':'what is science?'}]
start = datetime.datetime.now()
for chunk in llm.create_chat_completion(
        messages=message,
        temperature=0.15,
        repeat_penalty=1.31,
        stop=['<eos>'],
        max_tokens=500,
        stream=True,):
    # The first streamed chunk only carries the role, so 'content' may be missing
    delta = chunk["choices"][0]["delta"]
    if delta.get("content"):
        print(delta["content"], end="", flush=True)
        full_response += delta["content"]
        if first_round == 0:
            # Time from the request to the first decoded token
            ttftoken = datetime.datetime.now() - start
            first_round = 1
first_token_time = ttftoken.total_seconds()
print(f'\nTime to first token: {first_token_time:.2f} seconds')
```
#### Streaming with `create_completion()` method
```python
import datetime
from llama_cpp import Llama
nCTX = 8192
sTOPS = ['<eos>']
llm = Llama(
    model_path='gemma-2-2b-it-q5_k_m.gguf',
    n_ctx=nCTX,
    verbose=False,
)
first_round = 0
full_response = ''
prompt = 'Explain Science in one sentence.'
# Wrap the question in the Gemma prompt format; generation starts after '<start_of_turn>model'
template = f'''<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
'''
start = datetime.datetime.now()
for chunk in llm.create_completion(
        template,
        temperature=0.15,
        repeat_penalty=1.178,
        stop=['<eos>', '<end_of_turn>'],
        max_tokens=500,
        stream=True,):
    print(chunk["choices"][0]["text"], end="", flush=True)
    full_response += chunk["choices"][0]["text"]
    if first_round == 0:
        # Time from the request to the first decoded token
        ttftoken = datetime.datetime.now() - start
        first_round = 1
first_token_time = ttftoken.total_seconds()
print(f'\nTime to first token: {first_token_time:.2f} seconds')
```
### Further exploration
You can also serve the model behind an OpenAI-compatible API server.<br>
This can be done with either `llama-cpp-python[server]` or `llamafile`; a sketch of the first option follows.
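A minimal sketch using `llama-cpp-python[server]`, assuming the default host and port (`localhost:8000`):
```python
# Start the server first from a shell:
#   pip install 'llama-cpp-python[server]'
#   python -m llama_cpp.server --model gemma-2-2b-it-q5_k_m.gguf --n_ctx 8192
# Then query it with the standard OpenAI client:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="gemma-2-2b-it-q5_k_m.gguf",
    messages=[{"role": "user", "content": "Explain Science in one sentence."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```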