Update README.md
README.md
CHANGED
@@ -127,13 +127,13 @@ Then you just need to run the TGI v2.2.0 (or higher) Docker container as follows

 ```bash
 docker run --gpus all --shm-size 1g -ti -p 8080:80 \
-
-
-
-
-
-
-
 ```

 > [!NOTE]
@@ -143,42 +143,39 @@ To send request to the deployed TGI endpoint compatible with [OpenAI specificati

 ```bash
 curl 0.0.0.0:8080/v1/chat/completions \
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
 ```

-Or via the `

 ```python
 import os
-from openai import OpenAI

-client = OpenAI(
-    base_url="http://0.0.0.0:8080/v1/",
-    api_key=os.getenv("HF_TOKEN"),
-)

 chat_completion = client.chat.completions.create(
-
-
-
-
-
-
 )
 ```

 ```bash
 docker run --gpus all --shm-size 1g -ti -p 8080:80 \
+    -e MODEL_ID=hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
+    -e NUM_SHARD=4 \
+    -e QUANTIZE=awq \
+    -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
+    -e MAX_INPUT_LENGTH=4000 \
+    -e MAX_TOTAL_TOKENS=4096 \
+    ghcr.io/huggingface/text-generation-inference:2.2.0
 ```

 > [!NOTE]

 ```bash
 curl 0.0.0.0:8080/v1/chat/completions \
+    -X POST \
+    -H 'Content-Type: application/json' \
+    -d '{
+        "model": "tgi",
+        "messages": [
+            {
+                "role": "system",
+                "content": "You are a helpful assistant."
+            },
+            {
+                "role": "user",
+                "content": "What is Deep Learning?"
+            }
+        ],
+        "max_tokens": 128
+    }'
 ```

+Or programmatically via the `huggingface_hub` Python client as follows (TGI is fully compatible with the OpenAI API, so the `openai` SDK can also be used):

 ```python
 import os
+from huggingface_hub import InferenceClient  # Instead of `from openai import OpenAI`

+client = InferenceClient(base_url="http://0.0.0.0:8080/v1", api_key=os.getenv("HF_TOKEN", "-"))  # Instead of `client = OpenAI(base_url=..., api_key=...)`

 chat_completion = client.chat.completions.create(
+    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",  # Instead of `model="tgi"`
+    messages=[
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "What is Deep Learning?"},
+    ],
+    max_tokens=128,
 )
 ```
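For readers who want to try the endpoint without installing `huggingface_hub` or `openai`, the curl request in this diff can be approximated with only the Python standard library. This is a minimal sketch, not part of the README change: the helper names `build_chat_payload`, `build_request`, and `send_chat` are hypothetical, and it assumes the TGI container is reachable at `http://0.0.0.0:8080` and returns an OpenAI-style response body.

```python
import json
import urllib.request

# Assumption: the TGI container from the diff is listening on this address.
TGI_URL = "http://0.0.0.0:8080/v1/chat/completions"


def build_chat_payload(user_prompt, system_prompt="You are a helpful assistant.", max_tokens=128):
    """Build the same JSON body that the curl example sends."""
    return {
        "model": "tgi",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
    }


def build_request(payload, url=TGI_URL):
    """Wrap the payload in a POST request with a JSON content type."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(payload, url=TGI_URL, timeout=60):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response with the text under
    choices[0].message.content.
    """
    with urllib.request.urlopen(build_request(payload, url), timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the container running, `send_chat(build_chat_payload("What is Deep Learning?"))` should return the generated reply as a string. Keeping payload construction separate from the network call makes the request body easy to inspect or log before anything is sent.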