yinsong1986 committed on
Commit
90fed38
·
1 Parent(s): 0ae4abd

Update README.md

  1. README.md +77 -63
README.md CHANGED
@@ -80,7 +80,10 @@ there were some limitations on its performance on longer context. Motivated by i
80
  - **Model License:** Apache 2.0
81
  - **Contact:** [GitHub issues](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/issues)
82
 
83
- ## How to Use MistralLite from Python Code ##
84
  ### Install the necessary packages
85
 
86
  Requires: [transformers](https://pypi.org/project/transformers/) 4.34.0 or later, [flash-attn](https://pypi.org/project/flash-attn/) 2.3.1.post1 or later,
@@ -128,7 +131,78 @@ for seq in sequences:
128
  <|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>
129
  ```
130
 
131
  ## How to Deploy MistralLite on Amazon SageMaker ##
132
  ### Install the necessary packages
133
 
134
  Requires: [sagemaker](https://pypi.org/project/sagemaker/) 2.192.1 or later.
@@ -231,72 +305,12 @@ result = call_endpoint(client, prompt, endpoint_name, parameters)
231
  print(result)
232
  ```
233
 
234
- ## How to Serve MistralLite on TGI ##
235
-
236
- ### Start TGI server ###
237
- Use TGI version 1.1.0 or later. The official Docker container is: `ghcr.io/huggingface/text-generation-inference:1.1.0`
238
-
239
- Example Docker parameters:
240
-
241
- ```shell
242
- docker run -d --gpus all --shm-size 1g -p 443:80 -v $(pwd)/models:/data ghcr.io/huggingface/text-generation-inference:1.1.0 \
243
- --model-id amazon/MistralLite \
244
- --max-input-length 16000 \
245
- --max-total-tokens 16384 \
246
- --max-batch-prefill-tokens 16384 \
247
- --trust-remote-code
248
- ```
249
-
250
- ### Perform Inference ###
251
- Example Python code for inference with TGI (requires `text_generation` 0.6.1 or later):
252
-
253
- ```shell
254
- pip install text_generation==0.6.1
255
- ```
256
-
257
- ```python
258
- from text_generation import Client
259
-
260
- SERVER_PORT = 443
261
- SERVER_HOST = "localhost"
262
- SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"
263
- tgi_client = Client(f"http://{SERVER_URL}", timeout=60)
264
-
265
- def invoke_tgi(prompt,
266
-                random_seed=1,
267
-                max_new_tokens=400,
268
-                print_stream=True,
269
-                assist_role=True):
270
-     if (assist_role):
271
-         prompt = f"<|prompter|>{prompt}</s><|assistant|>"
272
-     output = ""
273
-     for response in tgi_client.generate_stream(
274
-         prompt,
275
-         do_sample=False,
276
-         max_new_tokens=max_new_tokens,
277
-         return_full_text=False,
278
-         #temperature=None,
279
-         #truncate=None,
280
-         #seed=random_seed,
281
-         #typical_p=0.2,
282
-     ):
283
-         if hasattr(response, "token"):
284
-             if not response.token.special:
285
-                 snippet = response.token.text
286
-                 output += snippet
287
-                 if (print_stream):
288
-                     print(snippet, end='', flush=True)
289
-     return output
290
-
291
- prompt = "What are the main challenges to support a long context for LLM?"
292
- result = invoke_tgi(prompt)
293
- ```
294
-
295
- **Important** - When using MistralLite for inference for the first time, it may require a brief 'warm-up' period that can take 10s of seconds. However, subsequent inferences should be faster and return results in a more timely manner. This warm-up period is normal and should not affect the overall performance of the system once the initialisation period has been completed.
296
 
297
  ## How to Serve MistralLite on vLLM ##
298
  Documentation on installing and using vLLM [can be found here](https://vllm.readthedocs.io/en/latest/).
299
 
300
  ### Using vLLM as a server ###
301
  When using vLLM as a server, pass the --model amazon/MistralLite parameter, for example:
302
  ```shell
 
80
  - **Model License:** Apache 2.0
81
  - **Contact:** [GitHub issues](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/issues)
82
 
83
+ ## How to Use MistralLite from Python Code (HuggingFace transformers) ##
84
+
85
+ **Important** - For an end-to-end example Jupyter notebook, please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/huggingface-transformers/example_usage.ipynb).
86
+
87
  ### Install the necessary packages
88
 
89
  Requires: [transformers](https://pypi.org/project/transformers/) 4.34.0 or later, [flash-attn](https://pypi.org/project/flash-attn/) 2.3.1.post1 or later,
 
131
  <|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>
132
  ```
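The prompt format shown above can be produced with a small helper. This is a minimal sketch; the function name `build_prompt` is hypothetical and not part of the README or of any library:

```python
def build_prompt(question: str) -> str:
    """Wrap a user question in the prompter/assistant template shown above."""
    return f"<|prompter|>{question}</s><|assistant|>"


print(build_prompt("What are the main challenges to support a long context for LLM?"))
```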
133
 
134
+ ## How to Serve MistralLite on TGI ##
135
+ **Important:**
136
+ - For an end-to-end example Jupyter notebook using the native TGI container, please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/tgi/example_usage.ipynb).
137
+ - If the **input context length is greater than 12K tokens**, we recommend using a custom TGI container; please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/tgi-custom/example_usage.ipynb).
138
+
139
+ ### Start TGI server ###
140
+ Use TGI version 1.1.0 or later. The official Docker container is: `ghcr.io/huggingface/text-generation-inference:1.1.0`
141
+
142
+ Example Docker parameters:
143
+
144
+ ```shell
145
+ docker run -d --gpus all --shm-size 1g -p 443:80 -v $(pwd)/models:/data ghcr.io/huggingface/text-generation-inference:1.1.0 \
146
+ --model-id amazon/MistralLite \
147
+ --max-input-length 16000 \
148
+ --max-total-tokens 16384 \
149
+ --max-batch-prefill-tokens 16384 \
150
+ --trust-remote-code
151
+ ```
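The flags above imply a simple token budget: a request is rejected when the prompt tokens plus `max_new_tokens` exceed `--max-total-tokens`. The sketch below works through that arithmetic using the values from the example command. The helper name `max_new_tokens_allowed` is hypothetical and not part of TGI; it only mirrors the budget rule.

```python
# Values copied from the example docker command above.
MAX_INPUT_LENGTH = 16000
MAX_TOTAL_TOKENS = 16384


def max_new_tokens_allowed(prompt_tokens: int) -> int:
    """Return how many tokens can still be generated for a prompt of this size.

    Hypothetical helper for illustration; it mirrors the rule
    prompt_tokens + max_new_tokens <= MAX_TOTAL_TOKENS.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    if prompt_tokens > MAX_INPUT_LENGTH:
        raise ValueError("prompt exceeds --max-input-length")
    return MAX_TOTAL_TOKENS - prompt_tokens


print(max_new_tokens_allowed(16000))  # 384
print(max_new_tokens_allowed(1000))   # 15384
```

With a prompt at the full 16000-token input limit, only 384 new tokens fit, so a request asking for 400 new tokens would have to be reduced for prompts that long.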
152
+
153
+ ### Perform Inference ###
154
+ Example Python code for inference with TGI (requires `text_generation` 0.6.1 or later):
155
+
156
+ ```shell
157
+ pip install text_generation==0.6.1
158
+ ```
159
+
160
+ ```python
161
+ from text_generation import Client
162
+
163
+ SERVER_PORT = 443
164
+ SERVER_HOST = "localhost"
165
+ SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"
166
+ tgi_client = Client(f"http://{SERVER_URL}", timeout=60)
167
+
168
+ def invoke_tgi(prompt,
169
+                random_seed=1,
170
+                max_new_tokens=400,
171
+                print_stream=True,
172
+                assist_role=True):
173
+     if (assist_role):
174
+         prompt = f"<|prompter|>{prompt}</s><|assistant|>"
175
+     output = ""
176
+     for response in tgi_client.generate_stream(
177
+         prompt,
178
+         do_sample=False,
179
+         max_new_tokens=max_new_tokens,
180
+         return_full_text=False,
181
+         #temperature=None,
182
+         #truncate=None,
183
+         #seed=random_seed,
184
+         #typical_p=0.2,
185
+     ):
186
+         if hasattr(response, "token"):
187
+             if not response.token.special:
188
+                 snippet = response.token.text
189
+                 output += snippet
190
+                 if (print_stream):
191
+                     print(snippet, end='', flush=True)
192
+     return output
193
+
194
+ prompt = "What are the main challenges to support a long context for LLM?"
195
+ result = invoke_tgi(prompt)
196
+ ```
197
+
198
+ **Important** - When using MistralLite for inference for the first time, it may require a brief 'warm-up' period that can take 10s of seconds. However, subsequent inferences should be faster and return results in a more timely manner. This warm-up period is normal and should not affect the overall performance of the system once the initialisation period has been completed.
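One way to observe the warm-up effect is to time the first call separately from later ones. The sketch below assumes any zero-argument callable that performs one inference, for example `lambda: invoke_tgi(prompt, print_stream=False)`. The helper `time_calls` is hypothetical, and the stand-in workload exists only so the snippet runs without a TGI server.

```python
import time
from typing import Callable, List


def time_calls(fn: Callable[[], object], n: int = 3) -> List[float]:
    """Call fn n times and return the wall-clock duration of each call in seconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Stand-in workload (an assumption for illustration);
    # replace it with a real inference call to measure warm-up.
    durations = time_calls(lambda: sum(range(100_000)))
    print([f"{d:.4f}s" for d in durations])
```

If the first duration is much larger than the rest, that difference is the warm-up cost described above.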
199
+
200
+
201
  ## How to Deploy MistralLite on Amazon SageMaker ##
202
+ **Important:**
203
+ - For an end-to-end example Jupyter notebook using the SageMaker built-in container, please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/sagemaker-tgi/example_usage.ipynb).
204
+ - If the **input context length is greater than 12K tokens**, we recommend using a custom Docker container; please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/sagemaker-tgi-custom/example_usage.ipynb).
205
+
206
  ### Install the necessary packages
207
 
208
  Requires: [sagemaker](https://pypi.org/project/sagemaker/) 2.192.1 or later.
 
305
  print(result)
306
  ```
307
 
308
 
309
  ## How to Serve MistralLite on vLLM ##
310
  Documentation on installing and using vLLM [can be found here](https://vllm.readthedocs.io/en/latest/).
311
 
312
+ **Important** - For an end-to-end example Jupyter notebook, please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/vllm/example_usage.ipynb).
313
+
314
  ### Using vLLM as a server ###
315
  When using vLLM as a server, pass the --model amazon/MistralLite parameter, for example:
316
  ```shell