Update README.md
README.md (changed)
@@ -108,21 +108,21 @@ For deployment, we recommend using vLLM. You can enable the long-context capabil
 3. **Model Deployment**: Utilize vLLM to deploy your model. For instance, you can set up an openAI-like server using the command:
 
     ```bash
-    python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-7B-Instruct --model path/to/weights
+    python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2.5-7B-Instruct --model path/to/weights
     ```
     Then you can access the Chat API by:
     ```bash
     curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
-           "model": "Qwen2-7B-Instruct",
+           "model": "Qwen2.5-7B-Instruct",
           "messages": [
               {"role": "system", "content": "You are a helpful assistant."},
               {"role": "user", "content": "Your Long Input Here."}
           ]
        }'
     ```
-    For further usage instructions of vLLM, please refer to our [Github](https://github.com/QwenLM/Qwen2).
+    For further usage instructions of vLLM, please refer to our [Github](https://github.com/QwenLM/Qwen2.5).
 **Note**: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts**. We advise adding the `rope_scaling` configuration only when processing long contexts is required.
 
 ## Evaultion & Performance
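As context for the note on `rope_scaling`: the long-context setup referenced in the hunk header is enabled by adding a `rope_scaling` entry to the model's `config.json`. A minimal sketch of what such an entry typically looks like is given below; the specific values shown (YARN scaling with a factor of 4.0 over a 32,768-token native context) are illustrative and should be taken from the long-context instructions earlier in the README, not from this sketch.

```json
{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```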