jklj077 committed
Commit ddbda89 · verified · 1 Parent(s): 4e8250b

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -108,21 +108,21 @@ For deployment, we recommend using vLLM. You can enable the long-context capabil
  3. **Model Deployment**: Utilize vLLM to deploy your model. For instance, you can set up an openAI-like server using the command:
  
  ```bash
- python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-7B-Instruct --model path/to/weights
+ python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2.5-7B-Instruct --model path/to/weights
  ```
  Then you can access the Chat API by:
  ```bash
  curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
- "model": "Qwen2-7B-Instruct",
+ "model": "Qwen2.5-7B-Instruct",
  "messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "Your Long Input Here."}
  ]
  }'
  ```
- For further usage instructions of vLLM, please refer to our [Github](https://github.com/QwenLM/Qwen2).
+ For further usage instructions of vLLM, please refer to our [Github](https://github.com/QwenLM/Qwen2.5).
  **Note**: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts**. We advise adding the `rope_scaling` configuration only when processing long contexts is required.
  
  ## Evaultion & Performance
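
The diff changes the served model name in both the server command and the request body, and the two have to agree for the request to be routed to the model. As a rough illustration, the same request as the README's curl call can be sketched in Python with only the standard library. The helper names `build_chat_request` and `chat` are hypothetical and not part of the README or of vLLM; the URL and payload are copied from the curl example and assume the server is running locally on port 8000.

```python
import json
import urllib.request

# Assumption: the vLLM server from the README is listening on localhost:8000.
BASE_URL = "http://localhost:8000/v1/chat/completions"
# Should match the value passed to --served-model-name when starting the server.
SERVED_MODEL_NAME = "Qwen2.5-7B-Instruct"


def build_chat_request(user_text, model=SERVED_MODEL_NAME, url=BASE_URL):
    """Build (but do not send) a POST request mirroring the README's curl call."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_text},
        ],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_text)) as resp:
        body = json.load(resp)
    # OpenAI-style chat completion responses carry the reply here.
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Requires a running server; otherwise urlopen raises URLError.
    print(chat("Your Long Input Here."))
```

Keeping the model name in one constant makes it harder for the request and the `--served-model-name` flag to drift apart, which is the mismatch this commit fixes in the README.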