EEVE-Korean-Instruct-10.8B-v1.0-AWQ

Description

This repo contains AWQ model files for yanolja/EEVE-Korean-Instruct-10.8B-v1.0.

About AWQ

AWQ is an efficient, accurate, and fast low-bit weight quantization method that currently supports 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference, with quality equivalent to or better than the most commonly used GPTQ settings.
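
To make "low-bit weight quantization" concrete, here is a minimal sketch of plain group-wise 4-bit round-to-nearest quantization. It is not the AWQ algorithm itself: AWQ additionally rescales weight channels using activation statistics before quantizing, and that step is omitted here. The function names and example weights are illustrative assumptions.

```python
def quantize_group(weights, bits=4):
    """Quantize one group of floats to unsigned `bits`-bit integer codes.

    Returns (codes, scale, zero_point) such that
    weight is approximately (code - zero_point) * scale.
    """
    levels = 2 ** bits - 1                 # 15 distinct steps for 4-bit codes
    # Extend the range to include 0.0 so that zero is always representable.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / levels
    if scale == 0:                         # all-zero group: any scale works
        scale = 1.0
    zero_point = round(-w_min / scale)
    codes = [
        min(levels, max(0, round(w / scale) + zero_point))  # clip to [0, levels]
        for w in weights
    ]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate float weights."""
    return [(c - zero_point) * scale for c in codes]


# Illustrative weights (hypothetical values, not taken from the model).
weights = [-0.6, -0.2, 0.0, 0.3, 0.9]
codes, scale, zero_point = quantize_group(weights)
restored = dequantize_group(codes, scale, zero_point)
print(codes, scale, zero_point)
print(restored)
```

Each weight is stored as a 4-bit code instead of a 16-bit float, so the weights take roughly a quarter of the space, plus a small overhead for the per-group scale and zero point.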

AWQ models can be served with vLLM, as described in the following section.

Using OpenAI Chat API with vLLM

Documentation on installing and using vLLM can be found here.

  • Please ensure you are using vLLM version 0.2 or later.
  • When using vLLM as a server, pass the --quantization awq parameter.

Start the OpenAI-Compatible Server:

  • vLLM can be deployed as a server that implements the OpenAI API protocol, which allows it to be used as a drop-in replacement in applications that use the OpenAI API.
python3 -m vllm.entrypoints.openai.api_server --model Copycats/EEVE-Korean-Instruct-10.8B-v1.0-AWQ --quantization awq --dtype half
  • --model: Hugging Face model path
  • --quantization: "awq"
  • --dtype: "half" for FP16. Recommended for AWQ quantization.
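
If you prefer to launch the server from a Python script, the same command can be assembled as an argument list. This is a minimal sketch: the helper name build_server_command is hypothetical, and only the three flags documented above are used.

```python
import shlex
import sys


def build_server_command(model, quantization="awq", dtype="half"):
    """Return the argument list for vLLM's OpenAI-compatible server."""
    return [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--quantization", quantization,
        "--dtype", dtype,
    ]


command = build_server_command("Copycats/EEVE-Korean-Instruct-10.8B-v1.0-AWQ")
print(shlex.join(command))  # shell-quoted form of the command, for inspection

# To actually start the server (this call blocks until the server exits):
#   import subprocess
#   subprocess.run(command, check=True)
```

Building the command as a list avoids shell-quoting problems when the model path or other values come from configuration.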

Querying the model using OpenAI Chat API:

  • You can use the create chat completion endpoint to communicate with the model in a chat-like interface:
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Copycats/EEVE-Korean-Instruct-10.8B-v1.0-AWQ",
        "messages": [
            {"role": "system", "content": "당신은 사용자의 질문에 친절하게 답변하는 어시스턴트입니다."},
            {"role": "user", "content": "괜스레 슬퍼서 눈물이 나면 어떻게 하나요?"}
        ]
    }'
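
The same request can be built with only the Python standard library, which may help when neither curl nor the openai package is available. This is a sketch under assumptions: the helper name build_chat_request is hypothetical, and the server is assumed to listen on localhost:8000 as in the curl call.

```python
import json
import urllib.request


def build_chat_request(messages, model, base_url="http://localhost:8000/v1"):
    """Build a POST request for the chat completions endpoint."""
    payload = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    messages=[
        # "You are an assistant who kindly answers the user's questions."
        {"role": "system", "content": "당신은 사용자의 질문에 친절하게 답변하는 어시스턴트입니다."},
        # "What should I do if I feel sad for no reason and start to cry?"
        {"role": "user", "content": "괜스레 슬퍼서 눈물이 나면 어떻게 하나요?"},
    ],
    model="Copycats/EEVE-Korean-Instruct-10.8B-v1.0-AWQ",
)

# Sending the request requires the vLLM server to be running:
# with urllib.request.urlopen(request) as response:
#     reply = json.loads(response.read())
#     print(reply["choices"][0]["message"]["content"])
```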

Python Client Example:

  • Using the openai Python package, you can also communicate with the model in a chat-like manner:
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Copycats/EEVE-Korean-Instruct-10.8B-v1.0-AWQ",
    messages=[
        # System prompt: "You are an assistant who kindly answers the user's questions."
        {"role": "system", "content": "당신은 사용자의 질문에 친절하게 답변하는 어시스턴트입니다."},
        # User prompt: "What should I do if I feel sad for no reason and start to cry?"
        {"role": "user", "content": "괜스레 슬퍼서 눈물이 나면 어떻게 하나요?"},
    ]
)
print("Chat response:", chat_response)
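
Printing chat_response shows the whole response object. To get only the assistant's text, read the message content of the first choice. The sketch below uses a stand-in object with the same attribute layout so that it runs without a server. The helper name reply_text and the stand-in reply are assumptions for illustration.

```python
from types import SimpleNamespace


def reply_text(chat_response):
    """Return the assistant message text from the first choice."""
    return chat_response.choices[0].message.content


# Stand-in shaped like a chat completion response (hypothetical content).
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(role="assistant", content="안녕하세요!")
        )
    ]
)
print(reply_text(fake_response))
```

With a running server, reply_text(chat_response) can be called on the object returned by client.chat.completions.create.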

Safetensors · Model size: 1.74B params · Tensor type: I32, FP16