Phi-3-medium-4k-instruct-ko-poc-v0.1

Model Details

This model is trained using unsloth toolkit based on Microsoft's phi-3 Phi-3-medium-4k-instruct model (https://huggingface.co/unsloth/Phi-3-medium-4k-instruct) with some Korean instruction data added to enhance its Korean generation performance

Since my role is not as a working developer, but as ML Technical Specialist helping customers with quick PoCs/prototypes, and I was limited by Azure GPU resources available, I only trained with 40,000 samples on a single VM Azure Standard_NC24ads_A100_v4 for PoC purposes. Because I have not done any tokenizer extensions, you need a lot more tokens than English for text generation.

Dataset

The dataset used for training is as follows. To prevent catastrophic forgetting, I included non-Korean corpus as training data. Note that we did not use all of the data, but only sampled some of it. Korean textbooks were converted to Q&A format. The Guanaco dataset has been reformatted to fit the multiturn format like <|user|>\n{Q1}<|end|>\n<|assistant|>\n{A1}<|end|>\n<|user|>\n{Q2}<|end|>\n<|assistant|>\n{A2}<|end|>.

How to Get Started with the Model

Code snippets

### Load model
import torch
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
model_path = "daekeun-ml/Phi-3-medium-4k-instruct-ko-poc-v0.1"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_tar_dir, # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-3", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)

params = {
    "max_new_tokens": 256,
    "use_cache": True,
    "temperature": 0.05,
    "do_sample": True
}

### Inference
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# 1st example
messages = [
    {"from": "human", "value": "Continue the fibonnaci sequence in Korean: 1, 1, 2, 3, 5, 8,"},  
    {"from": "assistant", "value": "ν”Όλ³΄λ‚˜μΉ˜ μˆ˜μ—΄μ˜ λ‹€μŒ μˆ«μžλŠ” 13, 21, 34, 55, 89 λ“±μž…λ‹ˆλ‹€. 각 μˆ«μžλŠ” μ•žμ˜ 두 숫자의 ν•©μž…λ‹ˆλ‹€."},    
    {"from": "human", "value": "Compute 2x+3=12 in Korean"}, 
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, **params)

# 2nd example
messages = [
    {"from": "human", "value": "What is Machine Learning in Korean?"},  
    {"from": "assistant", "value": "인곡지λŠ₯의 ν•œ λΆ„μ•Όλ‘œ λ°©λŒ€ν•œ 데이터λ₯Ό 뢄석해 ν–₯ν›„ νŒ¨ν„΄μ„ μ˜ˆμΈ‘ν•˜λŠ” κΈ°λ²•μž…λ‹ˆλ‹€."},    
    {"from": "human", "value": "What is Deep Learning in Korean?"}, 
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, **params)

Inference results

# 1st example
<s><|user|> Continue the fibonnaci sequence in Korean: 1, 1, 2, 3, 5, 8,<|end|><|assistant|> ν”Όλ³΄λ‚˜μΉ˜ μˆ˜μ—΄μ˜ λ‹€μŒ μˆ«μžλŠ” 13, 21, 34, 55, 89 λ“±μž…λ‹ˆλ‹€. 각 μˆ«μžλŠ” μ•žμ˜ 두 숫자의 ν•©μž…λ‹ˆλ‹€.<|end|><|user|> Compute 2x+3=12 in Korean<|end|><|assistant|> 방정식 2x + 3 = 12μ—μ„œ xλ₯Ό ν’€λ €λ©΄ λ‹€μŒ 단계λ₯Ό λ”°λ₯΄μ‹­μ‹œμ˜€.

1. λ°©μ •μ‹μ˜ μ–‘μͺ½μ—μ„œ 3을 λΉΌμ„œ λ°©μ •μ‹μ˜ ν•œμͺ½μ— λŒ€ν•΄ xλ₯Ό λΆ„λ¦¬ν•©λ‹ˆλ‹€.

2x + 3 - 3 = 12 - 3

2x = 9

2. 이제 λ°©μ •μ‹μ˜ μ–‘μͺ½μ„ 2둜 λ‚˜λˆ„μ–΄ x의 값을 κ΅¬ν•©λ‹ˆλ‹€.

2λ°° / 2 = 9 / 2

x = 4.5

λ”°λΌμ„œ 방정식 2x + 3 = 12에 λŒ€ν•œ ν•΄λŠ” x = 4.5μž…λ‹ˆλ‹€.<|end|>

# 2nd example
<s><|user|> What is Machine Learning in Korean?<|end|><|assistant|> 인곡지λŠ₯의 ν•œ λΆ„μ•Όλ‘œ λ°©λŒ€ν•œ 데이터λ₯Ό 뢄석해 ν–₯ν›„ νŒ¨ν„΄μ„ μ˜ˆμΈ‘ν•˜λŠ” κΈ°λ²•μž…λ‹ˆλ‹€.<|end|><|user|> What is Deep Learning in Korean?<|end|><|assistant|> λ³΅μž‘ν•œ 데이터 μ„ΈνŠΈλ₯Ό λΆ„μ„ν•˜κ³  λ³΅μž‘ν•œ νŒ¨ν„΄μ„ μΈμ‹ν•˜κ³  ν•™μŠ΅ν•˜λŠ” 데 μ‚¬μš©λ˜λŠ” λ”₯λŸ¬λ‹μ€ λ§Žμ€ λ ˆμ΄μ–΄λ‘œ κ΅¬μ„±λœ μ‹ κ²½λ§μ˜ ν•˜μœ„ μ§‘ν•©μž…λ‹ˆλ‹€. 이 κΈ°μˆ μ€ 이미지 인식, μžμ—°μ–΄ 처리 및 자율 μš΄μ „κ³Ό 같은 λ‹€μ–‘ν•œ μ‘μš© λΆ„μ•Όμ—μ„œ 큰 λ°œμ „μ„ μ΄λ€˜μŠ΅λ‹ˆλ‹€.<|end|>

References

Notes

License

apache 2.0; The license of phi-3 is MIT, but I considered the licensing of the dataset and library used for training.

Caution

This model was created as a personal experiment, unrelated to the organization I work for. The model may not operate correctly because separate verification was not performed. Please be careful unless it is for personal experimentation or PoC (Proof of Concept)!

Downloads last month
19
Safetensors
Model size
14B params
Tensor type
BF16
Β·
Inference Examples
Inference API (serverless) has been turned off for this model.

Datasets used to train daekeun-ml/Phi-3-medium-4k-instruct-ko-poc-v0.1