Korean Style Transfer

This model is a fine-tuned version of Synatra-7B-v0.3-dpo using a Korean style dataset provided by Smilegate AI (https://github.com/smilegate-ai/korean_smile_style_dataset/tree/main). Since the original dataset is tabular and not fit for training the LLM, I have preprocessed it into an instruction-input-output format, which can be found here. The dataset is then fed into the ChatML template. Feel free to use my version of the dataset as needed.

ν•΄λ‹Ή λͺ¨λΈμ€ Synatra-7B-v0.3-dpo λͺ¨λΈμ„ 슀마일게이트 AIμ—μ„œ μ œκ³΅ν•˜λŠ” Smile style λ°μ΄ν„°μ…‹μœΌλ‘œ νŒŒμΈνŠœλ‹ ν–ˆμŠ΅λ‹ˆλ‹€. κΈ°μ‘΄ 데이터셋은 ν…Œμ΄λΈ” ν˜•νƒœλ‘œ λ˜μ–΄μžˆμ–΄ ν•΄λ‹Ή 데이터λ₯Ό instruction-input-output ν˜•νƒœλ‘œ λ§Œλ“€μ—ˆκ³ , μ—¬κΈ°μ—μ„œ 확인 κ°€λŠ₯ν•©λ‹ˆλ‹€. 데이터셋을 뢈러온 λ’€ ChatML ν˜•μ‹μ— 맞좰 ν›ˆλ ¨ 데이터 ꡬ좕을 ν•œ λ’€ μ§„ν–‰ν–ˆμŠ΅λ‹ˆλ‹€. ν•„μš”ν•˜μ‹œλ‹€λ©΄ 자유둭게 μ‚¬μš©ν•˜μ‹œκΈ° λ°”λžλ‹ˆλ‹€.

How to use

>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = AutoTokenizer.from_pretrained('brian-lim/smile-style-transfer')
model = AutoModelForCausalLM.from_pretrained('brian-lim/smile-style-transfer', device_map = device)

prompts = {'informal': '주어진 글을 κ°€λŠ₯ν•œ ν˜•μ‹μ μ΄μ§€ μ•Šκ³  λ”±λ”±ν•˜μ§€ μ•Šμ€ λŒ€ν™”μ²΄λ‘œ λ°”κΏ”μ€˜.',
          'android': '주어진 글을 κ°€λŠ₯ν•œ μ•ˆλ“œλ‘œμ΄λ“œ λ‘œλ΄‡κ³Ό 같은 λŒ€ν™”μ²΄λ‘œ λ°”κΏ”μ€˜.',
          'azae': '주어진 글을 κ°€λŠ₯ν•œ 아저씨같은 말투둜 λ°”κΏ”μ€˜.',
          'chat': '주어진 글을 κ°€λŠ₯ν•œ 인터넷상에 μ‚¬μš©ν•˜λŠ” 말투둜 λ°”κΏ”μ€˜.',
          'choding': '주어진 글을 κ°€λŠ₯ν•œ μ΄ˆλ“±ν•™μƒμ²˜λŸΌ 짧게 쀄인 λŒ€ν™”μ²΄λ‘œ λ°”κΏ”μ€˜.',
          'emoticon': '주어진 글을 κ°€λŠ₯ν•œ 이λͺ¨ν‹°μ½˜μ΄ λ“€μ–΄κ°„ λŒ€ν™”μ²΄λ‘œ λ°”κΏ”μ€˜.',
          'enfp': '주어진 글을 κ°€λŠ₯ν•œ ν™œκΈ°μ°¨λ©΄μ„œ 곡감을 많이 ν•˜λŠ” μΉœμ ˆν•œ λŒ€ν™”μ²΄λ‘œ λ°”κΏ”μ€˜.',
          'gentle' : '주어진 글을 κ°€λŠ₯ν•œ β€œμš”β€λ‘œ λλ‚˜μ§€ μ•ŠμœΌλ©΄μ„œ κΉ”λ”ν•œ λŒ€ν™”μ²΄λ‘œ λ°”κΏ”μ€˜.',
          'halbae' : '주어진 글을 κ°€λŠ₯ν•œ μ—°λ₯œμ΄ μžˆλŠ” 할아버지 같은 맑투둜 λ°”κΏ”μ€˜.',
          'halmae' : '주어진 글을 κ°€λŠ₯ν•œ 비속어가 λ“€μ–΄κ°€λŠ” ν• λ¨Έλ‹ˆ 같은 맑투둜 λ°”κΏ”μ€˜.',
          'joongding': '주어진 글을 κ°€λŠ₯ν•œ 쀑학ꡐ 2ν•™λ…„μ˜ 말투둜 λ°”κΏ”μ€˜.',
          'king': '주어진 글을 κ°€λŠ₯ν•œ μ‘°μ„ μ‹œλŒ€ μ™•μ˜ 말투둜 λ°”κΏ”μ€˜.',
          'seonbi': '주어진 글을 κ°€λŠ₯ν•œ μ‘°μ„ μ‹œλŒ€ μ„ λΉ„μ˜ 말투둜 λ°”κΏ”μ€˜.',
          'sosim': '주어진 글을 κ°€λŠ₯ν•œ μ•„μ£Ό μ†Œμ‹¬ν•˜κ³  μ‘°μ‹¬μŠ€λŸ¬μš΄ 말투둜 λ°”κΏ”μ€˜.',
          'translator': '주어진 글을 κ°€λŠ₯ν•œ μ–΄μƒ‰ν•œ ν•œκ΅­μ–΄ λ²ˆμ—­ 말투둜 λ°”κΏ”μ€˜.',
          }
query = '[INPUT]: μ•ˆλ…•ν•˜μ„Έμš”. μš”μ¦˜ 날씨가 많이 μŒ€μŒ€ν•˜λ„€μš” \n[OUTPUT]: '

input_query = prompts['king'] + query
input_tokenized = tokenizer(input_query,return_tensors="pt").to(device)

g_config = GenerationConfig(temperature=0.3,
                        repetition_penalty=1.2,
                        max_new_tokens=768,
                        do_sample=True,
                        )
output = model.generate(**input_tokenized,
                      generation_config=g_config,        
                      pad_token_id=tokenizer.eos_token_id,
                      eos_token_id=tokenizer.eos_token_id,)
output_text = tokenizer.decode(output.detach().cpu().numpy()[0])
output_text = output_text[output_text.find('[OUTPUT]'):]
print(output_text)

license: apache-2.0

Downloads last month
41
Safetensors
Model size
7.24B params
Tensor type
BF16
Β·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Dataset used to train brian-lim/smile-style-transfer