brian-lim committed on
Commit 20c3afe · 1 Parent(s): cd3d1a9

Add How to use

  1. README.md +44 -8
README.md CHANGED
@@ -8,24 +8,60 @@ language:
  # Korean Style Transfer

  This model is a fine-tuned version of [Synatra-7B-v0.3-dpo](https://huggingface.co/maywell/Synatra-7B-v0.3-dpo) using a Korean style dataset provided by Smilegate AI (https://github.com/smilegate-ai/korean_smile_style_dataset/tree/main).
- Since the original dataset is tabular and not fit for training the LLM, I have preprocessed it into instruction-input-output format, which can be found (here)[https://huggingface.co/datasets/brian-lim/smile_style_orca].
  The dataset is then fed into the ChatML template. Feel free to use my version of the dataset as needed.

  This model is [Synatra-7B-v0.3-dpo](https://huggingface.co/maywell/Synatra-7B-v0.3-dpo) fine-tuned on the Smile style dataset provided by Smilegate AI.
- Because the original dataset is in tabular form, I converted it into instruction-input-output form, which is available (here)[https://huggingface.co/datasets/brian-lim/smile_style_orca].
  After loading the dataset, I built the training data in the ChatML format and then ran the fine-tuning. Feel free to use it if you need it.

- # Intended use & limitations

- To be added

- To be added

- # How to use

- To be added

- To be added

  ---
  license: apache-2.0
 
  # Korean Style Transfer

  This model is a fine-tuned version of [Synatra-7B-v0.3-dpo](https://huggingface.co/maywell/Synatra-7B-v0.3-dpo) using a Korean style dataset provided by Smilegate AI (https://github.com/smilegate-ai/korean_smile_style_dataset/tree/main).
+ Since the original dataset is tabular and not fit for training the LLM, I have preprocessed it into an instruction-input-output format, which can be found [here](https://huggingface.co/datasets/brian-lim/smile_style_orca).
  The dataset is then fed into the ChatML template. Feel free to use my version of the dataset as needed.

  This model is [Synatra-7B-v0.3-dpo](https://huggingface.co/maywell/Synatra-7B-v0.3-dpo) fine-tuned on the Smile style dataset provided by Smilegate AI.
+ Because the original dataset is in tabular form, I converted it into instruction-input-output form, which is available [here](https://huggingface.co/datasets/brian-lim/smile_style_orca).
  After loading the dataset, I built the training data in the ChatML format and then ran the fine-tuning. Feel free to use it if you need it.


+ # How to use

+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+
+ tokenizer = AutoTokenizer.from_pretrained('brian-lim/smile-style-transfer')
+ model = AutoModelForCausalLM.from_pretrained('brian-lim/smile-style-transfer', device_map=device)
+
+ prompts = {'informal': '주어진 글을 가능한 형식적이지 않고 딱딱하지 않은 대화체로 바꿔줘.',  # casual, non-stiff conversational tone
+            'android': '주어진 글을 가능한 안드로이드 로봇과 같은 대화체로 바꿔줘.',  # android-robot speech
+            'azae': '주어진 글을 가능한 아저씨같은 말투로 바꿔줘.',  # middle-aged man (ajeossi) speech
+            'chat': '주어진 글을 가능한 인터넷상에 사용하는 말투로 바꿔줘.',  # internet chat speech
+            'choding': '주어진 글을 가능한 초등학생처럼 짧게 줄인 대화체로 바꿔줘.',  # short, clipped elementary-schooler speech
+            'emoticon': '주어진 글을 가능한 이모티콘이 들어간 대화체로 바꿔줘.',  # conversational tone with emoticons
+            'enfp': '주어진 글을 가능한 활기차면서 공감을 많이 하는 친절한 대화체로 바꿔줘.',  # lively, empathetic, friendly tone
+            'gentle': '주어진 글을 가능한 “요”로 끝나지 않으면서 깔끔한 대화체로 바꿔줘.',  # neat tone that does not end in "요"
+            'halbae': '주어진 글을 가능한 연륜이 있는 할아버지 같은 맑투로 바꿔줘.',  # seasoned grandfather speech
+            'halmae': '주어진 글을 가능한 비속어가 들어가는 할머니 같은 맑투로 바꿔줘.',  # grandmother speech with coarse language
+            'joongding': '주어진 글을 가능한 중학교 2학년의 말투로 바꿔줘.',  # second-year middle schooler speech
+            'king': '주어진 글을 가능한 조선시대 왕의 말투로 바꿔줘.',  # Joseon-era king speech
+            'seonbi': '주어진 글을 가능한 조선시대 선비의 말투로 바꿔줘.',  # Joseon-era scholar (seonbi) speech
+            'sosim': '주어진 글을 가능한 아주 소심하고 조심스러운 말투로 바꿔줘.',  # very timid, cautious speech
+            'translator': '주어진 글을 가능한 어색한 한국어 번역 말투로 바꿔줘.',  # awkward translated-Korean speech
+            }
+ query = '[INPUT]: 안녕하세요. 요즘 날씨가 많이 쌀쌀하네요 \n[OUTPUT]: '  # "Hello. The weather has been quite chilly lately"
+
+ input_query = prompts['king'] + query  # style instruction followed by the tagged input
+ input_tokenized = tokenizer(input_query, return_tensors="pt").to(device)

+ g_config = GenerationConfig(temperature=0.3,
+                             repetition_penalty=1.2,
+                             max_new_tokens=768,
+                             do_sample=True,
+                             )
+ output = model.generate(**input_tokenized,
+                         generation_config=g_config,
+                         pad_token_id=tokenizer.eos_token_id,
+                         eos_token_id=tokenizer.eos_token_id,)
+ output_text = tokenizer.decode(output[0])  # decode the first (only) generated sequence
+ output_text = output_text[output_text.find('[OUTPUT]'):]  # keep the text from the [OUTPUT] tag onward
+ print(output_text)
+ ```


  ---
  license: apache-2.0
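
The usage snippet added in this commit hard-codes one style (`'king'`) and one input sentence. As a minimal sketch, the same prompt layout could be wrapped in two small helpers. The names `build_query`, `extract_output`, `PROMPTS` and `OUTPUT_TAG` are hypothetical and not part of the repository; `PROMPTS` is only a two-entry excerpt of the README's `prompts` dict. The sketch assumes the model expects exactly the `[INPUT]: ... \n[OUTPUT]: ` layout shown in the README.

```python
# Hypothetical helpers (not part of the repository) that mirror the prompt
# layout used in the README's usage snippet.

# Excerpt of the README's `prompts` dict; the full mapping has 15 styles.
PROMPTS = {
    "king": "주어진 글을 가능한 조선시대 왕의 말투로 바꿔줘.",
    "seonbi": "주어진 글을 가능한 조선시대 선비의 말투로 바꿔줘.",
}

OUTPUT_TAG = "[OUTPUT]:"


def build_query(style: str, text: str) -> str:
    """Concatenate the style instruction and the tagged input, README-style."""
    if style not in PROMPTS:
        raise KeyError(f"unknown style: {style!r}")
    return f"{PROMPTS[style]}[INPUT]: {text} \n{OUTPUT_TAG} "


def extract_output(decoded: str) -> str:
    """Return the text after the last '[OUTPUT]:' tag, or '' if the tag is absent."""
    _, tag, tail = decoded.rpartition(OUTPUT_TAG)
    return tail.strip() if tag else ""


query = build_query("king", "안녕하세요. 요즘 날씨가 많이 쌀쌀하네요")
print(query)
```

`build_query(...)` can be passed to the tokenizer in place of `input_query`, and `extract_output(...)` can be applied to the decoded string. Unlike the README snippet, which uses `find` and keeps the `[OUTPUT]` tag in the printed text, this helper drops the tag and splits on its last occurrence. If end-of-sequence markers appear in the decoded text, passing `skip_special_tokens=True` to `tokenizer.decode` is one way to remove them.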