---
language:
- en
license: mit
library_name: transformers
tags:
- code
base_model:
- google/gemma-1.1-2b-it
datasets:
- kreimben/leetcode_with_youtube_captions
- kreimben/leetcode_user_submissions
widget:
- text: >-
    explain about two sum problem. from brute force approach to the most
    advanced algorithms.
  example_title: two sum example
- text: explain about leetcode 72 edit distance. i don't get even the approach.
  example_title: edit distance example
- text: explain about leetcode 139 Word Break. please give me the approach.
  example_title: word break example
inference:
  parameters:
    max_new_tokens: 250
    temperature: 0.3
pipeline_tag: text-generation
---
# CodeMind

## Introduction

CodeMind is a language model that helps solve coding-test problems and supports learning. It was fine-tuned on captions from LeetCode walkthrough videos and on users' posted write-ups, so it can give answers that are more specialized for coding tests.
## Model Details

- Model name: CodeMind
- Base model: google/gemma-1.1-2b-it
- Training language: English
- Model size: 2.51B parameters
## Team

- NLP: 3 members
- SRE: 2 members
## Key Features

- Explains problem types and solution approaches
- Generates solution code
## Training Data

- LeetCode user submissions: Python solutions to a wide range of algorithm problems
- YouTube captions: explanations and step-by-step guides for LeetCode problems
## Libraries Used

- transformers: library for natural language processing models
- datasets: library for dataset processing and management
- bitsandbytes: library for optimized (quantized) operations
- peft: library for parameter-efficient fine-tuning
- trl: library for language-model fine-tuning
- pandas: library for data manipulation
## File Structure

- dataset/: contains the dataset files.
- eval/: contains the evaluation scripts.
- fine-tuning/: contains the fine-tuning notebooks and scripts.
  - gemma-1.1-2b-it peft qlora.ipynb: notebook detailing the fine-tuning process.
- demo.ipynb: demo notebook with model usage examples.
- requirements.txt: list of project dependencies.
- utils.py: utility functions.
## Usage

The model is available on the HuggingFace Model Hub and can be integrated into applications through the API. Given a coding problem or a programming-related question, the model generates a relevant explanation, code snippet, or guide.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kreimben/CodeMind-gemma-2b")
model = AutoModelForCausalLM.from_pretrained("kreimben/CodeMind-gemma-2b")

inputs = tokenizer("Enter a coding problem or question here", return_tensors="pt")
outputs = model.generate(inputs.input_ids)
print(tokenizer.decode(outputs[0]))
```
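The model was fine-tuned on Gemma's chat format (see the prompt template under Data Preparation below), so wrapping the input in the same turn markers should give noticeably better results than raw text. A minimal sketch, assuming the fine-tuned tokenizer inherits Gemma's chat template, and reusing the generation parameters from the inference widget above (max_new_tokens=250, temperature=0.3):

```python
# Format the question with the tokenizer's chat template before generating,
# mirroring the turn markers used in the training prompts.
chat = [{"role": "user", "content": "explain about leetcode 139 Word Break. please give me the approach."}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=250,  # matches the widget's inference parameters
    temperature=0.3,
    do_sample=True,      # temperature only takes effect when sampling
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```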
## Training Process

### Load the Model and Tokenizer
```python
import os

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization so the 2.5B model fits comfortably on a single GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = 'google/gemma-1.1-2b-it'
token = os.getenv('HF_READ')

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map={"": 0}, token=token
)
model.config.use_cache = False          # incompatible with gradient checkpointing
model.gradient_checkpointing_enable()   # trade compute for memory during training

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
```
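Before moving on, it can be worth confirming that 4-bit loading actually took effect; transformers exposes `get_memory_footprint` for a quick sanity check (the expected size is an estimate, not a figure from this card):

```python
# A ~2.5B-parameter model loaded in 4-bit should report roughly 2 GB,
# versus around 5 GB in bfloat16.
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```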
### LoRA Configuration and Model Preparation
```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import bitsandbytes as bnb

model = prepare_model_for_kbit_training(model)

def find_all_linear_names(model):
    """Collect the names of all 4-bit linear layers to target with LoRA."""
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names:  # leave the output head untouched
        lora_module_names.remove('lm_head')
    return list(lora_module_names)

modules = find_all_linear_names(model)

lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    target_modules=modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
```
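After `get_peft_model`, only the adapter weights should be trainable; peft's `print_trainable_parameters` is a quick way to verify that the frozen 4-bit base is untouched:

```python
# Expect only a small fraction of the ~2.5B parameters to be trainable,
# since r=64 adapters are attached only to the targeted linear layers.
model.print_trainable_parameters()
```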
### Data Preparation
```python
import datasets
import pandas as pd
from datasets import Dataset

# Python-only user submissions.
submission_dataset = datasets.load_dataset('kreimben/leetcode_user_submissions_only_python', split='train').to_pandas()
submission_dataset = submission_dataset[['title', 'question_hints', 'question_content', 'content']]

# YouTube walkthrough captions, with the caption column renamed to match.
captions_dataset = datasets.load_dataset('kreimben/leetcode_with_youtube_captions', split='train').to_pandas()
captions_dataset = captions_dataset[['title', 'question_hints', 'question_content', 'cc_content']]
captions_dataset.rename(columns={'cc_content': 'content'}, inplace=True)

dataset = pd.concat([submission_dataset, captions_dataset])
del submission_dataset, captions_dataset

dataset = Dataset.from_pandas(dataset)

GEMMA_2B_IT_MODEL_PREFIX_TEXT = "Below is an coding test problem. Solve the question."

def generate_prompt(data_point):
    """Render one training example in Gemma's chat format."""
    return f"""<bos><start_of_turn>user {GEMMA_2B_IT_MODEL_PREFIX_TEXT}
I don't know {data_point['title']} problem. give me the insight or appoach.
this is problem's hint.
{data_point['question_hints']}
here are some content of question.
{data_point['question_content']}<end_of_turn>
<start_of_turn>model {data_point['content']}<end_of_turn><eos>"""

text_column = [generate_prompt(data_point) for data_point in dataset]
dataset = dataset.add_column("prompt", text_column)
```
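Before training, it helps to eyeball one rendered example to confirm the turn markers and fields landed where expected:

```python
# Print the start of the first rendered prompt; it should open with
# <bos><start_of_turn>user and include the hint and question text.
print(dataset[0]["prompt"][:500])
```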
### Training
```python
from trl import SFTTrainer
import transformers
import torch

tokenizer.pad_token = tokenizer.eos_token
torch.cuda.empty_cache()

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="prompt",  # train on the rendered chat prompts
    peft_config=lora_config,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=transformers.TrainingArguments(
        output_dir='out',
        bf16=True,
        max_steps=100,
        warmup_steps=50,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        optim="paged_adamw_8bit",  # paged optimizer to avoid OOM spikes
        logging_steps=20,
        report_to='wandb',
    ),
)

trainer.train()
```
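The loop above only trains the adapter; to reuse the result, the LoRA weights have to be saved and, for standalone inference, merged back into the base model. A sketch using standard peft APIs (both output paths are placeholders, not paths from this repository):

```python
import torch
from peft import AutoPeftModelForCausalLM

# Save only the LoRA adapter weights (small compared to the full model).
trainer.model.save_pretrained("codemind-lora-adapter")  # placeholder path

# Reload the adapter on top of the base model in bf16 and merge for inference.
merged = AutoPeftModelForCausalLM.from_pretrained(
    "codemind-lora-adapter", torch_dtype=torch.bfloat16
)
merged = merged.merge_and_unload()
merged.save_pretrained("codemind-merged")  # placeholder path
```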
## Evaluation

The model's performance was evaluated as follows:
| Metric     | Value |
|------------|-------|
| Average    | 41.62 |
| ARC        | 41.81 |
| HellaSwag  | 59.03 |
| MMLU       | 37.26 |
| TruthfulQA | 43.45 |
| Winogrande | 59.91 |
| GSM8K      | 8.26  |
## Limitations and Ethical Considerations

- The model's output is based on its training data and may not always be accurate.
- Always verify the model's output before relying on it for important decisions or real-world problem solving.