Ta20230804/llm-jp-3-13b-finetune

@@ -13,50 +13,40 @@ base_model:
 pipeline_tag: text-generation
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
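-This model is a fine-tuned version of llm-jp/llm-jp-3-13b, a large language model developed by the Research and Development Center for Large Language Models at the National Institute of Informatics,
-trained on the ichikara-instruction-003-001-1 dataset.
 Development was fully supported by the course "大規模言語モデルDeepLearning応用講座" (Large Language Models: Deep Learning Applications) run by the Matsuo Lab at the University of Tokyo.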
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub.
-- **Developed by:** Ta20230804
-- **Language(s) (NLP):** Japanese
-- **License:** apache-2.0
-- **Finetuned from model [optional]:** llm-jp/llm-jp-3-13b
-## Uses
-## How to Get Started with the Model
-### Execution environment
-- Omnicampus
 - L4 (GPU: 24 GB)
-### How to run inference
-- Install the libraries
 ```
 !pip install -U bitsandbytes
 !pip install -U transformers
 !pip install -U accelerate
 !pip install -U datasets
 !pip install -U peft
 # Enable interactive display in notebooks (may not work in every environment).
 !pip install ipywidgets --upgrade
 from transformers import (
     AutoModelForCausalLM,
     AutoTokenizer,
@@ -66,12 +56,6 @@ from peft import PeftModel
 import torch
 from tqdm import tqdm
 import json
-```
-- Settings
-```
-# Paste the token obtained from Hugging Face here.
-HF_TOKEN = "Hugging Face Token"
 # The base model and the trained LoRA adapter.
 # The value of model_id is the model path in the Omnicampus environment; change it when running elsewhere.
@@ -102,10 +86,8 @@ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, toke
 model = PeftModel.from_pretrained(model, adapter_id, token = HF_TOKEN)
 ```
-- Load the dataset
 ```
-# Load the dataset.
-# In the Omnicampus development environment, drag and drop the task jsonl file into the left pane before running.
 datasets = []
 with open("./elyza-tasks-100-TV_0.jsonl", "r") as f:
     item = ""
@@ -117,9 +99,8 @@ with open("./elyza-tasks-100-TV_0.jsonl", "r") as f:
         item = ""
 ```
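The loader accumulates lines until the buffer ends with `}`, which lets one record span several lines. A minimal sketch of the same idea that instead asks the JSON parser whether the buffer is complete, so a nested object that happens to close at the end of a line is not mistaken for the end of a record. The function name `iter_json_records` is hypothetical and not part of the original card.

```python
import json
from typing import Iterable, Iterator


def iter_json_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield JSON objects from text where one object may span several lines."""
    buffer = ""
    for line in lines:
        buffer += line.strip()
        if not buffer:
            continue  # skip blank lines between records
        try:
            record = json.loads(buffer)
        except json.JSONDecodeError:
            continue  # object is not complete yet; keep accumulating
        yield record
        buffer = ""
    if buffer:
        raise ValueError("Input ended in the middle of a JSON object")
```

With this helper, the dataset could be read as `datasets = list(iter_json_records(f))` inside the existing `with open(...)` block.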
-- Inference
 ```
-# llmjp
 results = []
 for data in tqdm(datasets):
@@ -146,385 +127,12 @@ for data in tqdm(datasets):
   results.append({"task_id": data["task_id"], "input": input, "output": output})
 ```
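The training section formats each example as `### 指示`, the instruction, `### 回答`, then the answer, with no indentation, whereas the inference loop shown there builds its prompt from an indented f-string, so the prompt carries leading spaces that the training text did not have. A minimal sketch, assuming one wants the inference prompt to match the training layout exactly. `build_prompt` and `strip_prompt_tokens` are hypothetical helpers, and whether the difference affects this model's output has not been measured here.

```python
# Same layout as the training template: instruction header, instruction,
# answer header, answer. The Japanese headers are kept verbatim because
# the model was trained on them.
PROMPT_TEMPLATE = "### 指示\n{}\n### 回答\n{}"


def build_prompt(instruction: str) -> str:
    """Format an instruction like a training example with an empty answer."""
    return PROMPT_TEMPLATE.format(instruction, "")


def strip_prompt_tokens(generated_ids: list, prompt_length: int) -> list:
    """Drop the echoed prompt tokens so only newly generated ids remain."""
    return generated_ids[prompt_length:]
```

`strip_prompt_tokens` mirrors the slice `outputs[tokenized_input.size(1):]` used before decoding in the inference loop.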
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-- ichikara-instruction-003-001-1.json
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-```
-!pip install -U pip
-!pip install -U transformers
-!pip install -U bitsandbytes
-!pip install -U accelerate
-!pip install -U datasets
-!pip install -U peft
-!pip install -U trl
-!pip install -U wandb
-!pip install ipywidgets --upgrade
-from transformers import (
-    AutoModelForCausalLM,
-    AutoTokenizer,
-    BitsAndBytesConfig,
-    TrainingArguments,
-    logging,
-)
-from peft import (
-    LoraConfig,
-    PeftModel,
-    get_peft_model,
-)
-import os, torch, gc
-from datasets import load_dataset
-import bitsandbytes as bnb
-from trl import SFTTrainer
-HF_TOKEN = "Hugging Face Token"
-base_model_id = "models/models--llm-jp--llm-jp-3-13b/snapshots/cd3823f4c1fcbb0ad2e2af46036ab1b0ca13192a" # base model to fine-tune
-new_model_id = "llm-jp-3-13b-finetune" # name to give the fine-tuned model
-"""
-bnb_config: quantization settings
-  - load_in_4bit:
-      - load the model in 4-bit quantized form
-  - bnb_4bit_quant_type:
-      - quantization format to use
-  - bnb_4bit_compute_dtype:
-      - data type used for computation with the quantized weights
-"""
-bnb_config = BitsAndBytesConfig(
-    load_in_4bit=True,
-    bnb_4bit_quant_type="nf4", # NF4 is designed for normally distributed weights and usually preserves more accuracy than plain INT4
-    bnb_4bit_compute_dtype=torch.bfloat16,
-)
-"""
-model: the model
-  - base_model:
-      - base model to load (defined earlier)
-  - quantization_config:
-      - quantization settings defined in bnb_config
-  - device_map:
-      - device (CPU/GPU) to place the model on; "auto" assigns it automatically
-tokenizer: the tokenizer
-  - base_model:
-      - base model to load (defined earlier)
-  - trust_remote_code:
-      - allow execution of remote code (e.g. for custom models)
-"""
-model = AutoModelForCausalLM.from_pretrained(
-    base_model_id,
-    quantization_config=bnb_config,
-    device_map="auto"
-)
-tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
-"""
-find_all_linear_names: find the 4-bit quantized linear layers in the model.
-"""
-def find_all_linear_names(model):
-    cls = bnb.nn.Linear4bit # the 4-bit quantized linear layer class
-    lora_module_names = set() # collects the names of the linear layers found
-    # Walk every module in the model
-    for name, module in model.named_modules():
-        if isinstance(module, cls): # the module is a 4-bit quantized linear layer
-            names = name.split('.') # split the module name (handles nested modules)
-            lora_module_names.add(names[0] if len(names) == 1 else names[-1]) # add the innermost name to lora_module_names
-    # 'lm_head' must be excluded for 16-bit computation, so remove it from lora_module_names
-    if 'lm_head' in lora_module_names:
-        lora_module_names.remove('lm_head')
-    return list(lora_module_names) # return lora_module_names as a list
-modules = find_all_linear_names(model)
-"""
-peft_config: PEFT configuration
-  - r
-      - LoRA rank (4, 8, 16, 32, ...)
-      - larger values add capacity but also raise the risk of overfitting
-  - lora_alpha
-      - LoRA scaling factor
-  - lora_dropout
-      - dropout rate (regularization against overfitting)
-  - bias
-      - how bias terms are handled (with "none", LoRA does not train biases)
-  - task_type
-      - task type
-  - target_modules
-      - modules LoRA is applied to (the layers identified by the code above)
-"""
-peft_config = LoraConfig(
-    r=16,
-    lora_alpha=32,
-    lora_dropout=0.05,
-    bias="none",
-    task_type="CAUSAL_LM",
-    target_modules=modules,
-)
-model = get_peft_model(model, peft_config)
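-"""
-Dataset used for training
-This example uses Ichikara Instruction, published by LLM-jp. Access to the data requires an application, so apply only if you intend to use it.
-Please do not publish Ichikara Instruction on the Hugging Face Hub.
-The data is licensed CC-BY-NC-SA, so use it on the premise that the model inherits that license.
-After completing the application via the link below you will reach a Google Drive; download the entire Distribution20241221_all folder.
-This example uses "ichikara-instruction-003-001-1.json". Unpack it if needed (e.g. with !unzip) and set the dataset path accordingly.
-In the Omnicampus development environment, drag and drop the downloaded data into the left pane.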
-https://liat-aip.sakura.ne.jp/wp/llmのための日本語インストラクションデータ作成/llmのための日本語インストラクションデータ-公開/
-関根聡, 安藤まや, 後藤美知子, 鈴木久美, 河原大輔, 井之上直也, 乾健太郎. ichikara-instruction: LLMのための日本語インストラクションデータの構築. 言語処理学会第30回年次大会(2024)
-"""
-dataset = load_dataset("json", data_files="./ichikara-instruction-003-001-1.json")
-dataset
-prompt = """### 指示
-{}
-### 回答
-{}"""
-"""
-formatting_prompts_func: format each example to match the prompt template
-"""
-EOS_TOKEN = tokenizer.eos_token # the tokenizer's EOS (end-of-sequence) token
-def formatting_prompts_func(examples):
-    input = examples["text"] # input text
-    output = examples["output"] # output text
-    text = prompt.format(input, output) + EOS_TOKEN # build the prompt
-    return { "formatted_text" : text, } # return it in a new "formatted_text" field
-pass
-dataset = dataset.map(
-    formatting_prompts_func,
-    num_proc= 4, # number of parallel processes
-)
-dataset
-print(dataset["train"]["formatted_text"][3])
-"""
-training_arguments: training settings
-  - output_dir:
-      - directory where the trained model is saved
-  - per_device_train_batch_size:
-      - training batch size per device
-  - per_device_eval_batch_size:
-      - evaluation batch size per device
-  - gradient_accumulation_steps:
-      - number of steps to accumulate gradients over before each update
-  - optim:
-      - optimizer to use
-  - num_train_epochs:
-      - number of epochs
-  - eval_strategy:
-      - evaluation strategy ("no"/"steps"/"epoch")
-  - eval_steps:
-      - interval in steps between evaluations when eval_strategy is "steps"
-  - logging_strategy:
-      - logging strategy
-  - logging_steps:
-      - interval in steps between log outputs
-  - warmup_steps:
-      - number of learning-rate warmup steps
-  - save_steps:
-      - interval in steps between model checkpoints
-  - save_total_limit:
-      - number of checkpoints to keep
-  - max_steps:
-      - maximum number of training steps
-  - learning_rate:
-      - learning rate
-  - fp16:
-      - whether to use 16-bit floating point (exercise 8 of the course is a useful reference)
-  - bf16:
-      - whether to use BFloat16
-  - group_by_length:
-      - group batches by input sequence length (makes training more efficient)
-  - report_to:
-      - where logs are sent ("wandb"/"tensorboard", etc.)
-"""
-training_arguments = TrainingArguments(
-    output_dir=new_model_id,
-    per_device_train_batch_size=1,
-    gradient_accumulation_steps=2,
-    optim="paged_adamw_32bit",
-    num_train_epochs=1,
-    logging_strategy="steps",
-    logging_steps=10,
-    warmup_steps=10,
-    save_steps=100,
-    save_total_limit = 2,
-    max_steps = -1,
-    learning_rate=5e-5,
-    fp16=False,
-    bf16=False,
-    seed = 3407,
-    group_by_length=True,
-    report_to="none"
-)
-"""
-SFTTrainer: settings for supervised fine-tuning
-  - model:
-      - the loaded base model
-  - train_dataset:
-      - dataset used for training
-  - eval_dataset:
-      - dataset used for evaluation
-  - peft_config:
-      - PEFT (Parameter-Efficient Fine-Tuning) settings (specified when using LoRA)
-  - max_seq_length:
-      - maximum token length of sequences fed to the model
-  - dataset_text_field:
-      - name of the dataset field containing the training text
-  - tokenizer:
-      - tokenizer matching the model
-  - args:
-      - training hyperparameters (the TrainingArguments defined above)
-  - packing:
-      - whether to pack input sequences (False treats each input independently)
-"""
-trainer = SFTTrainer(
-    model=model,
-    train_dataset=dataset["train"],
-    peft_config=peft_config,
-    max_seq_length= 512,
-    dataset_text_field="formatted_text",
-    tokenizer=tokenizer,
-    args=training_arguments,
-    packing= False,
-)
-model.config.use_cache = False # disable caching
-trainer.train() # run training
-import json
-datasets = []
-with open("./elyza-tasks-100-TV_0.jsonl", "r") as f:
-    item = ""
-    for line in f:
-      line = line.strip()
-      item += line
-      if item.endswith("}"):
-        datasets.append(json.loads(item))
-        item = ""
-from tqdm import tqdm
-results = []
-for data in tqdm(datasets):
-  input = data["input"]
-  prompt = f"""### 指示
-  {input}
-  ### 回答
-  """
-  tokenized_input = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
-  attention_mask = torch.ones_like(tokenized_input)
-  with torch.no_grad():
-      outputs = model.generate(
-          tokenized_input,
-          attention_mask=attention_mask,
-          max_new_tokens=100,
-          do_sample=False,
-          repetition_penalty=1.2,
-          pad_token_id=tokenizer.eos_token_id
-      )[0]
-  output = tokenizer.decode(outputs[tokenized_input.size(1):], skip_special_tokens=True)
-  results.append({"task_id": data["task_id"], "input": input, "output": output})
 ```
-#### Training Hyperparameters
-- max_seq_length : 512
-- learning_rate : 5e-5
-- per_device_train_batch_size : 1
-- gradient_accumulation_steps : 2
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-Score of 2.66 from the Omnicampus automatic grading.
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-elyza-tasks-100-TV_0.jsonl

 pipeline_tag: text-generation
 ---
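+# Model overview
+This model is a fine-tuned version of llm-jp/llm-jp-3-13b, a large language model developed by the Research and Development Center for Large Language Models at the National Institute of Informatics,
+trained on the ichikara-instruction-003-001-1 dataset.
+It can run inference on elyza-tasks-100-TV_0.jsonl and write the results to a file named {adapter_id}-outputs.jsonl, where the name uses the part of adapter_id after the last "/".
 Development was fully supported by the course "大規模言語モデルDeepLearning応用講座" (Large Language Models: Deep Learning Applications) run by the Matsuo Lab at the University of Tokyo.
+## Execution environment
+- Python environment (Omnicampus, Google Colab, etc.)
 - L4 (GPU: 24 GB)
+## Installation
+- Install the libraries and store the Hugging Face token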
 ```
 !pip install -U bitsandbytes
 !pip install -U transformers
 !pip install -U accelerate
 !pip install -U datasets
 !pip install -U peft
 # Enable interactive display in notebooks (may not work in every environment).
 !pip install ipywidgets --upgrade
+# Paste the token obtained from Hugging Face here.
+HF_TOKEN = "Hugging Face Token"
+```
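Pasting a token into a notebook cell makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the token has already been exported in an environment variable. The default variable name `HF_TOKEN` and the helper `load_hf_token` are assumptions for illustration, not part of the original card.

```python
import os


def load_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token stored in an environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return token
```

`HF_TOKEN = load_hf_token()` could then stand in for the hard-coded assignment.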
+## Load the model and tokenizer
+```
 from transformers import (
     AutoModelForCausalLM,
     AutoTokenizer,
 import torch
 from tqdm import tqdm
 import json
 # The base model and the trained LoRA adapter.
 # The value of model_id is the model path in the Omnicampus environment; change it when running elsewhere.
 model = PeftModel.from_pretrained(model, adapter_id, token = HF_TOKEN)
 ```
+## Load the dataset
 ```
 datasets = []
 with open("./elyza-tasks-100-TV_0.jsonl", "r") as f:
     item = ""
         item = ""
 ```
+## Run inference
 ```
 results = []
 for data in tqdm(datasets):
   results.append({"task_id": data["task_id"], "input": input, "output": output})
 ```
+## Write the results
 ```
+import re
+jsonl_id = re.sub(".*/", "", adapter_id)
+with open(f"./{jsonl_id}-outputs.jsonl", 'w', encoding='utf-8') as f:
+    for result in results:
+        json.dump(result, f, ensure_ascii=False)  # ensure_ascii=False for handling non-ASCII characters
+        f.write('\n')
+```
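A minimal sketch of the output step as small reusable functions with a read-back helper, which can be handy for checking the file before submitting it. The names `output_name`, `write_jsonl`, and `read_jsonl` are hypothetical. `rsplit` is used here in place of the regular expression and is assumed to give the same result for ordinary adapter ids.

```python
import json
from pathlib import Path


def output_name(adapter_id: str) -> str:
    """Build the output file name from the last path component of adapter_id."""
    return adapter_id.rsplit("/", 1)[-1] + "-outputs.jsonl"


def write_jsonl(results: list, path: Path) -> None:
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for result in results:
            f.write(json.dumps(result, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list:
    """Read the file back, one JSON object per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Comparing `read_jsonl(path)` against `results` after writing confirms that every record round-trips.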