winglian committed
Commit 0a472e1 · unverified · 1 Parent(s): 5cb7ea4

quickstart instructions for starting from runpod (#5)

README.md CHANGED
@@ -24,7 +24,97 @@ datasets:
   - Optionally Download some datasets, see [data/README.md](data/README.md)


- - Create a new or update the existing YAML config [config/pythia_1_2B_alpaca.yml](config/pythia_1_2B_alpaca.yml)
+ - Create a new or update the existing YAML config [config/sample.yml](config/sample.yml)
+
+ ```yaml
+ # this is the huggingface model that contains *.pt, *.safetensors, or *.bin files
+ # this can also be a relative path to a model on disk
+ base_model: decapoda-research/llama-7b-hf-int4
+ # you can specify an ignore pattern if the model repo contains more than 1 model type (*.pt, etc)
+ base_model_ignore_patterns:
+ # if the base_model repo on hf hub doesn't include configuration .json files,
+ # you can set that here, or leave this empty to default to base_model
+ base_model_config: decapoda-research/llama-7b-hf
+ # If you want to specify the type of model to load, AutoModelForCausalLM is a good choice too
+ model_type: AutoModelForCausalLM
+ # Corresponding tokenizer for the model; AutoTokenizer is a good choice
+ tokenizer_type: AutoTokenizer
+ # whether you are training a 4-bit quantized model
+ load_4bit: true
+ # this will attempt to quantize the model down to 8 bits and use the 8-bit Adam optimizer
+ load_in_8bit: true
+ # a list of one or more datasets to finetune the model with
+ datasets:
+   # this can be either a hf dataset, or a relative path
+   - path: vicgalle/alpaca-gpt4
+     # The type of prompt to use for training. [alpaca, sharegpt, gpteacher, oasst, reflection]
+     type: alpaca
+ # axolotl attempts to save the dataset as an arrow file after packing the data together so
+ # subsequent training attempts load faster, relative path
+ dataset_prepared_path: data/last_run_prepared
+ # How much of the dataset to set aside for evaluation. 1 = 100%, 0.50 = 50%, etc
+ val_set_size: 0.04
+ # if you want to use lora, leave blank to train all parameters in the original model
+ adapter: lora
+ # if you already have a lora model trained that you want to load, put that here
+ lora_model_dir:
+ # the maximum length of an input to train with; this should typically be less than 2048
+ # as most models have a token/context limit of 2048
+ sequence_len: 2048
+ # max sequence length to concatenate training samples together up to
+ # inspired by StackLLaMA. see https://huggingface.co/blog/stackllama#supervised-fine-tuning
+ max_packed_sequence_len: 1024
+ # lora hyperparameters
+ lora_r: 8
+ lora_alpha: 16
+ lora_dropout: 0.05
+ lora_target_modules:
+   - q_proj
+   - v_proj
+ #  - k_proj
+ #  - o_proj
+ lora_fan_in_fan_out: false
+ # wandb configuration if you're using it
+ wandb_project:
+ wandb_watch:
+ wandb_run_id:
+ wandb_log_model: checkpoint
+ # where to save the finished model to
+ output_dir: ./completed-model
+ # training hyperparameters
+ batch_size: 8
+ micro_batch_size: 2
+ num_epochs: 3
+ warmup_steps: 100
+ learning_rate: 0.00003
+ # whether to mask out or include the human's prompt from the training labels
+ train_on_inputs: false
+ # don't use this, leads to wonky training (according to someone on the internet)
+ group_by_length: false
+ # Use CUDA bf16
+ bf16: true
+ # Use CUDA tf32
+ tf32: true
+ # does not work with current implementation of 4-bit LoRA
+ gradient_checkpointing: false
+ # stop training after this many evaluation losses have increased in a row
+ # https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
+ early_stopping_patience: 3
+ # specify a scheduler to use with the optimizer. only one_cycle is supported currently
+ lr_scheduler:
+ # whether to use the xformers attention patch https://github.com/facebookresearch/xformers
+ xformers_attention:
+ # whether to use the flash attention patch https://github.com/HazyResearch/flash-attention
+ flash_attention:
+ # resume from a specific checkpoint dir
+ resume_from_checkpoint:
+ # if resume_from_checkpoint isn't set and you simply want it to start where it left off
+ # be careful with this being turned on between different models
+ auto_resume_from_checkpoints: false
+ # don't mess with this, it's here for accelerate and torchrun
+ local_rank:
+ ```
+
   - Install python dependencies with ONE of the following:

   - `pip3 install -e .[int4]` (recommended)
@@ -54,3 +144,29 @@ use_cpu: false

   - Train! `accelerate launch scripts/finetune.py`, make sure to choose the correct YAML config file
   - Alternatively you can pass in the config file like: `accelerate launch scripts/finetune.py configs/llama_7B_alpaca.yml`~~
+
+
+ ## How to start training on Runpod in under 10 minutes
+
+ - Choose your Docker container wisely.
+   - I recommend `huggingface/transformers-pytorch-deepspeed-latest-gpu`; see https://hub.docker.com/r/huggingface/transformers-pytorch-deepspeed-latest-gpu/
+ - Once you start your Runpod and SSH into it:
+ ```shell
+ source <(curl -s https://raw.githubusercontent.com/winglian/axolotl/main/scripts/setup-runpod.sh)
+ ```
+
+ - Once the setup script completes, launch training:
+ ```shell
+ accelerate launch scripts/finetune.py configs/quickstart.yml
+ ```
+
+ - Here are some helpful environment variables you'll want to set manually if you open a new shell:
+ ```shell
+ export WANDB_MODE=offline
+ export WANDB_CACHE_DIR=/workspace/data/wandb-cache
+ export HF_DATASETS_CACHE="/workspace/data/huggingface-cache/datasets"
+ export HUGGINGFACE_HUB_CACHE="/workspace/data/huggingface-cache/hub"
+ export TRANSFORMERS_CACHE="/workspace/data/huggingface-cache/hub"
+ export NCCL_P2P_DISABLE=1
+ ```
+

configs/accelerate/default_config.yaml ADDED
@@ -0,0 +1,15 @@
+ compute_environment: LOCAL_MACHINE
+ distributed_type: 'NO'
+ downcast_bf16: 'no'
+ gpu_ids: all
+ machine_rank: 0
+ main_training_function: main
+ mixed_precision: bf16
+ num_machines: 1
+ num_processes: 1
+ rdzv_backend: static
+ same_network: true
+ tpu_env: []
+ tpu_use_cluster: false
+ tpu_use_sudo: false
+ use_cpu: false
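
For `accelerate launch` to pick this config up without prompting, it has to sit in accelerate's default config location; the Runpod setup script added below copies it there, which by hand looks like:

```shell
mkdir -p ~/.cache/huggingface/accelerate/
cp configs/accelerate/default_config.yaml ~/.cache/huggingface/accelerate/default_config.yaml
```
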
configs/llama_7B_4bit.yml CHANGED
@@ -4,7 +4,7 @@ model_type: LlamaForCausalLM
  tokenizer_type: LlamaTokenizer
  load_in_8bit: true
  datasets:
-   - path: vicgalle/alpaca-gpt4
+   - path: tatsu-lab/alpaca # original alpaca dataset
      type: alpaca
  dataset_prepared_path: data/last_run_prepared
  val_set_size: 0.04
@@ -29,6 +29,7 @@ output_dir: ./lora-test
  batch_size: 8
  micro_batch_size: 2
  num_epochs: 3
+ warmup_steps: 100
  learning_rate: 0.00003
  train_on_inputs: false
  group_by_length: false
@@ -37,5 +38,8 @@ tf32: true
  gradient_checkpointing: false
  early_stopping_patience: 3
  resume_from_checkpoint:
+ auto_resume_from_checkpoints: true
  local_rank:
  load_4bit: true
+ xformers_attention: true
+ flash_attention:

configs/quickstart.yml ADDED
@@ -0,0 +1,45 @@
+ base_model: decapoda-research/llama-7b-hf-int4
+ base_model_config: decapoda-research/llama-7b-hf
+ model_type: LlamaForCausalLM
+ tokenizer_type: LlamaTokenizer
+ load_in_8bit: true
+ datasets:
+   - path: tatsu-lab/alpaca # original alpaca dataset
+     type: alpaca
+ dataset_prepared_path: data/last_run_prepared
+ val_set_size: 0.04
+ adapter: lora
+ lora_model_dir:
+ sequence_len: 1024
+ max_packed_sequence_len: 1024
+ lora_r: 8
+ lora_alpha: 16
+ lora_dropout: 0.05
+ lora_target_modules:
+   - q_proj
+   - v_proj
+ #  - k_proj
+ #  - o_proj
+ lora_fan_in_fan_out: false
+ wandb_project:
+ wandb_watch:
+ wandb_run_id:
+ wandb_log_model: checkpoint
+ output_dir: ./lora-test
+ batch_size: 4
+ micro_batch_size: 1
+ num_epochs: 3
+ warmup_steps: 100
+ learning_rate: 0.00003
+ train_on_inputs: false
+ group_by_length: false
+ bf16: true
+ tf32: true
+ gradient_checkpointing: false
+ early_stopping_patience: 3
+ resume_from_checkpoint:
+ auto_resume_from_checkpoints: true
+ local_rank:
+ load_4bit: true
+ xformers_attention: true
+ flash_attention:

configs/sample.yml ADDED
@@ -0,0 +1,86 @@
+ # this is the huggingface model that contains *.pt, *.safetensors, or *.bin files
+ # this can also be a relative path to a model on disk
+ base_model: decapoda-research/llama-7b-hf-int4
+ # you can specify an ignore pattern if the model repo contains more than 1 model type (*.pt, etc)
+ base_model_ignore_patterns:
+ # if the base_model repo on hf hub doesn't include configuration .json files,
+ # you can set that here, or leave this empty to default to base_model
+ base_model_config: decapoda-research/llama-7b-hf
+ # If you want to specify the type of model to load, AutoModelForCausalLM is a good choice too
+ model_type: AutoModelForCausalLM
+ # Corresponding tokenizer for the model; AutoTokenizer is a good choice
+ tokenizer_type: AutoTokenizer
+ # whether you are training a 4-bit quantized model
+ load_4bit: true
+ # this will attempt to quantize the model down to 8 bits and use the 8-bit Adam optimizer
+ load_in_8bit: true
+ # a list of one or more datasets to finetune the model with
+ datasets:
+   # this can be either a hf dataset, or a relative path
+   - path: vicgalle/alpaca-gpt4
+     # The type of prompt to use for training. [alpaca, sharegpt, gpteacher, oasst, reflection]
+     type: alpaca
+ # axolotl attempts to save the dataset as an arrow file after packing the data together so
+ # subsequent training attempts load faster, relative path
+ dataset_prepared_path: data/last_run_prepared
+ # How much of the dataset to set aside for evaluation. 1 = 100%, 0.50 = 50%, etc
+ val_set_size: 0.04
+ # if you want to use lora, leave blank to train all parameters in the original model
+ adapter: lora
+ # if you already have a lora model trained that you want to load, put that here
+ lora_model_dir:
+ # the maximum length of an input to train with; this should typically be less than 2048
+ # as most models have a token/context limit of 2048
+ sequence_len: 2048
+ # max sequence length to concatenate training samples together up to
+ # inspired by StackLLaMA. see https://huggingface.co/blog/stackllama#supervised-fine-tuning
+ max_packed_sequence_len: 1024
+ # lora hyperparameters
+ lora_r: 8
+ lora_alpha: 16
+ lora_dropout: 0.05
+ lora_target_modules:
+   - q_proj
+   - v_proj
+ #  - k_proj
+ #  - o_proj
+ lora_fan_in_fan_out: false
+ # wandb configuration if you're using it
+ wandb_project:
+ wandb_watch:
+ wandb_run_id:
+ wandb_log_model: checkpoint
+ # where to save the finished model to
+ output_dir: ./completed-model
+ # training hyperparameters
+ batch_size: 8
+ micro_batch_size: 2
+ num_epochs: 3
+ warmup_steps: 100
+ learning_rate: 0.00003
+ # whether to mask out or include the human's prompt from the training labels
+ train_on_inputs: false
+ # don't use this, leads to wonky training (according to someone on the internet)
+ group_by_length: false
+ # Use CUDA bf16
+ bf16: true
+ # Use CUDA tf32
+ tf32: true
+ # does not work with current implementation of 4-bit LoRA
+ gradient_checkpointing: false
+ # stop training after this many evaluation losses have increased in a row
+ # https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
+ early_stopping_patience: 3
+ # specify a scheduler to use with the optimizer. only one_cycle is supported currently
+ lr_scheduler:
+ # whether to use the xformers attention patch https://github.com/facebookresearch/xformers
+ xformers_attention:
+ # whether to use the flash attention patch https://github.com/HazyResearch/flash-attention
+ flash_attention:
+ # resume from a specific checkpoint dir
+ resume_from_checkpoint:
+ # if resume_from_checkpoint isn't set and you simply want it to start where it left off
+ # be careful with this being turned on between different models
+ auto_resume_from_checkpoints: false
+ # don't mess with this, it's here for accelerate and torchrun
+ local_rank:

requirements.txt CHANGED
@@ -12,3 +12,5 @@ wandb
  flash-attn
  deepspeed
  einops
+ xformers
+

scripts/finetune.py CHANGED
@@ -225,7 +225,14 @@ def train(
      )

      logging.info("Starting trainer...")
-     trainer.train(resume_from_checkpoint=cfg.resume_from_checkpoint)
+     resume_from_checkpoint = cfg.resume_from_checkpoint
+     if cfg.resume_from_checkpoint is None and cfg.auto_resume_from_checkpoints:
+         possible_checkpoints = [str(cp) for cp in Path(cfg.output_dir).glob("checkpoint-*")]
+         if len(possible_checkpoints) > 0:
+             sorted_paths = sorted(possible_checkpoints, key=lambda path: int(path.split('-')[-1]))
+             resume_from_checkpoint = sorted_paths[-1]
+             logging.info(f"Using Auto-resume functionality to start with checkpoint at {resume_from_checkpoint}")
+     trainer.train(resume_from_checkpoint=resume_from_checkpoint)

      if cfg.local_rank == 0:
          # TODO do we need this fix? https://huggingface.co/docs/accelerate/usage_guides/fsdp#saving-and-loading
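
The auto-resume path only kicks in when `resume_from_checkpoint` is left empty and the new flag is enabled in the training YAML; it then resumes from the highest-numbered `checkpoint-*` directory under `output_dir`. The quickstart config in this commit enables it like so:

```yaml
output_dir: ./lora-test
resume_from_checkpoint:
auto_resume_from_checkpoints: true
```
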
scripts/setup-runpod.sh ADDED
@@ -0,0 +1,34 @@
+ #!/bin/bash
+
+ export WANDB_MODE=offline
+ export WANDB_CACHE_DIR=/workspace/data/wandb-cache
+ mkdir -p $WANDB_CACHE_DIR
+
+ mkdir -p /workspace/data/huggingface-cache/{hub,datasets}
+ export HF_DATASETS_CACHE="/workspace/data/huggingface-cache/datasets"
+ export HUGGINGFACE_HUB_CACHE="/workspace/data/huggingface-cache/hub"
+ export TRANSFORMERS_CACHE="/workspace/data/huggingface-cache/hub"
+ export NCCL_P2P_DISABLE=1
+
+ nvidia-smi
+ num_gpus=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
+ gpu_indices=$(seq 0 $((num_gpus - 1)) | paste -sd "," -)
+ export CUDA_VISIBLE_DEVICES=$gpu_indices
+ echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
+
+ apt-get update
+ apt-get install -y build-essential ninja-build vim git-lfs
+ git lfs install
+ pip3 install --force-reinstall https://download.pytorch.org/whl/nightly/cu117/torch-2.0.0.dev20230301%2Bcu117-cp38-cp38-linux_x86_64.whl --index-url https://download.pytorch.org/whl/nightly/cu117
+ if [ -z "${TORCH_CUDA_ARCH_LIST}" ]; then # only set this if not set yet
+     # this covers most common GPUs that the installed version of pytorch supports
+     # python -c "import torch; print(torch.cuda.get_arch_list())"
+     export TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6+PTX"
+ fi
+
+ cd /workspace/
+ git clone https://github.com/winglian/axolotl.git
+ cd axolotl
+ pip install -e .[int4]
+ mkdir -p ~/.cache/huggingface/accelerate/
+ cp configs/accelerate/default_config.yaml ~/.cache/huggingface/accelerate/default_config.yaml

src/axolotl/utils/models.py CHANGED
@@ -66,7 +66,10 @@ def load_model(
          from alpaca_lora_4bit.autograd_4bit import load_llama_model_4bit_low_ram
          from huggingface_hub import snapshot_download

-         cache_model_path = Path(snapshot_download(base_model))
+         snapshot_download_kwargs = {}
+         if cfg.base_model_ignore_patterns:
+             snapshot_download_kwargs["ignore_patterns"] = cfg.base_model_ignore_patterns
+         cache_model_path = Path(snapshot_download(base_model, **snapshot_download_kwargs))
          files = (
              list(cache_model_path.glob("*.pt"))
              + list(cache_model_path.glob("*.safetensors"))
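
This wires the `base_model_ignore_patterns` option from the sample config into the 4-bit download path: whatever the YAML sets is passed straight through to `snapshot_download` as `ignore_patterns`. A minimal sketch of how it could be used (the glob value is only an illustrative assumption, not something this commit sets):

```yaml
base_model: decapoda-research/llama-7b-hf-int4
# example only: skip *.pt files when the repo also ships weights in another format
base_model_ignore_patterns: "*.pt"
```
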
src/axolotl/utils/trainer.py CHANGED
@@ -11,9 +11,9 @@ def setup_trainer(cfg, train_dataset, eval_dataset, model, tokenizer):
      total_num_steps = int(
          math.ceil(len(train_dataset) * cfg.num_epochs / cfg.batch_size)
      )
-     warmup_steps = min(int(0.03 * total_num_steps), 100)
+     warmup_steps = cfg.warmup_steps if cfg.warmup_steps else min(int(0.03 * total_num_steps), 100)
      logging_steps = max(min(int(0.005 * total_num_steps), 10), 1)
-     save_steps = eval_steps = min(int(0.05 * total_num_steps), 200)
+     save_steps = eval_steps = cfg.save_steps if cfg.save_steps else min(int(0.05 * total_num_steps), 200)

      training_arguments_kwargs = {}
      if cfg.bf16 == "full":
@@ -45,24 +45,23 @@ def setup_trainer(cfg, train_dataset, eval_dataset, model, tokenizer):
          **training_arguments_kwargs,
      )

-     decay_parameters = get_parameter_names(model, [nn.LayerNorm])
-     decay_parameters = [name for name in decay_parameters if "bias" not in name]
-     optimizer_grouped_parameters = [
-         {
-             "params": [p for n, p in model.named_parameters() if n in decay_parameters],
-             "weight_decay": training_args.weight_decay,
-         },
-         {
-             "params": [
-                 p for n, p in model.named_parameters() if n not in decay_parameters
-             ],
-             "weight_decay": 0.0,
-         },
-     ]
-
      trainer_kwargs = {}

      if cfg.load_in_8bit and not cfg.load_4bit:
+         decay_parameters = get_parameter_names(model, [nn.LayerNorm])
+         decay_parameters = [name for name in decay_parameters if "bias" not in name]
+         optimizer_grouped_parameters = [
+             {
+                 "params": [p for n, p in model.named_parameters() if n in decay_parameters],
+                 "weight_decay": training_args.weight_decay,
+             },
+             {
+                 "params": [
+                     p for n, p in model.named_parameters() if n not in decay_parameters
+                 ],
+                 "weight_decay": 0.0,
+             },
+         ]
          optimizer = bnb.optim.Adam8bit(
              optimizer_grouped_parameters,
              betas=(training_args.adam_beta1, training_args.adam_beta2),
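
With this change, `warmup_steps` and `save_steps` can be pinned from the training YAML instead of always being derived from the dataset size (3% of total steps capped at 100, and 5% capped at 200, respectively); leaving them unset keeps the old behaviour, and `save_steps` also drives `eval_steps`. A small sketch, matching the `warmup_steps: 100` used by the configs in this commit (the `save_steps` value is only an example):

```yaml
warmup_steps: 100
save_steps: 100
```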