Lambent committed
Commit a4d184c · verified · 1 Parent(s): 7da7e1e

Update README.md

Files changed (1)
  1. README.md +89 -2
README.md CHANGED
@@ -23,9 +23,96 @@ output = generator([{"role": "user", "content": question}], max_new_tokens=128,
  print(output["generated_text"])
  ```

- ## Training procedure

- [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/logical-luminosity/eidolon-qwen2.5-qlora-dpo-3/runs/vja6mgde)
+ ## Axolotl Config

+ ```
+ base_model: Lambent/ProtoEidolon-v2.2.4-14B
+ model_type: AutoModelForCausalLM
+ tokenizer_type: AutoTokenizer
+ trust_remote_code: true
+
+ save_safetensors: true
+
+ load_in_8bit: false
+ load_in_4bit: true
+ strict: false
+
+ rl: dpo
+ chat_template: chatml
+ # total_num_tokens:
+ datasets:
+   - path: Lambent/rp-teacher-synth-dpo
+     split: train
+     type: chatml.prompt_pairs
+   - path: unalignment/toxic-dpo-v0.2
+     split: train
+     type: chatml.prompt_pairs
+   - path: sam-paech/gutenbergs_1_2_3_antislop-dpo
+     split: train
+     type: chatml.ultra
+   - path: Trelis/orpo-dpo-mix-40k-SHORT
+     split: train
+     type: chatml.ultra
+
+ dataset_prepared_path: prepared-dpo
+ output_dir: ./dpoq
+ val_set_size: 0.001
+
+ seed: 213
+
+ sequence_len: 2048
+ sample_packing: false
+ eval_sample_packing: false
+ pad_to_sequence_len: false
+
+ adapter: qlora
+ lora_model_dir:
+ lora_r: 256
+ lora_alpha: 256
+ lora_dropout: 0.05
+ lora_target_linear: true
+ lora_fan_in_fan_out:
+ peft_use_dora: true
+
+ wandb_project: eidolon-qwen2.5-qlora-dpo-3
+ wandb_entity:
+ wandb_watch:
+ wandb_name:
+ wandb_log_model:
+
+ gradient_accumulation_steps: 16
+ micro_batch_size: 2
+ num_epochs: 1
+ optimizer: paged_adamw_8bit
+ lr_scheduler: cosine
+ learning_rate: 1e-6
+ #cosine_min_lr_ratio: 0.1
+ #cosine_constant_lr_ratio: 0.95
+
+ train_on_inputs: false
+ group_by_length: false
+ bf16: auto
+ fp16:
+ tf32: false
+
+ gradient_checkpointing: true
+ early_stopping_patience:
+ resume_from_checkpoint:
+ local_rank:
+ logging_steps: 1
+ xformers_attention:
+ flash_attention: true
+
+ warmup_steps: 16
+ evals_per_epoch: 8
+ saves_per_epoch: 8
+ save_total_limit: 2
+ debug:
+ deepspeed:
+ weight_decay: 0.001
+ fsdp:
+ fsdp_config:
+ ```

  This model was trained with DPO, a method introduced in [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://huggingface.co/papers/2305.18290).
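
For reference, a minimal sketch of the preference objective behind the `rl: dpo` setting above, following the formulation in the linked paper. The β value and tensor names are illustrative placeholders, not values taken from this run:

```
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over summed per-sequence log-probs of the chosen and rejected
    completions under the trained policy and the frozen reference model."""
    # Implicit rewards: how far the policy has moved from the reference
    # on each completion, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Minimizing -log sigmoid(margin) pushes chosen completions above rejected ones.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

During training these log-probabilities are computed from the chosen/rejected pairs supplied by the `datasets:` entries in the config.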
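
As a rough guide to what the quantization and adapter settings in the config map onto, here is a sketch in plain `transformers` + `peft` terms (4-bit base model, DoRA adapter over all linear layers with r = α = 256). This is an illustration of the hyperparameters, not the exact code path Axolotl executes:

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "Lambent/ProtoEidolon-v2.2.4-14B"  # base_model from the config

# load_in_4bit: true -> 4-bit quantized base weights (bf16 compute assumed, per bf16: auto)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# adapter: qlora with lora_target_linear: true and peft_use_dora: true ->
# a DoRA adapter over every linear projection.
lora_config = LoraConfig(
    r=256,
    lora_alpha=256,
    lora_dropout=0.05,
    target_modules="all-linear",
    use_dora=True,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```

The actual run was launched through Axolotl with the YAML above; this snippet only shows how the quantization and adapter hyperparameters correspond to the underlying libraries.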