Commit 8e8a9cd by renyiyu (verified) · 1 Parent(s): c7660ae

Create README.md

Files changed (1)
  1. README.md +81 -0
README.md ADDED
@@ -0,0 +1,81 @@
+ ---
+ base_model: meta-llama/Llama-2-7b-hf
+ ---
+
+ # Model Details
+
+ - SFT: fine-tuned from [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on merged Alpaca datasets
+ - DPO: trained on top of the SFT model as a LoRA adapter, with merged [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) data
+ - PPO: trained on top of the DPO model with a reward model, using multiple adapters, on [PKU-SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) data for further RLHF
+ - Trained with DeepSpeed ZeRO-1 + TRL + QLoRA + FlashAttention-2
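For readers unfamiliar with the DPO stage listed above, the per-pair objective can be sketched in plain Python. This is the standard DPO loss written out for illustration, not the training code used for this model (the card says training went through TRL). The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for the example.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs.

    beta is an illustrative value, not one reported in the model card.
    """
    # How much more the policy likes each response than the frozen reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# A policy identical to the reference gives a loss of log(2);
# shifting probability toward the chosen response lowers it.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss only depends on how the policy's preference between the two responses differs from the reference model's, which is why the DPO stage needs no separate reward model.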
+
+ ## Model and Training Details
+
+ - **Finetuned from model:** [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
+
+ - **Dataset:**
+   - SFT (mixed train):
+     - [yahma/alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned)
+     - [vicgalle/alpaca-gpt4](https://huggingface.co/datasets/vicgalle/alpaca-gpt4)
+   - DPO (mixed train):
+     - [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
+     - [Unified-Language-Model-Alignment/Anthropic_HH_Golden](https://huggingface.co/datasets/Unified-Language-Model-Alignment/Anthropic_HH_Golden)
+   - PPO:
+     - [PKU-Alignment/PKU-SafeRLHF-10K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-10K)
+     - [PKU-Alignment/PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K)
+     - [PKU-Alignment/PKU-SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF)
+
+ ### Training Results
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/65b1dd2a855f6b5fe621bc0e/miik5Tb6A8G6sDTlnQA-V.png)
+
+ ### Evaluation
+
+ Reward and toxicity scores were computed on [PKU-Alignment/PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K) data and compared across the SFT, DPO, and PPO models.
+
+ | Model | Toxicity | Reward |
+ | ----- |:--------:|:--------:|
+ | SFT_v0.1 | 0.0698 | -0.2828 |
+ | DPO_v0.1 | 0.0356 | -0.2633 |
+ | PPO_v0.1 | 0.0321 | 0.38 |
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/65b1dd2a855f6b5fe621bc0e/m-k6kUuIJVTkYM2l3uBPd.png)
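The table is easier to read as changes relative to the SFT baseline. The sketch below uses only the numbers reported in the table; the helper names `relative_toxicity_reduction` and `reward_gain` are hypothetical and introduced just for this illustration.

```python
# Scores copied from the evaluation table in the model card.
scores = {
    "SFT_v0.1": {"toxicity": 0.0698, "reward": -0.2828},
    "DPO_v0.1": {"toxicity": 0.0356, "reward": -0.2633},
    "PPO_v0.1": {"toxicity": 0.0321, "reward": 0.38},
}

def relative_toxicity_reduction(baseline, model):
    """Fraction by which `model` lowers toxicity compared with `baseline`."""
    base = scores[baseline]["toxicity"]
    return (base - scores[model]["toxicity"]) / base

def reward_gain(baseline, model):
    """Absolute change in mean reward from `baseline` to `model`."""
    return scores[model]["reward"] - scores[baseline]["reward"]

for name in ("DPO_v0.1", "PPO_v0.1"):
    drop = relative_toxicity_reduction("SFT_v0.1", name)
    gain = reward_gain("SFT_v0.1", name)
    print(f"{name}: toxicity {drop:.1%} lower than SFT, reward {gain:+.4f}")
```

Read this way, both alignment stages roughly halve the toxicity score, while most of the reward improvement comes from the PPO stage.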
+
+ ### Compute Infrastructure
+
+ The model was trained on 8 × RTX-3090-24GB / A100-PCIE-40GB GPUs.
+
+ ### Inference
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Placeholders: set both for your checkpoint.
+ # "</s>" is the stock Llama-2 EOS token and is only an assumed default here.
+ model_name = "<hub-id-or-local-path>"
+ DEFINE_EOS_TOKEN = "</s>"
+
+ model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+
+ tokenizer.pad_token = tokenizer.eos_token
+ tokenizer.eos_token = DEFINE_EOS_TOKEN
+ # generate() stops on eos_token_id, so keep the model config in sync with the tokenizer.
+ model.config.eos_token_id = tokenizer.eos_token_id
+
+ def format_prompt(question):
+     return f"###Question: {question}\n###Answer: "
+
+ instruction = "Your text here"
+ prompt = format_prompt(instruction)
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ # Greedy decoding: top_p has no effect when do_sample=False, so it is omitted.
+ output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
+ output = tokenizer.decode(output_ids[0], skip_special_tokens=True)
+ print(output)
+ ```
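Because `generate` returns the prompt tokens followed by the completion, the decoded string still begins with the `###Question:` / `###Answer:` template. A small helper can strip it. This is a sketch: `extract_answer` and `ANSWER_MARKER` are hypothetical names, not part of the original card, and the decoded string below is a stand-in so the example runs without loading the model.

```python
ANSWER_MARKER = "###Answer:"

def format_prompt(question: str) -> str:
    # Same template as the inference snippet in the model card.
    return f"###Question: {question}\n{ANSWER_MARKER} "

def extract_answer(decoded: str) -> str:
    """Return the text after the first answer marker, or the whole string if absent."""
    _, marker, tail = decoded.partition(ANSWER_MARKER)
    return tail.strip() if marker else decoded.strip()

# Stand-in for tokenizer.decode(...): prompt followed by a made-up completion.
decoded = format_prompt("What is the capital of France?") + "Paris."
print(extract_answer(decoded))
```

Splitting on the marker rather than slicing by prompt length avoids depending on the tokenizer reproducing the prompt's whitespace exactly.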
+ ## Model Card Authors
+
+ Yiyu (Michael) Ren
+
+ ## Model Card Contact
+
+
+ ### Framework versions
+
+ - PEFT 0.8.2