---
base_model: meta-llama/Llama-2-7b-hf
---
# Model Details
- SFT: fine-tuned from [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on merged Alpaca-style datasets
- DPO: trained on top of the SFT model as a LoRA adapter with merged [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) data (a minimal training sketch follows this list)
- PPO: trained on top of the DPO model and a reward model with multiple adapters, using [PKU-SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) data for further RLHF
- Trained with DeepSpeed ZeRO-1 + TRL + QLoRA + Flash-Attention 2
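A minimal sketch of the DPO stage above (a QLoRA adapter trained on the SFT model with TRL's `DPOTrainer`, 0.7.x-era API). The SFT checkpoint path, the preprocessing helper `to_prompt_chosen_rejected`, the DeepSpeed config file, and all hyperparameters are illustrative assumptions, not the exact training setup:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import DPOTrainer

sft_model_path = "path/to/sft-merged-model"  # placeholder for the SFT checkpoint

# 4-bit quantized base (QLoRA) with Flash-Attention 2
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    sft_model_path,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(sft_model_path)
tokenizer.pad_token = tokenizer.eos_token

# LoRA adapter trained on top of the frozen, quantized SFT weights
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# hh-rlhf pairs must first be mapped to {"prompt", "chosen", "rejected"} columns
dataset = load_dataset("Anthropic/hh-rlhf", split="train")
# dataset = dataset.map(to_prompt_chosen_rejected)  # hypothetical preprocessing helper

trainer = DPOTrainer(
    model,
    ref_model=None,  # with a PEFT adapter, the frozen base model serves as the reference
    beta=0.1,
    args=TrainingArguments(
        output_dir="dpo-out",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        bf16=True,
        deepspeed="ds_zero1.json",  # placeholder DeepSpeed ZeRO-1 config path
    ),
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```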
## Model and Training Details
- **Finetuned from model:** [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- **Dataset:** (a sketch of the SFT "mixed train" data assembly follows this list)
- SFT (mixed train):
- [yahma/alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned)
- [vicgalle/alpaca-gpt4](https://huggingface.co/datasets/vicgalle/alpaca-gpt4)
- DPO (mixed train):
- [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
- [Unified-Language-Model-Alignment/Anthropic_HH_Golden](https://huggingface.co/datasets/Unified-Language-Model-Alignment/Anthropic_HH_Golden)
- PPO:
- [PKU-Alignment/PKU-SafeRLHF-10K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-10K)
- [PKU-Alignment/PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K)
- [PKU-Alignment/PKU-SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF)
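A sketch of how the SFT mixture listed above could be assembled, under the assumption that "mixed train" simply concatenates and shuffles the two Alpaca-style datasets (the exact preprocessing is not documented in this card):

```python
from datasets import load_dataset, concatenate_datasets

alpaca_cleaned = load_dataset("yahma/alpaca-cleaned", split="train")
alpaca_gpt4 = load_dataset("vicgalle/alpaca-gpt4", split="train")

# Both datasets share the Alpaca schema (instruction / input / output), so extra
# columns are dropped and the two are concatenated and shuffled into one SFT mixture.
keep = ["instruction", "input", "output"]
alpaca_cleaned = alpaca_cleaned.remove_columns([c for c in alpaca_cleaned.column_names if c not in keep])
alpaca_gpt4 = alpaca_gpt4.remove_columns([c for c in alpaca_gpt4.column_names if c not in keep])

sft_mixture = concatenate_datasets([alpaca_cleaned, alpaca_gpt4]).shuffle(seed=42)
print(sft_mixture)
```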
### Training Results
![image/png](/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F65b1dd2a855f6b5fe621bc0e%2Fmiik5Tb6A8G6sDTlnQA-V.png%3C%2Fspan%3E)
### Evaluation
Reward and toxicity scores are computed for the SFT/DPO/PPO models on the [PKU-Alignment/PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K) data and compared below (a scoring sketch follows the figure).
| Model | Toxicity | Reward |
| ----- |:--------:|:--------:|
| SFT_v0.1 | 0.0698 | -0.2828 |
| DPO_v0.1 | 0.0356 | -0.2633 |
| PPO_v0.1 | 0.0321 | 0.38 |
![image/png](/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F65b1dd2a855f6b5fe621bc0e%2Fm-k6kUuIJVTkYM2l3uBPd.png%3C%2Fspan%3E)%3C%2Fspan%3E
### Compute Infrastructure
The model was trained on 8 × RTX 3090 (24 GB) / A100-PCIE-40GB GPUs.
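For reference, a minimal DeepSpeed ZeRO-1 configuration of the kind that could drive this 8-GPU setup with the HF `Trainer`/TRL. The actual config file is not published, so the values below are placeholders:

```python
# Placeholder DeepSpeed ZeRO stage-1 config (illustrative values, not the published setup)
ds_zero1_config = {
    "zero_optimization": {"stage": 1},
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

# With the HF Trainer / TRL trainers it can be passed via TrainingArguments, e.g.:
# TrainingArguments(..., deepspeed=ds_zero1_config)
```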
### Inference
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/this/model"  # set to this model's repo id or a local path

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Align pad/EOS tokens; DEFINE_EOS_TOKEN is a placeholder for the EOS token string used in training
tokenizer.pad_token = tokenizer.eos_token
tokenizer.eos_token = DEFINE_EOS_TOKEN
model.config.eos_token = DEFINE_EOS_TOKEN
model.config.eos_token_id = tokenizer.eos_token_id

# Prompt template used during fine-tuning
def format_prompt(question):
    return f"###Question: {question}\n###Answer: "

instruction = "Your text here"
prompt = format_prompt(instruction)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(inputs["input_ids"], max_new_tokens=512, do_sample=False, top_p=1)
output = tokenizer.decode(output[0], skip_special_tokens=True)
print(output)
```
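Since the DPO/PPO stages were trained as LoRA adapters, an alternative to loading merged weights is to attach the adapter with PEFT; the adapter path below is a placeholder:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base (or SFT) weights, then attach the LoRA adapter with PEFT
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "path/to/dpo-or-ppo-lora-adapter")  # placeholder adapter path
model = model.merge_and_unload()  # optional: merge the adapter into the base weights for inference
```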
## Model Card Authors
Yiyu (Michael) Ren
## Model Card Contact
Email: [email protected]
### Framework versions
- PEFT 0.8.2