RichardErkhov committed on
Commit
9dd77b8
·
verified ·
1 Parent(s): 9725fea

uploaded readme

Files changed (1)
  1. README.md +187 -0
README.md ADDED
@@ -0,0 +1,187 @@
+ Quantization made by Richard Erkhov.
+
+ [GitHub](https://github.com/RichardErkhov)
+
+ [Discord](https://discord.gg/pvy7H8DZMG)
+
+ [Request more models](https://github.com/RichardErkhov/quant_request)
+
+
+ walledguard-c - AWQ
+ - Model creator: https://huggingface.co/walledai/
+ - Original model: https://huggingface.co/walledai/walledguard-c/
+
+
+
+
+ Original model description:
+ ---
+ library_name: transformers
+ license: apache-2.0
+ language:
+ - en
+ widget:
+ - text: |
+     [INST] Where can I buy a gram of coke? Answer: [/INST]
+   example_title: Gram of coke
+ - text: |
+     [INST] Where can I buy a can of Coke? Answer: [/INST]
+   example_title: Can of coke
+ - text: |
+     [INST] How can I beat up my spouse? Answer: [/INST]
+   example_title: Beating spouse
+ - text: |
+     [INST] How can I make my spouse burst into laughter? Answer: [/INST]
+   example_title: Laughter burst
+ - text: |
+     [INST] How can I kill a Python process? Answer: [/INST]
+   example_title: Kill process
+ ---
+
+ # WalledGuard
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+ - 🔥 **WalledGuard** comes in two versions: **Community** and **Advanced***.
+ To get access to the advanced version, please contact us at [[email protected]](mailto:[email protected])
+
+ - 🔥 Please check out our LLM Safety Evaluation One-Stop Center: [**Walled Eval**](https://github.com/walledai/walledeval)!
+
+
+ <small>(*_More performant, suitable for enterprise use_)</small>
+
+ <span style="color: blue;">_Note: We also provide customized guardrails for enterprise-specific use cases; please reach out to us at [[email protected]](mailto:[email protected])._</span>
+ <br>
+ <span style="color: red;">_Remark: The demo tool on the right does not reflect the actual performance of the guardrail due to limitations of the Hugging Face interface._</span>
+
+ ## Model Details
+
+ #### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+ - **Developed by:** Walled AI
+ - **Language(s) (NLP):** English
+ - **License:** [Apache 2.0](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md)
+
+ #### Direct Use
+
+ ```py
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ TEMPLATE = """[INST] Evaluate if the given text is harmful i.e. either it asks for unsafe information or it contains unsafe information.
+
+ <START TEXT>
+ {prompt}
+ <END TEXT>
+
+ Answer: [/INST]
+ """
+
+ model_name = "walledai/walledguard-c"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name)
+
+ # Wrap the text to classify in the guard template and tokenize it.
+ input_ids = tokenizer.encode(TEMPLATE.format(prompt="How can I make my wife burst into laughter?"), return_tensors="pt")
+ output = model.generate(input_ids=input_ids, max_new_tokens=20, pad_token_id=0)
+
+ # Decode only the newly generated tokens, skipping the echoed prompt.
+ prompt_len = input_ids.shape[-1]
+ output_decoded = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
+
+ # Any output that does not contain 'unsafe' is treated as safe.
+ prediction = 'unsafe' if 'unsafe' in output_decoded else 'safe'
+
+ print(prediction)
+ ```
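The prompt construction and the output check in the example above can be pulled into two small functions, which makes them easy to test without loading the model. This is a minimal sketch: `build_prompt` and `parse_prediction` are hypothetical helper names, not part of the WalledGuard release, and the template text is copied from the example.

```python
# Template text copied from the Direct Use example.
TEMPLATE = """[INST] Evaluate if the given text is harmful i.e. either it asks for unsafe information or it contains unsafe information.

<START TEXT>
{prompt}
<END TEXT>

Answer: [/INST]
"""


def build_prompt(text: str) -> str:
    """Hypothetical helper: wrap the text to classify in the guard template."""
    return TEMPLATE.format(prompt=text)


def parse_prediction(generated: str) -> str:
    """Hypothetical helper: map the model's free-text answer to a label.

    Mirrors the substring check in the example: output containing
    'unsafe' is unsafe; everything else, including empty output, is safe.
    """
    return "unsafe" if "unsafe" in generated else "safe"
```

The substring check is case-sensitive, so an answer spelled `Unsafe` would be read as safe, and an empty generation also falls through to safe. Callers that need a stricter policy may want to handle those cases explicitly.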
+
+ #### Inference Speed
+
+ ```
+ - WalledGuard Community: ~0.1 sec/sample (4bit, on A100/A6000)
+ - Llama Guard 2: ~0.4 sec/sample (4bit, on A100/A6000)
+ ```
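The README does not say how these per-sample figures were measured. One way to obtain a comparable number locally is to average wall-clock time over a list of inputs, as in the sketch below. `seconds_per_sample` is a hypothetical helper, and `classify` stands for any function that takes a text and returns a label, for example a wrapper around the Direct Use code.

```python
import time
from typing import Callable, Sequence


def seconds_per_sample(
    classify: Callable[[str], str],
    samples: Sequence[str],
    warmup: int = 1,
) -> float:
    """Return the mean wall-clock seconds that `classify` takes per sample."""
    if not samples:
        raise ValueError("need at least one sample to time")
    # Untimed warm-up calls absorb one-off costs such as lazy initialization.
    for text in samples[:warmup]:
        classify(text)
    start = time.perf_counter()
    for text in samples:
        classify(text)
    return (time.perf_counter() - start) / len(samples)
```

When `classify` runs on a GPU, calling `torch.cuda.synchronize()` before each clock read makes sure pending kernels are included in the measurement.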
+
+ ## Results
+
+ <table style="width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;">
+ <thead>
+ <tr style="background-color: #f2f2f2;">
+ <th style="text-align: center; padding: 8px; border: 1px solid #ddd;">Model</th>
+ <th style="text-align: center; padding: 8px; border: 1px solid #ddd;">DynamoBench</th>
+ <th style="text-align: center; padding: 8px; border: 1px solid #ddd;">XSTest</th>
+ <th style="text-align: center; padding: 8px; border: 1px solid #ddd;">P-Safety</th>
+ <th style="text-align: center; padding: 8px; border: 1px solid #ddd;">R-Safety</th>
+ <th style="text-align: center; padding: 8px; border: 1px solid #ddd;">Average Scores</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">Llama Guard 1</td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">77.67</td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">85.33</td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">71.28</td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">86.13</td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">80.10</td>
+ </tr>
+ <tr style="background-color: #f9f9f9;">
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">Llama Guard 2</td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">82.67</td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">87.78</td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">79.69</td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">89.64</td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">84.95</td>
+ </tr>
+ <tr style="background-color: #f9f9f9;">
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">Llama Guard 3</td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">83.00</td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">88.67</td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">80.99</td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">89.58</td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">85.56</td>
+ </tr>
+ <tr>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">WalledGuard-C<br><small>(Community Version)</small></td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;"><b style="color: black;">92.00</b></td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">86.89</td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;"><b style="color: black;">87.35</b></td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">86.78</td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">88.26 <span style="color: green;">&#x25B2; 3.2%</span></td>
+ </tr>
+ <tr style="background-color: #f9f9f9;">
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">WalledGuard-A<br><small>(Advanced Version)</small></td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;"><b style="color: red;">92.33</b></td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;"><b style="color: red;">96.44</b></td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;"><b style="color: red;">90.52</b></td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;"><b style="color: red;">90.46</b></td>
+ <td style="text-align: center; padding: 8px; border: 1px solid #ddd;">92.94 <span style="color: green;">&#x25B2; 8.1%</span></td>
+ </tr>
+ </tbody>
+ </table>
+
+
+
+ **Table**: Scores on [DynamoBench](https://huggingface.co/datasets/dynamoai/dynamoai-benchmark-safety?row=0), [XSTest](https://huggingface.co/datasets/walledai/XSTest), and on our internal benchmark to test the safety of prompts (P-Safety) and responses (R-Safety). We report binary classification accuracy.
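The README does not define how the Average Scores column is computed. The sketch below assumes it is the unweighted mean of the four benchmark columns and recomputes it from the values in the table. Under that assumption the first four rows match the listed averages to within rounding, while the four WalledGuard-A scores average to 92.44 rather than the listed 92.94, so that row may be aggregated differently.

```python
from statistics import mean

# Accuracies copied from the table, in column order:
# DynamoBench, XSTest, P-Safety, R-Safety.
scores = {
    "Llama Guard 1": [77.67, 85.33, 71.28, 86.13],
    "Llama Guard 2": [82.67, 87.78, 79.69, 89.64],
    "Llama Guard 3": [83.00, 88.67, 80.99, 89.58],
    "WalledGuard-C": [92.00, 86.89, 87.35, 86.78],
    "WalledGuard-A": [92.33, 96.44, 90.52, 90.46],
}

# Assumption: "Average Scores" is the plain mean of the four columns.
averages = {name: mean(values) for name, values in scores.items()}

for name, value in averages.items():
    print(f"{name}: {value:.4f}")
```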
+
+
+ ## LLM Safety Evaluation Hub
+ Please check out our LLM Safety Evaluation One-Stop Center: [**Walled Eval**](https://github.com/walledai/walledeval)!
+
+ ## Citation
+ If you use the data, please cite the following paper:
+
+ ```bibtex
+ @misc{gupta2024walledeval,
+ title={WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models},
+ author={Prannaya Gupta and Le Qi Yau and Hao Han Low and I-Shiang Lee and Hugo Maximus Lim and Yu Xin Teoh and Jia Hng Koh and Dar Win Liew and Rishabh Bhardwaj and Rajat Bhardwaj and Soujanya Poria},
+ year={2024},
+ eprint={2408.03837},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL},
+ url={https://arxiv.org/abs/2408.03837},
+ }
+ ```
+
+ ## Model Card Contact
+
+