# This is a project of Zentropi Inc. All rights reserved.

import gradio as gr
import os

import torch
import torch.nn.functional as F
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

device = 'cuda' if torch.cuda.is_available() else 'cpu'

base_model_name = "google/gemma-2-9b"
adapter_model_name = "zentropi-ai/cope-a-9b"

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
)

model = AutoModelForCausalLM.from_pretrained(base_model_name,
                                             token=os.environ['HF_TOKEN'],
                                             quantization_config=bnb_config,
                                             device_map="auto")
model = PeftModel.from_pretrained(model, adapter_model_name, token=os.environ['HF_TOKEN'])
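# merge_and_unload() returns the merged base model, so keep its result.
model = model.merge_and_unload()

# device_map="auto" already places the weights, so no extra .to(device) call is
# needed (transformers can reject .to() on bitsandbytes-quantized models).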

tokenizer = AutoTokenizer.from_pretrained(base_model_name)

PROMPT = """
INSTRUCTIONS
============
Examine the given POLICY and determine if the given CONTENT meets the criteria for ANY of the LABELS. Answer "1" if yes, and "0" if no.

POLICY
======

{policy}


CONTENT
=======

{content}


ANSWER
======

"""

DEFAULT_POLICY = """# Policy

## Overview

This policy is designed to determine whether or not content is hate speech.

## Definition of Terms

- Protected Classes: Individuals or groups distinguished by race (e.g. Black, White), ethnicity (e.g. Hispanic, Jewish), nationality (e.g. Canadian, Argentinian), caste (e.g. Brahmins, Kshatriyas, Vaishyas), immigration status (e.g. Immigrants), religion (e.g. Catholics, Jews), disability (e.g. blind people, deaf people), disease status (e.g. people with AIDS), sexual orientation (e.g. gays, bisexuals), transgender status (e.g. trans people, cis people), gender identity (e.g. women, men), gender expression (e.g. masculine, feminine), or sex (e.g. male, female).
- Derogatory Slur: Any term or phrase used pejoratively or contemptuously towards a member or members of a protected class, including those historically reclaimed but used in a derogatory context.
- Non-Protected Classes: Categories defined by political party membership (e.g. Democrats, Tories), political beliefs (e.g. fascists, communists), hobbies (e.g. gamers, knitters), criminal history (e.g. criminals, convicts), profession (e.g. doctors, farmers), or weight or body size (e.g. fat people, thin people).

## Interpretation of Language:

- In interpreting this policy, ambiguous phrases should be understood based on their most reasonable contextual meaning. For instance, criticisms directed at a religious system (e.g. "Buddhism is a flawed ideology") should NOT be construed as attacks against the adherents of that faith.

## Definition of Labels

### (HS): Hate Speech

#### Includes

- Direct and Indirect Violence Advocacy: Any form of expression that celebrates, justifies, incites, or threatens physical or psychological harm against individuals or groups identified by their membership in a protected class.
- Dehumanization and Vilification: Content suggesting that individuals or groups, explicitly identified by their membership in a specific protected class, are subhuman, inherently lesser beings, or using animalistic descriptions in a way that promotes disdain or hate.
- Derogatory and Dehumanizing Language: Use of slurs, epithets, or any derogatory language aimed at belittling, humiliating, or inciting hatred against individuals or groups explicitly identified by their membership in a specific protected class.
- Explicit and Implicit Discrimination Advocacy: Promoting exclusion, segregation, or denial of rights against individuals or groups explicitly identified by their membership in a specific protected class.
- Collective Attribution of Negative Actions: Assigning collective blame or advocating collective punishment based on the actions or perceived characteristics of individuals or groups identified by their membership in a protected class.
- Inferiority and Superiority Claims: Statements that categorically assign inferiority or superiority, moral or intellectual, to individuals or groups identified by their membership in a protected class.
- Denial or Distortion of Historical Atrocities: Denying, grossly trivializing, or distorting documented atrocities against groups identified by their membership in a protected class, undermining their significance or the suffering of their members.
- Conspiracy Theories: Propagating unfounded allegations that individuals or groups, identified by their membership in a protected class, are responsible for serious harms or controlling significant institutions to the detriment of society.

#### Excludes

- Attacks on Non-Protected Classes: Content that attacks or criticizes individuals or groups identified by their membership in a Non-Protected Class, EVEN if that attack is violent, threatening, or otherwise hateful (e.g. "Criminals should all be rounded up and shot!").
- Criticism of Beliefs and Institutions: Constructive critique or discussion of political ideologies, religious doctrines, or institutions without resorting to hate speech or targeting individuals or groups identified by their membership in a protected class.
- Attacking Leaders: Content that critiques, mocks, or insults the leaders of religions, leaders of religious institutions, or religious prophets or deities, BUT does not contain the singular or plural noun for followers of that religion.
- Condemning Violent Extremism: Content that condemns, mocks, insults, dehumanizes, or calls for violence against terrorist organizations and violent hate groups, or their members.
- Neutrally Reporting Historical Events: Neutrally and descriptively reporting or discussing factual events in the past that could be construed as negative about individuals or groups identified by their membership in a protected class.
- Pushing Back on Hateful Language: Content where the writer pushes back on, condemns, questions, criticizes, or mocks a different person's hateful language or ideas.
- Disease Discussion: Content in which the author discusses diseases without direct references to people with the disease.
- Quoting Hateful Language: Content in which the author quotes someone else's hateful language or ideas while discussing, explaining, or neutrally and factually presenting those ideas.
"""

DEFAULT_CONTENT = "Put your content sample here."

# Function to make predictions
def predict(content, policy):
    """Return a readable verdict on whether `content` meets the policy's label criteria."""
    input_text = PROMPT.format(policy=policy, content=content)
    # Keep the inputs on the same device as the model
    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(model.device)

    with torch.inference_mode():
        outputs = model(input_ids)

        # Get logits for the token that follows the prompt
        logits = outputs.logits[:, -1, :]

        # Get the most likely next token ID (greedy choice)
        predicted_token_id = torch.argmax(logits, dim=-1).item()

        # Decode the predicted token
        decoded_output = tokenizer.decode([predicted_token_id])

    # The prompt asks the model to answer "1" (yes) or "0" (no)
    if decoded_output == '1':
        return 'True (i.e., Meets Label Criteria)'
    else:
        return 'False (i.e., Does NOT Meet Label Criteria)'


# Create the interface

with gr.Blocks() as demo:
    gr.Markdown("# Zentropi CoPE Demo")

    with gr.Row():
        # Left column with inputs
        with gr.Column(scale=1):
            input1 = gr.Textbox(label="Content (English only)", lines=2, max_lines=4, value=DEFAULT_CONTENT)
            input2 = gr.Textbox(label="Policy (try your own)", lines=11, max_lines=19, value=DEFAULT_POLICY)
            
        # Right column with output
        with gr.Column(scale=1):
            output = gr.Textbox(label="Result")
            submit_btn = gr.Button("Submit")
            notes = gr.Markdown("""
## About CoPE

[CoPE](https://huggingface.co/zentropi-ai/cope-a-9b) (the COntent Policy Evaluator) is a small language model capable of accurate content policy labeling. This is a **demo** of our initial release and should **NOT** be used for any production use cases. See the full [model card](https://huggingface.co/zentropi-ai/cope-a-9b) for details.

## How to Use

1. Enter your content in the "Content" box.
2. Specify your policy in the "Policy" box.
3. Click "Submit" to see the results.

## More Info

- [Read our FAQ](https://docs.google.com/document/d/1Cp3GJ5k2I-xWZ4GK9WI7Xv8TpKdHmjJ3E9RbzP5Cc_Y/edit) to learn more about CoPE
- [Give us feedback](https://forms.gle/BHpt6BpH2utaf4ez9) to help us improve
- [Join our mailing list](https://forms.gle/PCABrZdhTuXE9w9ZA) to keep in touch
- [Contact us](https://forms.gle/PCABrZdhTuXE9w9ZA) for info about our partner program
            """)
    
    # Set up the processing function
    submit_btn.click(
        fn=predict,
        inputs=[input1, input2],
        outputs=output,
        api_name=False
    )

# Launch the interface
demo.launch(show_api=False)