Spaces: zentropi-ai/cope-demo

samidh committed on Dec 11, 2024

Commit b974b73 (verified) · 1 Parent(s): 5d416d8

Update app.py

Files changed (1)

app.py +20 -14

@@ -58,16 +58,19 @@ ANSWER
 DEFAULT_POLICY = """# Policy
-## Instructions
 This policy is designed to determine whether or not content is hate speech.
 ## Definition of Terms
-- Protected Classes: Individuals or groups of individuals defined by sharing a race, ethnicity, religion, disability, sex, gender identity, or sexual orientation. The scope is narrowed to focus on characteristics commonly recognized in anti-discrimination laws. Religions themselves, as concepts ARE NOT protected - only their members are.
-- Non-Protected Classes: Groups defined by other characteristics, such as age, socioeconomic status, and education, which are typically excluded partially or entirely from anti-discrimination laws.
-## Interpretation of Language
 ## Definition of Labels
@@ -75,19 +78,22 @@ This policy is designed to determine whether or not content is hate speech.
 #### Includes
-- Explicit Violence Advocacy: Direct and clear threats of or praise for violence or harm against individuals or groups based on their membership in a particular protected class. Stating an intention to defend against aggression DOES NOT qualify
-- Intentional Dehumanization: Statements that clearly depict individuals or groups as inherently ""other"", alien, animalistic, unintelligent, immoral, unclean, or less-than-fully-human based on their membership in a particular protected class in a way that justifies harm or discrimination.
-- Targeted Use of Derogatory Slurs: Targeting another person or group of people using a one-word name for a particular protected class that has an inherent negative connotation (e.g. Nigger, Kike, Cunt, Retard). Multi-word terms are never slurs.
-- Explicit Discrimination Advocacy: Direct and clear calls for exclusion, segregation, or discrimination against individuals or groups based on their membership in a particular protected class, with a clear intent to promote inequality.
-- Direct Hateful Insults: Content that directly addresses another person or group of people the second person (e.g. ""You over there"") and insults them based on their membership in a particular protected class
 #### Excludes
-- Artistic and Educational Content: Expressions intended for artistic, educational, or documentary purposes that discuss sensitive topics but do not advocate for violence or discrimination against individuals or groups based on their membership in a particular protected class.
-- Political and Social Commentary: Commentary on political issues, social issues, and political ideologies that does not directly incite violence or discrimination against individuals or groups based on their membership in a particular protected class.
-- Rebutting Hateful Language: Content that rebuts, condemns, questions, criticizes, or mocks a different person's hateful language or ideas OR that insults the person advocating those hateful
-- Quoting Hateful Language: Content in which the author quotes someone else's hateful language or ideas while discussing, explaining, or neutrally factually presenting those ideas.
-- Describing Sectarian Violence: Content that describes, but does not endorse or praise, violent physical injury against a specifically named race, ethnicity, nationality, sexual orientation, or religious community by another specifically named race, ethnicity, nationality, sexual orientation, or religious community
 """
 DEFAULT_CONTENT = "LLMs steal our jobs."

 DEFAULT_POLICY = """# Policy
+## Overview
 This policy is designed to determine whether or not content is hate speech.
 ## Definition of Terms
+- Protected Classes: Individuals or groups distinguished by race (e.g. Black, White), ethnicity (e.g. Hispanic), nationality (e.g. Canadian, Argentinian), caste (e.g. Brahmins, Kshatriyas, Vaishyas), immigration status (e.g. Immigrants), religion (e.g. Catholics, Jews), disability (e.g. blind people, deaf people), disease status (e.g. people with AIDS), sexual orientation (e.g. gays, bisexuals), transgender status (e.g. trans people, cis people), gender identity (e.g. women, men), gender expression (e.g. masculine, feminine), or sex (e.g. male, female).
+- Derogatory Slur: Any term or phrase used pejoratively or contemptuously towards a member or members of a protected class, including those historically reclaimed but used in a derogatory context.
+- Non-Protected Classes: Categories defined by political party membership (e.g. Democrats, Tories), political beliefs (e.g. fascists, communists), hobbies (e.g. gamers, knitters), criminal history (e.g. criminals, convicts), or profession (e.g. doctors, farmers).
+## Interpretation of Language:
+- Terminology or phrasing that can be construed in various ways IS hate speech UNLESS context clearly indicates it unambiguously IS NOT hate speech. So, for example, attacks on a religion as an institution (e.g. "Hinduism is a garbage religion") SHOULD be understood as attacks on the followers of that religion.
 ## Definition of Labels
 #### Includes
+- Direct and Indirect Violence Advocacy: Any form of expression that celebrates, justifies, incites, or threatens physical or psychological harm against individuals or groups identified by their membership in a protected class.
+- Dehumanization and Vilification: Content suggesting that individuals or groups, explicitly identified by their membership in a specific protected class, are subhuman, inherently lesser beings, or using animalistic descriptions in a way that promotes disdain or hate.
+- Derogatory and Dehumanizing Language: Use of slurs, epithets, or any derogatory language aimed at belittling, humiliating, or inciting hatred against individuals or groups explicitly identified by their membership in a specific protected class.
+- Explicit and Implicit Discrimination Advocacy: Promoting exclusion, segregation, or denial of rights against individuals or groups explicitly identified by their membership in a specific protected class.
+- Collective Attribution of Negative Actions: Assigning collective blame or advocating collective punishment based on the actions or perceived characteristics of individuals or groups identified by their membership in a protected class.
+- Inferiority and Superiority Claims: Statements that categorically assign inferiority or superiority, moral or intellectual, to individuals or groups identified by their membership in a protected class.
+- Denial or Distortion of Historical Atrocities: Denying, grossly trivializing, or distorting documented atrocities against groups identified by their membership in a protected class, undermining their significance or the suffering of their members.
+- Conspiracy Theories: Propagating unfounded allegations that individuals or groups, identified by their membership in a protected class, are responsible for serious harms or controlling significant institutions to the detriment of society.
 #### Excludes
+- Attacks on Non-Protected Classes: Content that attacks or criticizes individuals or groups identified by their membership in a Non-Protected Class, EVEN if that attack is violent, threatening, or otherwise hateful (e.g. "Criminals should all be rounded up and shot!").
+- Criticism of Beliefs and Institutions: Constructive critique or discussion of political ideologies, religious doctrines, or institutions without resorting to hate speech or targeting individuals or groups identified by their membership in a protected class.
+- Neutrally Reporting Historical Events: Neutrally and descriptively reporting or discussing factual events in the past that could be construed as negative about individuals or groups identified by their membership in a protected class.
+- Pushing Back on Hateful Language: Content where the writer pushes back on, condemns, questions, criticizes, or mocks a different person's hateful language or ideas.
+- Artistic and Educational Content: Content with legitimate artistic, educational, or documentary value that discusses or portrays issues related to hate speech in a context clearly aimed at enlightening or informing without promoting hate.
 """
 DEFAULT_CONTENT = "LLMs steal our jobs."
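
For readers who want to try out the new policy text, here is a minimal sketch of how a policy string and a content string like the ones above might be inspected and combined into a single classifier prompt. The rest of app.py is not shown in this diff, so `rule_names`, `build_prompt`, and the `POLICY` / `CONTENT` / `ANSWER` layout are hypothetical. Only the trailing `ANSWER` marker is hinted at by the hunk header. `POLICY` is a shortened stand-in that reuses two bullets from the new version verbatim.

```python
import re

# Shortened stand-in for DEFAULT_POLICY (two bullets copied from the new version).
POLICY = """# Policy
## Overview
This policy is designed to determine whether or not content is hate speech.
## Definition of Labels
#### Includes
- Inferiority and Superiority Claims: Statements that categorically assign inferiority or superiority, moral or intellectual, to individuals or groups identified by their membership in a protected class.
#### Excludes
- Pushing Back on Hateful Language: Content where the writer pushes back on, condemns, questions, criticizes, or mocks a different person's hateful language or ideas.
"""
CONTENT = "LLMs steal our jobs."


def rule_names(policy: str) -> dict[str, list[str]]:
    """Group bullet names under the nearest preceding markdown heading."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in policy.splitlines():
        heading = re.match(r"#+\s*(.+?):?\s*$", line)  # tolerate a trailing colon
        bullet = re.match(r"-\s*([^:]+):", line)       # "- Name: description"
        if heading:
            current = heading.group(1)
            sections[current] = []
        elif bullet and current is not None:
            sections[current].append(bullet.group(1).strip())
    return sections


def build_prompt(policy: str, content: str) -> str:
    """Hypothetical template: the real one in app.py is not visible in this diff."""
    return f"POLICY\n{policy.strip()}\n\nCONTENT\n{content}\n\nANSWER\n"


sections = rule_names(POLICY)
prompt = build_prompt(POLICY, CONTENT)
print(sections["Includes"], sections["Excludes"])
```

Listing the rule names this way is a quick check that an edited policy still has the `#### Includes` and `#### Excludes` groups the labels depend on before it is sent to a model.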