license: bigscience-bloom-rail-1.0
language:
  - fr
  - en
pipeline_tag: text-classification

Bloomz-3b-guardrail

We introduce the Bloomz-3b-guardrail model, a fine-tuned version of the Bloomz-3b-sft-chat model. It is designed to score the toxicity of a text along five modes:

  • Obscene: Content that is offensive, indecent, or morally inappropriate, especially in relation to social norms or standards of decency.
  • Sexual explicit: Content that presents explicit sexual aspects in a clear and detailed manner.
  • Identity attack: Content that aims to attack, denigrate, or harass someone based on their identity, especially related to characteristics such as race, gender, sexual orientation, religion, ethnic origin, or other personal aspects.
  • Insult: Offensive, disrespectful, or hurtful content used to attack or denigrate a person.
  • Threat: Content that presents a direct threat to an individual.

This kind of model is well suited to monitoring and controlling the output of generative models, as well as to measuring the degree of toxicity of the generated text.

Training

The training dataset consists of 500k comments in English and 500k comments in French (translated by Google Translate), each annotated with a graded toxicity severity score. The dataset is provided by Jigsaw as part of a Kaggle competition: Jigsaw Unintended Bias in Toxicity Classification. Since the scores are severity grades rather than binary labels, the task was treated as a regression with the following loss function:

$$\mathrm{loss}=l_{\mathrm{obscene}}+l_{\mathrm{sexual\_explicit}}+l_{\mathrm{identity\_attack}}+l_{\mathrm{insult}}+l_{\mathrm{threat}}$$

with

$$l_i=\frac{1}{\vert\mathcal{O}\vert}\sum_{o\in\mathcal{O}}\vert\mathrm{score}_{i,o}-\sigma(\mathrm{logit}_{i,o})\vert$$

where $\sigma$ is the sigmoid function and $\mathcal{O}$ is the set of training observations.
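The loss described above can be sketched in a few lines of plain Python. This is only an illustration of the formula, not the training code used for the model: the function names, the `MODES` list and the nested-list input layout are assumptions made for the example.

```python
import math

# Order of the five toxicity modes (assumed here for illustration only).
MODES = ["obscene", "sexual_explicit", "identity_attack", "insult", "threat"]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def guardrail_loss(scores, logits):
    """Sum over the five modes of the mean absolute error between the
    reference score and sigmoid(logit).

    scores, logits: one inner list per observation, each holding one
    value per mode (hypothetical layout).
    """
    n_obs = len(scores)
    total = 0.0
    for i in range(len(MODES)):
        # l_i: mean absolute error for mode i over all observations
        l_i = sum(
            abs(scores[o][i] - sigmoid(logits[o][i])) for o in range(n_obs)
        ) / n_obs
        total += l_i
    return total
```

With all logits at 0 the sigmoid outputs 0.5 for every mode, so reference scores of 0.5 give a loss of 0 and reference scores of 0 give 0.5 per mode.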

Benchmark

As the scores range from 0 to 1, a performance measure such as RMSE may be challenging to interpret. Therefore, the Pearson correlation coefficient was chosen as the main measure. It ranges from -1 to 1, where 0 represents no correlation, -1 a perfect negative correlation, and 1 a perfect positive correlation. The goal is to quantify the correlation between the model's scores and the scores assigned by judges for 730 comments not seen during training.

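| Model | Language | Obscene (x100) | Sexual explicit (x100) | Identity attack (x100) | Insult (x100) | Threat (x100) | Mean |
|-----------------------|----------|:--:|:--:|:--:|:--:|:--:|:--:|
| Bloomz-560m-guardrail | French   | 62 | 73 | 73 | 68 | 61 | 67 |
| Bloomz-560m-guardrail | English  | 63 | 61 | 63 | 67 | 55 | 62 |
| Bloomz-3b-guardrail   | French   | 72 | 82 | 80 | 78 | 77 | 78 |
| Bloomz-3b-guardrail   | English  | 76 | 78 | 77 | 75 | 79 | 77 |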

With a mean correlation of roughly 0.62 to 0.67 for the 560m model and 0.77 to 0.78 for the 3b model (62 to 67 and 77 to 78 on the x100 scale of the table), the outputs correlate strongly with the judges' scores, and the 3b model is clearly ahead.
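For readers who want to reproduce this kind of measurement on their own data, the Pearson correlation between reference and predicted scores for one mode can be computed as in the sketch below. The function name and the toy values are illustrative assumptions, not the evaluation script behind the table.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy example: judge scores vs. model scores for a single mode.
judge_scores = [0.0, 0.2, 0.5, 0.9]
model_scores = [0.1, 0.1, 0.6, 0.8]
r = pearson(judge_scores, model_scores)
print(f"Pearson r (x100): {100 * r:.0f}")
```

Multiplying the coefficient by 100 gives values on the same scale as the table.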

We also report the MAE (Mean Absolute Error), which measures the average gap between the model's scores and the judges' scores. Each value is given with the standard deviation of the error.

| Model | Language | Obscene | Sexual explicit | Identity attack | Insult | Threat | Mean |
|-----------------------|----------|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|
| Bloomz-560m-guardrail | French   | 0.06 ± 0.09 | 0.03 ± 0.07 | 0.03 ± 0.07 | 0.13 ± 0.13 | 0.04 ± 0.06 | 0.06 ± 0.08 |
| Bloomz-560m-guardrail | English  | 0.06 ± 0.09 | 0.03 ± 0.08 | 0.03 ± 0.08 | 0.14 ± 0.13 | 0.04 ± 0.07 | 0.06 ± 0.09 |
| Bloomz-3b-guardrail   | French   | 0.05 ± 0.08 | 0.02 ± 0.06 | 0.02 ± 0.06 | 0.11 ± 0.11 | 0.03 ± 0.05 | 0.05 ± 0.07 |
| Bloomz-3b-guardrail   | English  | 0.05 ± 0.08 | 0.03 ± 0.07 | 0.02 ± 0.06 | 0.12 ± 0.11 | 0.03 ± 0.05 | 0.05 ± 0.07 |
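As a rough illustration of how an "MAE ± standard deviation" figure of this kind can be obtained, the sketch below computes both values for one mode. It assumes that the deviation is the population standard deviation of the absolute errors; the model card does not specify this, and the function name is hypothetical.

```python
import math


def mae_with_std(reference, predicted):
    """Return (MAE, standard deviation of the absolute errors).

    Assumption: population standard deviation (divides by n).
    """
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(r - p) for r, p in zip(reference, predicted)]
    mae = sum(errors) / len(errors)
    variance = sum((e - mae) ** 2 for e in errors) / len(errors)
    return mae, math.sqrt(variance)


# Toy example with four comments.
mae, std = mae_with_std([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.3, 0.3])
print(f"{mae:.2f} ± {std:.2f}")
```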

How to Use Bloomz-3b-guardrail

The following example uses the pipeline API of the Transformers library.

from transformers import pipeline

guardrail = pipeline("text-classification", "cmarkea/bloomz-3b-guardrail")

list_text = [...]
result = guardrail(
    list_text,
    top_k=None,  # Crucial for assessing all modalities of toxicity (replaces the deprecated return_all_scores=True)
    function_to_apply='sigmoid'  # To ensure obtaining a score between 0 and 1
)
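When all scores are requested for a list of texts, the pipeline returns one list per input text, holding one `{"label": ..., "score": ...}` dictionary per toxicity mode. The sketch below shows one possible way to turn that output into a guardrail decision. The helper `flag_toxic`, the 0.5 threshold and the label strings in the mock data are assumptions for illustration; check the actual label names in the model configuration.

```python
def flag_toxic(results, threshold=0.5):
    """For each text, list the toxicity modes whose score reaches the threshold.

    results: pipeline output with all scores, i.e. one list of
             {"label": str, "score": float} dicts per input text.
    threshold: illustrative cut-off, to be tuned for the use case.
    """
    flagged = []
    for per_text in results:
        modes = sorted(
            item["label"] for item in per_text if item["score"] >= threshold
        )
        flagged.append(modes)
    return flagged


# Mock output with hypothetical label names (no model download needed).
mock_result = [
    [
        {"label": "obscene", "score": 0.91},
        {"label": "insult", "score": 0.72},
        {"label": "threat", "score": 0.03},
    ],
    [
        {"label": "obscene", "score": 0.01},
        {"label": "insult", "score": 0.02},
        {"label": "threat", "score": 0.01},
    ],
]
flags = flag_toxic(mock_result)
blocked = [bool(modes) for modes in flags]  # True when a text should be blocked
```

A single threshold for all modes is the simplest choice. Since the MAE differs between modes, per-mode thresholds may be worth considering.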

Citation

@online{DeBloomzGuard,
  AUTHOR = {Cyrile Delestre},
  ORGANIZATION = {Cr{\'e}dit Mutuel Ark{\'e}a},
  URL = {https://huggingface.co/cmarkea/bloomz-3b-guardrail},
  YEAR = {2023},
  KEYWORDS = {NLP ; Transformers ; LLM ; Bloomz},
}