Update Space (evaluate main: 1bb5f431)
README.md
CHANGED
@@ -1,12 +1,146 @@
 ---
 title: Regard
-emoji:
-colorFrom:
-colorTo:
+emoji: 🤗
+colorFrom: green
+colorTo: purple
 sdk: gradio
-sdk_version: 3.
+sdk_version: 3.0.2
 app_file: app.py
 pinned: false
+tags:
+- evaluate
+- measurement
+description: >-
+  Regard aims to measure language polarity towards and social perceptions of a demographic (e.g. gender, race, sexual orientation).
 ---
 
-
+# Measurement Card for Regard
+
+
+## Measurement Description
+
+The `regard` measurement returns the estimated language polarity towards and social perceptions of a demographic (e.g. gender, race, sexual orientation).
+
+It uses a model trained on labelled data from the paper ["The Woman Worked as a Babysitter: On Biases in Language Generation" (EMNLP 2019)](https://arxiv.org/abs/1909.01326).
+
+## How to Use
+
+In its `compare` configuration, this measurement takes two lists of strings as input, enabling a comparison of the estimated polarity between the two groups.
+
+```python
+>>> regard = evaluate.load("regard", "compare", module_type="measurement")
+>>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
+>>> group2 = ['xyz are known for making too much noise', 'xyz are described as often violent']
+>>> regard.compute(data = group1, references = group2)
+```
+
+### Inputs
+- **data** (list of `str`): prediction/candidate sentences, e.g. sentences describing a given demographic group.
+- **references** (list of `str`) (optional): reference/comparison sentences, e.g. sentences describing a different demographic group to compare against.
+- **aggregation** (`str`) (optional): determines the type of aggregation performed (see the sketch after this list).
+If set to `None`, the difference between the regard scores for the two groups is returned.
+Otherwise:
+    - `average` : returns the average regard for each category (negative, positive, neutral, other) for each group
+    - `maximum`: returns the maximum regard for each group
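+
+A minimal sketch (with hypothetical per-sentence scores) of how the two aggregations relate to the raw per-sentence output:
+
+```python
+>>> scores = {'negative': [0.97, 0.02]}  # illustrative per-sentence scores for one category
+>>> {k: sum(v) / len(v) for k, v in scores.items()}  # aggregation = "average"
+{'negative': 0.495}
+>>> {k: max(v) for k, v in scores.items()}  # aggregation = "maximum"
+{'negative': 0.97}
+```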
+
+### Output Values
+
+**With a single input**:
+
+`regard` : the regard scores of each string in the input list (if no aggregation is specified)
+```python
+{'neutral': 0.95, 'positive': 0.02, 'negative': 0.02, 'other': 0.01}
+{'negative': 0.97, 'other': 0.02, 'neutral': 0.01, 'positive': 0.0}
+```
+
+`average_regard`: the average regard for each category (negative, positive, neutral, other) (if `aggregation` = `average`)
+```python
+{'neutral': 0.48, 'positive': 0.01, 'negative': 0.5, 'other': 0.01}
+```
+
+`max_regard`: the maximum regard across all input strings (if `aggregation` = `maximum`)
+```python
+{'neutral': 0.95, 'positive': 0.024, 'negative': 0.972, 'other': 0.019}
+```
+
+**With two lists of inputs**:
+
+By default, this measurement outputs a dictionary with a `regard_difference` key, mapping each category (negative, positive, neutral, other) to the difference in average regard between the two groups:
+
+```python
+{'neutral': 0.35, 'negative': -0.36, 'other': 0.01, 'positive': 0.01}
+```
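+
+Concretely, a minimal sketch of how this difference is computed (mirroring the implementation in `regard.py`, with illustrative numbers):
+
+```python
+>>> from statistics import mean
+>>> data_scores = {'negative': [0.97, 0.02]}  # per-sentence scores for `data`
+>>> ref_scores = {'negative': [0.98, 0.95]}   # per-sentence scores for `references`
+>>> {k: round(mean(v) - mean(ref_scores[k]), 2) for k, v in data_scores.items()}
+{'negative': -0.47}
+```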
+
+With the `aggregation='maximum'` option, this measurement will output the maximum regard for each group:
+
+```python
+{'negative': 0.98, 'other': 0.04, 'neutral': 0.03, 'positive': 0.0}
+```
+
+With the `aggregation='average'` option, this measurement will output the average regard for each category (negative, positive, neutral, other):
+
+```python
+{'neutral': 0.37, 'negative': 0.57, 'other': 0.05, 'positive': 0.01}
+```
+
+### Examples
+
+Example 1 (single input):
+
+```python
+>>> regard = evaluate.load("regard")
+>>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
+>>> results = regard.compute(data = group1)
+>>> for d in results['regard']:
+...     print({l['label']: round(l['score'],2) for l in d})
+{'neutral': 0.95, 'positive': 0.02, 'negative': 0.02, 'other': 0.01}
+{'negative': 0.97, 'other': 0.02, 'neutral': 0.01, 'positive': 0.0}
+```
+
+Example 2 (comparison mode):
+```python
+>>> regard = evaluate.load("regard", "compare")
+>>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
+>>> group2 = ['xyz are known for making too much noise', 'xyz are described as often violent']
+>>> results = regard.compute(data = group1, references = group2)
+>>> print({k: round(v, 2) for k, v in results['regard_difference'].items()})
+{'neutral': 0.46, 'positive': 0.01, 'negative': -0.46, 'other': -0.01}
+```
+
+Example 3 (returns the maximum regard score per category):
+```python
+>>> regard = evaluate.load("regard", "compare")
+>>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
+>>> group2 = ['xyz are known for making too much noise', 'xyz are described as often violent']
+>>> results = regard.compute(data = group1, references = group2, aggregation = "maximum")
+>>> print({k: round(v, 2) for k, v in results['max_data_regard'].items()})
+{'neutral': 0.95, 'positive': 0.02, 'negative': 0.97, 'other': 0.02}
+>>> print({k: round(v, 2) for k, v in results['max_references_regard'].items()})
+{'negative': 0.98, 'other': 0.04, 'neutral': 0.03, 'positive': 0.0}
+```
+
+Example 4 (returns the average regard score):
+```python
+>>> regard = evaluate.load("regard", "compare")
+>>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
+>>> group2 = ['xyz are known for making too much noise', 'xyz are described as often violent']
+>>> results = regard.compute(data = group1, references = group2, aggregation = "average")
+>>> print({k: round(v, 2) for k, v in results['average_data_regard'].items()})
+{'neutral': 0.48, 'positive': 0.01, 'negative': 0.5, 'other': 0.01}
+>>> print({k: round(v, 2) for k, v in results['average_references_regard'].items()})
+{'negative': 0.96, 'other': 0.02, 'neutral': 0.02, 'positive': 0.0}
+```
+
+## Citation(s)
+@article{https://doi.org/10.48550/arxiv.1909.01326,
+  doi = {10.48550/ARXIV.1909.01326},
+  url = {https://arxiv.org/abs/1909.01326},
+  author = {Sheng, Emily and Chang, Kai-Wei and Natarajan, Premkumar and Peng, Nanyun},
+  title = {The Woman Worked as a Babysitter: On Biases in Language Generation},
+  publisher = {arXiv},
+  year = {2019}
+}
+
+
+## Further References
+- [`nlg-bias` library](https://github.com/ewsheng/nlg-bias/)
app.py
ADDED
@@ -0,0 +1,6 @@
+import evaluate
+from evaluate.utils import launch_gradio_widget
+
+
+module = evaluate.load("regard")
+launch_gradio_widget(module)
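
A minimal sketch (assumed usage, mirroring the README examples rather than this Space's code) of exercising the loaded module directly instead of launching the widget:

```python
import evaluate

# Load the default config and score a single list of sentences.
module = evaluate.load("regard")
results = module.compute(data=["xyz are described as mean"])
print(results["regard"])  # per-sentence lists of label/score dicts
```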
regard.py
ADDED
@@ -0,0 +1,180 @@
+# Copyright 2020 The HuggingFace Evaluate Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" Regard measurement. """
+
+from collections import defaultdict
+from operator import itemgetter
+from statistics import mean
+
+import datasets
+from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
+
+import evaluate
+
+
+logger = evaluate.logging.get_logger(__name__)
+
+
+_CITATION = """
+@article{https://doi.org/10.48550/arxiv.1909.01326,
+  doi = {10.48550/ARXIV.1909.01326},
+  url = {https://arxiv.org/abs/1909.01326},
+  author = {Sheng, Emily and Chang, Kai-Wei and Natarajan, Premkumar and Peng, Nanyun},
+  title = {The Woman Worked as a Babysitter: On Biases in Language Generation},
+  publisher = {arXiv},
+  year = {2019}
+}
+"""
+
+_DESCRIPTION = """\
+Regard aims to measure language polarity towards and social perceptions of a demographic (e.g. gender, race, sexual orientation).
+"""
+
+_KWARGS_DESCRIPTION = """
+Compute the regard of the input sentences.
+
+Args:
+    `data` (list of str): prediction/candidate sentences, e.g. sentences describing a given demographic group.
+    `references` (list of str) (optional): reference/comparison sentences, e.g. sentences describing a different demographic group to compare against.
+    `aggregation` (str) (optional): determines the type of aggregation performed.
+        If set to `None`, the difference between the regard scores for the two groups is returned.
+        Otherwise:
+            - 'average' : returns the average regard for each category (negative, positive, neutral, other) for each group
+            - 'maximum': returns the maximum regard for each group
+
+Returns:
+    With only `data` as input (default config):
+        `regard` : the regard scores of each string in the input list (if no aggregation is specified)
+        `average_regard`: the average regard for each category (negative, positive, neutral, other) (if `aggregation` = `average`)
+        `max_regard`: the maximum regard across all input strings (if `aggregation` = `maximum`)
+    With `data` and `references` as input (`compare` config):
+        `regard_difference`: the difference between the regard scores for the two groups (if no aggregation is specified)
+        `average_data_regard` and `average_references_regard`: the average regard for each category (negative, positive, neutral, other) (if `aggregation` = `average`)
+        `max_data_regard` and `max_references_regard`: the maximum regard for each group (if `aggregation` = `maximum`)
+
+Examples:
+
+Example 1 (single input):
+>>> regard = evaluate.load("regard")
+>>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
+>>> results = regard.compute(data = group1)
+>>> for d in results['regard']:
+...     print({l['label']: round(l['score'],2) for l in d})
+{'neutral': 0.95, 'positive': 0.02, 'negative': 0.02, 'other': 0.01}
+{'negative': 0.97, 'other': 0.02, 'neutral': 0.01, 'positive': 0.0}
+
+Example 2 (comparison mode):
+>>> regard = evaluate.load("regard", "compare")
+>>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
+>>> group2 = ['xyz are known for making too much noise', 'xyz are described as often violent']
+>>> results = regard.compute(data = group1, references = group2)
+>>> print({k: round(v, 2) for k, v in results['regard_difference'].items()})
+{'neutral': 0.46, 'positive': 0.01, 'negative': -0.46, 'other': -0.01}
+
+Example 3 (returns the maximum regard score per category):
+>>> regard = evaluate.load("regard", "compare")
+>>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
+>>> group2 = ['xyz are known for making too much noise', 'xyz are described as often violent']
+>>> results = regard.compute(data = group1, references = group2, aggregation = "maximum")
+>>> print({k: round(v, 2) for k, v in results['max_data_regard'].items()})
+{'neutral': 0.95, 'positive': 0.02, 'negative': 0.97, 'other': 0.02}
+>>> print({k: round(v, 2) for k, v in results['max_references_regard'].items()})
+{'negative': 0.98, 'other': 0.04, 'neutral': 0.03, 'positive': 0.0}
+
+Example 4 (returns the average regard score):
+>>> regard = evaluate.load("regard", "compare")
+>>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
+>>> group2 = ['xyz are known for making too much noise', 'xyz are described as often violent']
+>>> results = regard.compute(data = group1, references = group2, aggregation = "average")
+>>> print({k: round(v, 2) for k, v in results['average_data_regard'].items()})
+{'neutral': 0.48, 'positive': 0.01, 'negative': 0.5, 'other': 0.01}
+>>> print({k: round(v, 2) for k, v in results['average_references_regard'].items()})
+{'negative': 0.96, 'other': 0.02, 'neutral': 0.02, 'positive': 0.0}
+"""
+
+
+def regard(group, regard_classifier):
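+    # Run the classifier over the whole group and collect, per regard
+    # category (label), the list of per-sentence scores.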
+    group_scores = defaultdict(list)
+    group_regard = regard_classifier(group)
+    for pred in group_regard:
+        for pred_score in pred:
+            group_scores[pred_score["label"]].append(pred_score["score"])
+    return group_regard, dict(group_scores)
+
+
+@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+class Regard(evaluate.Measurement):
+    def _info(self):
+        if self.config_name not in ["compare", "default"]:
+            raise KeyError("You should supply a configuration name selected in " '["compare", "default"]')
+        return evaluate.MeasurementInfo(
+            module_type="measurement",
+            description=_DESCRIPTION,
+            citation=_CITATION,
+            inputs_description=_KWARGS_DESCRIPTION,
+            features=datasets.Features(
+                {
+                    "data": datasets.Value("string", id="sequence"),
+                    "references": datasets.Value("string", id="sequence"),
+                }
+                if self.config_name == "compare"
+                else {
+                    "data": datasets.Value("string", id="sequence"),
+                }
+            ),
+            codebase_urls=[],
+            reference_urls=[],
+        )
+
+    def _download_and_prepare(self, dl_manager):
+        regard_tokenizer = AutoTokenizer.from_pretrained("sasha/regardv3")
+        regard_model = AutoModelForSequenceClassification.from_pretrained("sasha/regardv3")
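+        # top_k=4 makes the pipeline return the scores for all four regard
+        # categories (negative, positive, neutral, other) for every sentence.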
+        self.regard_classifier = pipeline(
+            "text-classification", model=regard_model, top_k=4, tokenizer=regard_tokenizer, truncation=True
+        )
+
+    def _compute(
+        self,
+        data,
+        references=None,
+        aggregation=None,
+    ):
+        if self.config_name == "compare":
+            pred_scores, pred_regard = regard(data, self.regard_classifier)
+            ref_scores, ref_regard = regard(references, self.regard_classifier)
+            pred_mean = {k: mean(v) for k, v in pred_regard.items()}
+            pred_max = {k: max(v) for k, v in pred_regard.items()}
+            ref_mean = {k: mean(v) for k, v in ref_regard.items()}
+            ref_max = {k: max(v) for k, v in ref_regard.items()}
+            if aggregation == "maximum":
+                return {
+                    "max_data_regard": pred_max,
+                    "max_references_regard": ref_max,
+                }
+            elif aggregation == "average":
+                return {"average_data_regard": pred_mean, "average_references_regard": ref_mean}
+            else:
+                return {"regard_difference": {key: pred_mean[key] - ref_mean.get(key, 0) for key in pred_mean}}
+        else:
+            pred_scores, pred_regard = regard(data, self.regard_classifier)
+            pred_mean = {k: mean(v) for k, v in pred_regard.items()}
+            pred_max = {k: max(v) for k, v in pred_regard.items()}
+            if aggregation == "maximum":
+                return {"max_regard": pred_max}
+            elif aggregation == "average":
+                return {"average_regard": pred_mean}
+            else:
+                return {"regard": pred_scores}
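
For reference, a minimal sketch (an assumed direct use of the same `sasha/regardv3` checkpoint the module downloads above, not part of this Space's code) of querying the underlying classifier without `evaluate`:

```python
from transformers import pipeline

# Same checkpoint and settings as in _download_and_prepare above.
classifier = pipeline("text-classification", model="sasha/regardv3", top_k=4, truncation=True)
print(classifier(["xyz are described as mean"]))  # four label/score dicts per sentence
```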
requirements.txt
ADDED
@@ -0,0 +1,2 @@
+git+https://github.com/huggingface/evaluate.git@1bb5f431d16a789950784660b26c650e1ab0e3cc
+transformers