lucy3 committed on
Commit 5c3b1ee
1 parent: 8e17b62

Update README.md

Files changed (1)
  1. README.md +65 -3
README.md CHANGED
@@ -1,3 +1,65 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ ---
+
+ # Model Card for RoBERTa Social Roles Classifier
+
+ This model is a token classifier that extracts social roles from explicit expressions of self-identification in sentences, e.g. *I am a **designer**, **entrepreneur**, and **mother***.
+
+ ## Model Details
+
+ ### Model Description
+
+ We continued pretraining [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base) for 10 epochs on individuals' `about` pages, a subset of Common Crawl available as the [AboutMe dataset](https://huggingface.co/datasets/allenai/aboutme).
+
+ We then fine-tuned the model on hand-annotated token-level labels, as described in [this paper](https://arxiv.org/abs/2401.06408), using a train/dev/test split of 600/200/200 labeled sentences.
+
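The 600/200/200 split above could be reproduced along the following lines. This is a minimal sketch, not the authors' procedure: the `split_sentences` helper, the fixed seed, and the shuffling strategy are assumptions for illustration, and the placeholder sentences stand in for the 1,000 annotated ones.

```python
import random


def split_sentences(sentences, sizes=(600, 200, 200), seed=0):
    """Shuffle labeled sentences and cut them into train/dev/test lists.

    Hypothetical helper: the seed and the shuffle are assumptions, not the
    documented procedure.
    """
    if len(sentences) != sum(sizes):
        raise ValueError("expected exactly %d sentences" % sum(sizes))
    shuffled = list(sentences)  # copy so the caller's list is not mutated
    random.Random(seed).shuffle(shuffled)
    n_train, n_dev, _ = sizes
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


# Placeholder data standing in for the hand-annotated sentences.
sentences = [f"sentence {i}" for i in range(1000)]
train, dev, test = split_sentences(sentences)
print(len(train), len(dev), len(test))
```

Fixing the seed keeps the three subsets stable across runs, so dev and test sentences never leak into training.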
+ We define "roles" or "occupations" on `about` pages as any singular noun referring to the subject of the bio. This includes roles and occupations the subject held in the past, e.g. *Throughout my life I have been a **teacher**, a startup **founder**, and a seashell **collector***.
+
+ Subject of the `about` page
+ - First-person biographies: the subject is I, me, my, mine.
+ - Third-person biographies: we assume the bio’s subject is the main person referenced in the excerpt sentence.
+
+ Positive examples of self-identification
+ - I am a **chef**, **author**, and **mom** living in Virginia.
+ - As an award-winning **geologist**, Sebastian has given talks around the world.
+ - **Knitter**, **blogger**, & **dreamer**.
+
+ In the last example above, the sentence’s relation to the subject of the bio is implied rather than stated.
+
+ Negative examples, where the bolded noun does not refer to the subject of the bio
+ - My **wife** loves beekeeping as well.
+ - Janice works hard to accommodate every **client**.
+
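To make the labeling scheme concrete, the examples above can be written as per-token labels. The sketch below is illustrative only: the binary `ROLE`/`O` tag names, the whitespace-style tokenization, and the `extract_roles` helper are assumptions, not the model's actual label set or tokenizer.

```python
def extract_roles(tokens, labels):
    """Return the tokens tagged as social roles.

    Assumes one label per token, with "ROLE" marking a role noun and "O"
    marking everything else (hypothetical tag names).
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must be the same length")
    return [tok for tok, lab in zip(tokens, labels) if lab == "ROLE"]


# Positive example: the nouns refer to the subject of the bio.
pos_tokens = ["I", "am", "a", "chef", ",", "author", ",", "and", "mom",
              "living", "in", "Virginia", "."]
pos_labels = ["O", "O", "O", "ROLE", "O", "ROLE", "O", "O", "ROLE",
              "O", "O", "O", "O"]

# Negative example: "wife" names a role, but not the subject's own,
# so no token is tagged.
neg_tokens = ["My", "wife", "loves", "beekeeping", "as", "well", "."]
neg_labels = ["O"] * len(neg_tokens)

pos_roles = extract_roles(pos_tokens, pos_labels)
neg_roles = extract_roles(neg_tokens, neg_labels)
print(pos_roles)
print(neg_roles)
```

The negative example shows why this is a token-level task rather than a dictionary lookup: whether a noun counts depends on who it refers to in the sentence.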
+ **Language(s) (NLP):** English
+
+ **License:** Apache 2.0
+
+ ## Uses
+
+ We use social roles tagged in web pages to assess the social impact of LLM pretraining data curation decisions. Text linked to descriptions of its creators can also facilitate other areas of research, including self-presentation and language variation.
+
+ ## Evaluation
+
+ On our test set, the model achieves a precision of 0.856, a recall of 0.945, and an F1 score of 0.898.
+
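As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported figures can be verified directly. The snippet below is a sketch for the reader; the `f1_score` helper is defined here for illustration and is not part of any released code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision, recall = 0.856, 0.945  # values reported on the test set
f1 = f1_score(precision, recall)
print(round(f1, 3))
```

The result rounds to 0.898, which agrees with the reported F1 score.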
+ ## Citation
+
+ ```
+ @misc{lucy2024aboutme,
+   title={AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters},
+   author={Li Lucy and Suchin Gururangan and Luca Soldaini and Emma Strubell and David Bamman and Lauren Klein and Jesse Dodge},
+   year={2024},
+   eprint={2401.06408},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```
+
+ ## Contact
+