lucy3/roberta_social_roles


lucy3 committed on Jun 3, 2024

Commit 5c3b1ee • 1 Parent(s): 8e17b62

Update README.md

Files changed (1)

README.md +65 -3

@@ -1,3 +1,65 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+---
+# Model Card for RoBERTa Social Roles Classifier
+This model is a token classifier that extracts social roles from explicit expressions of self-identification in sentences, e.g. *I am a **designer**,
+**entrepreneur**, and **mother***.
+## Model Details
+### Model Description
+We continue pretraining [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base) for 10 epochs on individuals' `about` pages, which are a subset of Common Crawl and can be accessed [here](https://huggingface.co/datasets/allenai/aboutme).
+Then, we finetune on hand-annotated token-level labels as described in [this paper](https://arxiv.org/abs/2401.06408). We use a train-dev-test split of 600/200/200 labeled sentences.
+Our definition of "roles" or "occupations" on `about` pages is any singular noun referring to the subject of the bio. These can include roles and occupations that the subject held in the past, e.g. *Throughout my life I have been a **teacher**, a startup **founder**, and a seashell **collector***.
+Subject of the `about` page
+- First person biographies: the subject is I, me, my, mine.
+- Third person biographies: we assume the bio’s subject is the main person referenced in the
+excerpt sentence.
+Positive examples of self-identification
+- I am a **chef**, **author**, and **mom** living in Virginia.
+- As an award-winning **geologist**, Sebastian has given talks around the world.
+- **Knitter**, **blogger**, & **dreamer**.
+In the last example above, the sentence’s relation to the subject of the bio is implied rather than
+stated.
+Negative examples
+- My **wife** loves beekeeping as well.
+- Janice works hard to accommodate every **client**.
+**Language(s) (NLP):** English
+**License:** Apache 2.0
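A minimal usage sketch, assuming the checkpoint loads through the standard `transformers` token-classification pipeline under the repository id `lucy3/roberta_social_roles`. The card does not document the label names, so `extract_roles` and its `keep_label` parameter are hypothetical helpers; inspect `tagger.model.config.id2label` to find out which label marks a role before filtering.

```python
from typing import Dict, List, Optional


def extract_roles(entities: List[Dict], keep_label: Optional[str] = None) -> List[str]:
    """Collect the text of each predicted span from aggregated pipeline output.

    `entities` is the list of dicts returned by a token-classification
    pipeline run with an aggregation strategy (keys include "entity_group"
    and "word"). If `keep_label` is given, only spans with that label are kept.
    """
    roles = []
    for ent in entities:
        if keep_label is not None and ent.get("entity_group") != keep_label:
            continue
        roles.append(ent["word"].strip())
    return roles


def tag_sentence(sentence: str, model_id: str = "lucy3/roberta_social_roles") -> List[str]:
    """Run the tagger on one sentence and return the predicted span texts."""
    # Imported lazily so extract_roles can be used without transformers installed.
    from transformers import pipeline

    tagger = pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # merge word pieces into whole spans
    )
    # Assumption: the label scheme is not documented in this card, so no
    # label filter is applied here; check tagger.model.config.id2label.
    return extract_roles(tagger(sentence))


if __name__ == "__main__":
    # Downloads the model on first use.
    print(tag_sentence("I am a chef, author, and mom living in Virginia."))
```

If the unfiltered output also includes non-role spans, pass the role label found in `id2label` as `keep_label`.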
+## Uses
+We use tagged social roles in web pages to assess the social impact of LLM pretraining data curation decisions. Text linked to descriptions of its creators can also
+facilitate other areas of research, including self-presentation and language variation.
+## Evaluation
+On our test set, we achieve a precision score of 0.856, recall score of 0.945, and F1 score of 0.898.
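As a quick consistency check on the reported scores, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the two reported values; the function name `f1_score` is illustrative and not part of the released code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported test-set precision and recall from the card.
print(round(f1_score(0.856, 0.945), 3))  # 0.898
```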
+## Citation
+```
+@misc{lucy2024aboutme,
+      title={AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters},
+      author={Li Lucy and Suchin Gururangan and Luca Soldaini and Emma Strubell and David Bamman and Lauren Klein and Jesse Dodge},
+      year={2024},
+      eprint={2401.06408},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+}
+```
+## Contact
+`[email protected]`