---
|
license: apache-2.0 |
|
--- |
|
|
|
# Model Card for RoBERTa Social Roles Classifier |
|
|
|
This model is a token classifier that extracts social roles from explicit expressions of self-identification in sentences, e.g. *I am a **designer**, **entrepreneur**, and **mother***.
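
For quick experimentation, a minimal inference sketch with the `transformers` pipeline is below. The checkpoint ID is a placeholder (this card does not name the hosted repository), and the role label in the comment is assumed rather than read from the model's actual label map:

```python
from transformers import pipeline

tagger = pipeline(
    "token-classification",
    model="your-org/roberta-social-roles",  # placeholder, not a real repo ID
    aggregation_strategy="simple",          # merge subword pieces into word-level spans
)

for span in tagger("I am a chef, author, and mom living in Virginia."):
    print(span["word"], span["entity_group"], round(span["score"], 3))
# Expected (illustrative): "chef", "author", and "mom" tagged as roles.
```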
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
We continued pretraining [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base) for 10 epochs on individuals' `about` pages, a subset of Common Crawl that can be accessed [here](https://huggingface.co/datasets/allenai/aboutme).
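
As a rough illustration of this step, the sketch below runs masked-language-model training over the AboutMe pages with `transformers`. The `text` field name, the `train` split, and all hyperparameters other than the 10 epochs are assumptions, not the released recipe:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")
model = AutoModelForMaskedLM.from_pretrained("FacebookAI/roberta-base")

# Assumed: the AboutMe release exposes a train split with a plain text field.
pages = load_dataset("allenai/aboutme", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = pages.map(tokenize, batched=True, remove_columns=pages.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-aboutme", num_train_epochs=10),
    train_dataset=tokenized,
    # Standard dynamic masking for masked-language-model training.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```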
|
|
|
Then, we fine-tuned it on hand-annotated token-level labels as described in [this paper](https://arxiv.org/abs/2401.06408), using a train-dev-test split of 600/200/200 labeled sentences.
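
A hedged sketch of the fine-tuning stage follows. The BIO-style `ROLE` tag set, the toy training example, and the subword-alignment helper are illustrative assumptions; the paper's exact label scheme and training code may differ:

```python
from datasets import Dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

labels = ["O", "B-ROLE", "I-ROLE"]  # assumed tag set, not the paper's exact scheme
base = "FacebookAI/roberta-base"    # in practice, the domain-adapted checkpoint
tokenizer = AutoTokenizer.from_pretrained(base, add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(base, num_labels=len(labels))

# One toy sentence standing in for the 600 hand-annotated training sentences.
train = Dataset.from_dict({
    "tokens": [["I", "am", "a", "chef", ",", "author", ",", "and", "mom", "."]],
    "tags":   [[0,   0,    0,   1,      0,   1,        0,   0,     1,     0]],
})

def align_labels(batch):
    # Label only the first subword of each word; mask the rest with -100
    # so they are ignored by the cross-entropy loss.
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = []
    for i, tags in enumerate(batch["tags"]):
        prev, row = None, []
        for wid in enc.word_ids(batch_index=i):
            row.append(-100 if wid is None or wid == prev else tags[wid])
            prev = wid
        enc["labels"].append(row)
    return enc

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-social-roles"),
    train_dataset=train.map(align_labels, batched=True, remove_columns=["tokens", "tags"]),
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```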
|
|
|
Our definition of a “role” or “occupation” on `about` pages is any singular noun referring to the subject of the bio. These include roles the subject actively held in the past, e.g. *Throughout my life I have been a **teacher**, a startup **founder**, and a seashell **collector***.
|
|
|
**Subject of the `about` page**
|
- First person biographies: the subject is I, me, my, mine. |
|
- Third person biographies: we assume the bio’s subject is the main person referenced in the excerpt sentence.
|
|
|
**Positive examples of self-identification**
|
- I am a **chef**, **author**, and **mom** living in Virginia. |
|
- As an award-winning **geologist**, Sebastian has given talks around the world. |
|
- **Knitter**, **blogger**, & **dreamer**. |
|
In the last example above, the sentence’s relation to the subject of the bio is implied rather than stated.
|
|
|
**Negative examples**
|
- My **wife** loves beekeeping as well. |
|
- Janice works hard to accommodate every **client**. |
|
|
|
|
|
- **Language(s) (NLP):** English
- **License:** Apache 2.0
|
|
|
## Uses |
|
|
|
We use tagged social roles in web pages to assess the social impact of LLM pretraining data curation decisions. Text linked to descriptions of its creators can also facilitate other areas of research, including self-presentation and language variation.
|
|
|
## Evaluation |
|
|
|
On our test set, the model achieves a precision of 0.856, a recall of 0.945, and an F1 score of 0.898.
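
The snippet below shows how such precision/recall/F1 numbers can be computed with the `seqeval` package, assuming BIO-tagged sequences; the toy tags here are illustrative, not drawn from the actual test set:

```python
from seqeval.metrics import f1_score, precision_score, recall_score

# Toy gold/predicted tag sequences for one sentence (illustrative only).
y_true = [["O", "O", "O", "B-ROLE", "O", "B-ROLE", "O", "O", "B-ROLE", "O"]]
y_pred = [["O", "O", "O", "B-ROLE", "O", "B-ROLE", "O", "O", "O", "O"]]

print(precision_score(y_true, y_pred))  # 1.0: both predicted spans match gold
print(recall_score(y_true, y_pred))     # ~0.667: one gold span was missed
print(f1_score(y_true, y_pred))         # 0.8
```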
|
|
|
## Citation |
|
|
|
```
@misc{lucy2024aboutme,
      title={AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters},
      author={Li Lucy and Suchin Gururangan and Luca Soldaini and Emma Strubell and David Bamman and Lauren Klein and Jesse Dodge},
      year={2024},
      eprint={2401.06408},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
|
|
|
## Contact |
|
|
|
`[email protected]` |