---
license: apache-2.0
---

# Model Card for RoBERTa Social Roles Classifier

This model is a token classifier that extracts social roles from explicit expressions of self-identification in sentences, e.g. *I am a **designer**, **entrepreneur**, and **mother***.

## Model Details

### Model Description

We continued pretraining [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base) for 10 epochs on individuals' `about` pages, a subset of Common Crawl that can be accessed [here](https://huggingface.co/datasets/allenai/aboutme). We then finetuned on hand-annotated token-level labels as described in [this paper](https://arxiv.org/abs/2401.06408), using a train-dev-test split of 600/200/200 labeled sentences.

Our definition of a "role" or "occupation" on `about` pages is any singular noun referring to the subject of the bio. Roles and occupations include ones the subject actively participated in in the past, e.g. *Throughout my life I have been a **teacher**, a startup **founder**, and a seashell **collector***.

Subject of the `about` page:
- First-person biographies: the subject is *I*, *me*, *my*, *mine*.
- Third-person biographies: we assume the bio's subject is the main person referenced in the excerpt sentence.

Positive examples of self-identification:
- I am a **chef**, **author**, and **mom** living in Virginia.
- As an award-winning **geologist**, Sebastian has given talks around the world.
- **Knitter**, **blogger**, & **dreamer**.

In the last example above, the sentence's relation to the subject of the bio is implied rather than stated.

Negative examples:
- My **wife** loves beekeeping as well.
- Janice works hard to accommodate every **client**.

**Language(s) (NLP):** English

**License:** Apache 2.0

## Uses

We use tagged social roles in web pages to assess the social impact of LLM pretraining data curation decisions.
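One way to apply the classifier is the Hugging Face `transformers` token-classification pipeline. In the sketch below the pipeline call is left commented because the repository id is a placeholder, and the span dictionaries are illustrative examples of the shape that pipeline returns (not real model output):

```python
# Loading sketch (standard transformers API; the model id is a placeholder):
# from transformers import pipeline
# tagger = pipeline("token-classification",
#                   model="<this-repository-id>",
#                   aggregation_strategy="simple")  # merge word pieces into spans
# spans = tagger("I am a chef, author, and mom living in Virginia.")

def extract_roles(spans):
    """Collect the surface forms of spans the classifier tagged as roles."""
    return [s["word"].strip() for s in spans]

# Illustrative spans in the shape the token-classification pipeline returns.
spans = [
    {"word": "chef",   "score": 0.99, "start": 7,  "end": 11},
    {"word": "author", "score": 0.98, "start": 13, "end": 19},
    {"word": "mom",    "score": 0.97, "start": 25, "end": 28},
]
print(extract_roles(spans))  # ['chef', 'author', 'mom']
```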
Text linked to descriptions of their creators can also facilitate other areas of research, including self-presentation and language variation.

## Evaluation

On our test set, we achieve a precision of 0.856, a recall of 0.945, and an F1 of 0.898.

## Citation

```
@misc{lucy2024aboutme,
      title={AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters},
      author={Li Lucy and Suchin Gururangan and Luca Soldaini and Emma Strubell and David Bamman and Lauren Klein and Jesse Dodge},
      year={2024},
      eprint={2401.06408},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Contact

`lucy3_li@berkeley.edu`
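As a quick arithmetic check, the F1 reported in the Evaluation section above is the harmonic mean of the reported precision and recall:

```python
# Consistency check of the reported test-set scores.
precision = 0.856
recall = 0.945

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.898
```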