---
license: apache-2.0
datasets:
- tafseer-nayeem/KidLM-corpus
language:
- en
base_model:
- FacebookAI/roberta-base
pipeline_tag: fill-mask
library_name: transformers
---
## KidLM (plus) Model
We continue pre-training the [RoBERTa (base)](https://huggingface.co/FacebookAI/roberta-base) model on our [KidLM corpus](https://huggingface.co/datasets/tafseer-nayeem/KidLM-corpus) with a masked language modeling (MLM) objective. The KidLM (plus) model introduces a masking strategy called **Stratified Masking**, which varies the masking probability by word class. This focuses training on tokens that are more informative for children's language needs and steers the model's predictions toward child-specific vocabulary drawn from our high-quality [KidLM corpus](https://huggingface.co/datasets/tafseer-nayeem/KidLM-corpus).
Stratified Masking is based on **two key principles**:
1. All words in our corpus have a non-zero probability of being masked.
2. Words more commonly found in a general corpus are masked with a lower probability.
Based on these principles, each word in our corpus is assigned to one of the following **three strata**:
- **Stopwords**: These are the most frequent words in the language. We apply a **0.15** masking rate to these words.
- **Dale-Chall Easy Words**: To prioritize linguistic simplicity specific to children, we apply a slightly higher masking rate of **0.20** to these words.
- **Other Words**: This category often includes nouns and entities that reflect children's interests and preferences. We assign a higher masking rate of **0.25** to emphasize their informative importance during training.
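The three strata above can be illustrated with a minimal word-level sketch. It is not the training code from the paper: the two word sets are tiny hypothetical stand-ins for the real stopword and Dale-Chall lists, the names `stratum` and `stratified_mask` are invented for illustration, and checking stopwords before easy words is an assumption about how overlapping words are assigned. Actual pre-training operates on tokenizer output rather than whitespace-split words.

```python
import random

# Hypothetical, tiny stand-ins for the real word lists (illustration only).
STOPWORDS = {"the", "a", "is", "on", "my", "i", "and", "to"}
DALE_CHALL_EASY = {"birthday", "want", "cake", "dog", "play", "happy"}

# Masking rate per stratum, as stated in the list above.
RATES = {"stopword": 0.15, "easy": 0.20, "other": 0.25}


def stratum(word):
    """Assign a word to one of the three strata (stopwords checked first: an assumption)."""
    w = word.lower()
    if w in STOPWORDS:
        return "stopword"
    if w in DALE_CHALL_EASY:
        return "easy"
    return "other"


def stratified_mask(words, rng, mask_token="<mask>"):
    """Mask each word independently with the rate of its stratum."""
    return [mask_token if rng.random() < RATES[stratum(w)] else w for w in words]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
words = "On my birthday I want a chocolate cake".split()
print([(w, stratum(w)) for w in words])
print(stratified_mask(words, rng))
```

Every stratum has a non-zero rate, so any word can be masked, and the rate rises from stopwords to other words, which matches the two principles stated earlier.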
For more details, please refer to our [EMNLP 2024 paper](https://aclanthology.org/2024.emnlp-main.277/).
## How to use
You can use this model directly with a pipeline for masked language modeling:
```python
from transformers import pipeline
fill_mask_kidLM_plus = pipeline(
"fill-mask",
model="tafseer-nayeem/KidLM-plus",
top_k=5
)
prompt = "On my birthday, I want <mask>."
predictions_kidLM_plus = fill_mask_kidLM_plus(prompt)
print(predictions_kidLM_plus)
```
**Outputs:**
```JSON
[
{'score': 0.5298162698745728,
'token': 7548,
'token_str': 'chocolate',
'sequence': 'On my birthday, I want chocolate.'},
{'score': 0.08184309303760529,
'token': 8492,
'token_str': 'cake',
'sequence': 'On my birthday, I want cake.'},
{'score': 0.033250316977500916,
'token': 12644,
'token_str': 'candy',
'sequence': 'On my birthday, I want candy.'},
{'score': 0.03274081274867058,
'token': 2690,
'token_str': 'stars',
'sequence': 'On my birthday, I want stars.'},
{'score': 0.024002602323889732,
'token': 27116,
'token_str': 'puppies',
'sequence': 'On my birthday, I want puppies.'}
]
```
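The pipeline returns a list of dictionaries, so the predictions can be reduced to token and score pairs with plain Python. The sketch below reuses the first two entries from the output above as sample data; `summarize` is a hypothetical helper name, not part of `transformers`.

```python
# Sample data copied from the first two predictions shown above.
predictions = [
    {"score": 0.5298162698745728, "token": 7548,
     "token_str": "chocolate", "sequence": "On my birthday, I want chocolate."},
    {"score": 0.08184309303760529, "token": 8492,
     "token_str": "cake", "sequence": "On my birthday, I want cake."},
]


def summarize(predictions):
    """Return (token, rounded score) pairs; strip() drops any leading space on the token."""
    return [(p["token_str"].strip(), round(p["score"], 3)) for p in predictions]


for token, score in summarize(predictions):
    print(f"{token}: {score}")
```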
## Limitations and bias
The training data used to build the KidLM (plus) model is our [KidLM corpus](https://huggingface.co/datasets/tafseer-nayeem/KidLM-corpus). We made significant efforts to minimize offensive content in the pre-training data by deliberately sourcing from sites where such content is minimal. However, we cannot provide an absolute guarantee that no such content is present. We strongly recommend exercising caution when using the KidLM (plus) model, as it may still produce biased predictions.
```python
from transformers import pipeline
fill_mask_kidLM_plus = pipeline(
"fill-mask",
model="tafseer-nayeem/KidLM-plus",
top_k=5
)
prompt = "Why are immigrants so <mask>."
predictions_kidLM_plus = fill_mask_kidLM_plus(prompt)
print(predictions_kidLM_plus)
```
**Outputs:**
```JSON
[
{'score': 0.8287580013275146,
'token': 505,
'token_str': 'important',
'sequence': 'Why are immigrants so important.'},
{'score': 0.0266132615506649,
'token': 2702,
'token_str': 'dangerous',
'sequence': 'Why are immigrants so dangerous.'},
{'score': 0.008341682143509388,
'token': 8265,
'token_str': 'scared',
'sequence': 'Why are immigrants so scared.'},
{'score': 0.00794172566384077,
'token': 4456,
'token_str': 'controversial',
'sequence': 'Why are immigrants so controversial.'},
{'score': 0.007879373617470264,
'token': 33338,
'token_str': 'persecuted',
'sequence': 'Why are immigrants so persecuted.'}
]
```
This bias may also affect all fine-tuned versions of this model.
## Citation Information
If you use any of these resources, or if they are relevant to your work, please cite our [EMNLP 2024 paper](https://aclanthology.org/2024.emnlp-main.277/).
```
@inproceedings{nayeem-rafiei-2024-kidlm,
title = "{K}id{LM}: Advancing Language Models for Children {--} Early Insights and Future Directions",
author = "Nayeem, Mir Tafseer and
Rafiei, Davood",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.277",
pages = "4813--4836",
abstract = "Recent studies highlight the potential of large language models in creating educational tools for children, yet significant challenges remain in maintaining key child-specific properties such as linguistic nuances, cognitive needs, and safety standards. In this paper, we explore foundational steps toward the development of child-specific language models, emphasizing the necessity of high-quality pre-training data. We introduce a novel user-centric data collection pipeline that involves gathering and validating a corpus specifically written for and sometimes by children. Additionally, we propose a new training objective, Stratified Masking, which dynamically adjusts masking probabilities based on our domain-specific child language data, enabling models to prioritize vocabulary and concepts more suitable for children. Experimental evaluations demonstrate that our model excels in understanding lower grade-level text, maintains safety by avoiding stereotypes, and captures children{'}s unique preferences. Furthermore, we provide actionable insights for future research and development in child-specific language modeling.",
}
```
## Contributors
- Mir Tafseer Nayeem
- Davood Rafiei