metadata
license: apache-2.0
datasets:
  - tafseer-nayeem/KidLM-corpus
language:
  - en
base_model:
  - FacebookAI/roberta-base
pipeline_tag: fill-mask
library_name: transformers

KidLM (plus) Model

We continue pre-training the RoBERTa (base) model on our KidLM corpus using a masked language modeling (MLM) objective. The KidLM (plus) model introduces a masking strategy called Stratified Masking, which varies the masking probability by word class. This focuses the model on tokens that are more informative for children's language needs and steers its predictions toward child-specific vocabulary drawn from our high-quality KidLM corpus.

To achieve this, Stratified Masking is introduced based on two key principles:

  1. All words in our corpus have a non-zero probability of being masked.
  2. Words more commonly found in a general corpus are masked with a lower probability.

Based on these principles, each word in our corpus is assigned to one of the following three strata:

  • Stopwords: These are the most frequent words in the language. We apply a 0.15 masking rate to these words.

  • Dale-Chall Easy Words: To prioritize linguistic simplicity specific to children, we apply a slightly higher masking rate of 0.20 to these words.

  • Other Words: This category often includes nouns and entities that reflect children's interests and preferences. We assign a higher masking rate of 0.25 to emphasize their informative importance during training.
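
The three strata above can be sketched as a simple lookup from word to masking rate. This is a minimal illustration, not the training code from the paper: the word lists are tiny hypothetical stand-ins for the real stopword and Dale-Chall lists, the function names are invented for this example, and it assumes that stopwords take precedence when a word appears in both lists and that masking is decided independently per whole word.

```python
import random

# Masking rates per stratum, as stated in the list above.
MASK_RATES = {"stopword": 0.15, "dale_chall_easy": 0.20, "other": 0.25}

# Hypothetical, tiny word lists for illustration only; the real stopword and
# Dale-Chall easy-word lists are far larger.
STOPWORDS = {"the", "a", "on", "my", "i", "is"}
DALE_CHALL_EASY = {"birthday", "want", "cake", "dog"}


def stratum_of(word):
    """Return the stratum name for a word (assumption: stopwords are checked first)."""
    w = word.lower()
    if w in STOPWORDS:
        return "stopword"
    if w in DALE_CHALL_EASY:
        return "dale_chall_easy"
    return "other"


def mask_probability(word):
    """Look up the masking rate of the stratum the word falls into."""
    return MASK_RATES[stratum_of(word)]


def stratified_mask(words, rng, mask_token="<mask>"):
    """Mask each word independently with the rate of its stratum."""
    return [mask_token if rng.random() < mask_probability(w) else w for w in words]


words = "On my birthday I want a chocolate cake".split()
print([(w, mask_probability(w)) for w in words])
print(stratified_mask(words, random.Random(0)))
```

Every word gets a non-zero rate, and words outside both lists get the highest one, which matches the two principles stated earlier.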

For more details, please refer to our EMNLP 2024 paper.

How to use

You can use this model directly with a pipeline for masked language modeling:

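from transformers import pipeline

# Load the fill-mask pipeline and return the five most likely completions.
fill_mask_kidLM_plus = pipeline(
    "fill-mask",
    model="tafseer-nayeem/KidLM-plus",
    top_k=5,
)

# The position to predict is marked with the <mask> token.
prompt = "On my birthday, I want <mask>."

predictions_kidLM_plus = fill_mask_kidLM_plus(prompt)

print(predictions_kidLM_plus)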

Outputs:

[
{'score': 0.5298162698745728, 
 'token': 7548, 
 'token_str': 'chocolate', 
 'sequence': 'On my birthday, I want chocolate.'}, 
{'score': 0.08184309303760529, 
 'token': 8492, 
 'token_str': 'cake', 
 'sequence': 'On my birthday, I want cake.'}, 
{'score': 0.033250316977500916, 
 'token': 12644, 
 'token_str': 'candy', 
 'sequence': 'On my birthday, I want candy.'}, 
{'score': 0.03274081274867058, 
 'token': 2690, 
 'token_str': 'stars', 
 'sequence': 'On my birthday, I want stars.'}, 
{'score': 0.024002602323889732, 
 'token': 27116, 
 'token_str': 'puppies', 
 'sequence': 'On my birthday, I want puppies.'}
]
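
When only the predicted words and their scores are needed, the list of dictionaries shown above can be reduced to compact pairs. The helper below is a small convenience sketch; `summarize_predictions` is a hypothetical name, not part of `transformers`, and the sample data is copied from the output above so the snippet runs without loading the model.

```python
def summarize_predictions(predictions, digits=3):
    """Return (token, rounded score) pairs from fill-mask pipeline output."""
    return [(p["token_str"].strip(), round(p["score"], digits)) for p in predictions]


# Two entries copied from the output above, used in place of a live model call.
sample = [
    {"score": 0.5298162698745728, "token": 7548, "token_str": "chocolate",
     "sequence": "On my birthday, I want chocolate."},
    {"score": 0.08184309303760529, "token": 8492, "token_str": "cake",
     "sequence": "On my birthday, I want cake."},
]

print(summarize_predictions(sample))
# [('chocolate', 0.53), ('cake', 0.082)]
```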

Limitations and bias

The KidLM (plus) model is trained on our KidLM corpus. We made significant efforts to minimize offensive content in the pre-training data by deliberately sourcing from sites where such content is minimal. However, we cannot guarantee that no such content is present. We strongly recommend exercising caution when using the KidLM (plus) model: it may still produce biased predictions, as in the example below.

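from transformers import pipeline

# Same pipeline setup as in the usage example above.
fill_mask_kidLM_plus = pipeline(
    "fill-mask",
    model="tafseer-nayeem/KidLM-plus",
    top_k=5,
)

# A prompt chosen to probe for stereotyped completions.
prompt = "Why are immigrants so <mask>."

predictions_kidLM_plus = fill_mask_kidLM_plus(prompt)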

print(predictions_kidLM_plus)

Outputs:

[
{'score': 0.8287580013275146, 
 'token': 505, 
 'token_str': 'important', 
 'sequence': 'Why are immigrants so important.'}, 
{'score': 0.0266132615506649, 
 'token': 2702, 
 'token_str': 'dangerous', 
 'sequence': 'Why are immigrants so dangerous.'}, 
{'score': 0.008341682143509388, 
 'token': 8265, 
 'token_str': 'scared', 
 'sequence': 'Why are immigrants so scared.'}, 
{'score': 0.00794172566384077, 
 'token': 4456, 
 'token_str': 'controversial', 
 'sequence': 'Why are immigrants so controversial.'}, 
{'score': 0.007879373617470264, 
 'token': 33338, 
 'token_str': 'persecuted', 
 'sequence': 'Why are immigrants so persecuted.'}
]

This bias may also affect all fine-tuned versions of this model.

Citation Information

If you use any of these resources, or if they are relevant to your work, please cite our EMNLP 2024 paper.

@inproceedings{nayeem-rafiei-2024-kidlm,
    title = "{K}id{LM}: Advancing Language Models for Children {--} Early Insights and Future Directions",
    author = "Nayeem, Mir Tafseer  and
      Rafiei, Davood",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.277",
    pages = "4813--4836",
    abstract = "Recent studies highlight the potential of large language models in creating educational tools for children, yet significant challenges remain in maintaining key child-specific properties such as linguistic nuances, cognitive needs, and safety standards. In this paper, we explore foundational steps toward the development of child-specific language models, emphasizing the necessity of high-quality pre-training data. We introduce a novel user-centric data collection pipeline that involves gathering and validating a corpus specifically written for and sometimes by children. Additionally, we propose a new training objective, Stratified Masking, which dynamically adjusts masking probabilities based on our domain-specific child language data, enabling models to prioritize vocabulary and concepts more suitable for children. Experimental evaluations demonstrate that our model excels in understanding lower grade-level text, maintains safety by avoiding stereotypes, and captures children{'}s unique preferences. Furthermore, we provide actionable insights for future research and development in child-specific language modeling.",
}
