PatrickHaller committed
Update README.md
README.md CHANGED

@@ -3,52 +3,4 @@ datasets:
 - PatrickHaller/dsir-pile-100M-words
 ---
 
-# Description
-
-This dataset is a sampled subset of the [Pile](https://huggingface.co/datasets/EleutherAI/pile) dataset.
-We used [DSIR](https://github.com/p-lambda/dsir), a data-selection tool based on importance resampling, for the subsampling.
-
-The per-source sample distribution of the subset is:
-
-```json
-{
-  "Pile-CC": 198245,
-  "OpenWebText2": 122382,
-  "FreeLaw": 37517,
-  "USPTO Backgrounds": 10195,
-  "Wikipedia (en)": 8072,
-  "PubMed Central": 5849,
-  "PubMed Abstracts": 4965,
-  "Gutenberg (PG-19)": 2712,
-  "BookCorpus2": 2550,
-  "Books3": 2432,
-  "StackExchange": 1753,
-  "PhilPapers": 1560,
-  "YoutubeSubtitles": 1187,
-  "OpenSubtitles": 1015,
-  "ArXiv": 610,
-  "NIH ExPorter": 476,
-  "Enron Emails": 439,
-  "EuroParl": 419,
-  "Github": 390,
-  "HackerNews": 259
-}
-```
-
-The dataset contains ~100M words of text, which can be verified with:
-
-```python
-from datasets import load_dataset
-
-ds = load_dataset("PatrickHaller/dsir-pile-100M-words")
-
-# Count whitespace-separated words across the training split.
-count = 0
-for row in ds["train"]:
-    count += len(row["text"].split(" "))
-
-print(count)
-# Out: 99999861
-```
+Our model for the 2024 BabyLM challenge 100M words track.
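For context on the selection method named in the removed description: DSIR weights each raw example by the ratio of its likelihood under a target distribution to its likelihood under the raw pool, then resamples according to those weights. Below is a minimal, self-contained sketch of that importance-resampling idea using add-one-smoothed unigram models. This is illustrative only: it is not the DSIR API (DSIR uses hashed n-gram features), and every function name here is invented for the example.

```python
import math
import random
from collections import Counter


def fit_unigram(texts):
    """Token counts and total token count for a simple unigram model."""
    counts = Counter(tok for t in texts for tok in t.split())
    return counts, sum(counts.values())


def logprob(text, counts, total, vocab_size):
    # Add-one smoothed unigram log-likelihood of `text`.
    return sum(math.log((counts[tok] + 1) / (total + vocab_size))
               for tok in text.split())


def importance_weights(raw_texts, target_texts):
    """Estimate p_target(x) / p_raw(x) for every raw example."""
    t_counts, t_total = fit_unigram(target_texts)
    r_counts, r_total = fit_unigram(raw_texts)
    vocab = len(set(t_counts) | set(r_counts))
    return [math.exp(logprob(x, t_counts, t_total, vocab)
                     - logprob(x, r_counts, r_total, vocab))
            for x in raw_texts]


def resample(raw_texts, weights, k, seed=0):
    # Weighted draw (with replacement) favouring target-like examples.
    return random.Random(seed).choices(raw_texts, weights=weights, k=k)


raw = ["the cat sat on the mat", "int main ( void ) { return 0 ; }"]
target = ["a cat and a dog", "the dog sat"]
w = importance_weights(raw, target)
# The natural-language example receives a larger weight than the code example,
# so resampling toward this target pool favours prose over code.
```

The real DSIR pipeline follows the same three steps at scale: fit cheap bag-of-n-gram models on the target and raw pools, compute per-document importance weights, and resample the desired number of documents.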
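As a quick sanity check on the distribution table in the removed description, the per-source document counts can be tallied with `collections.Counter`; they sum to 403,027 documents, with Pile-CC as the largest source:

```python
from collections import Counter

# Per-source document counts copied from the removed distribution table.
distribution = Counter({
    "Pile-CC": 198245, "OpenWebText2": 122382, "FreeLaw": 37517,
    "USPTO Backgrounds": 10195, "Wikipedia (en)": 8072,
    "PubMed Central": 5849, "PubMed Abstracts": 4965,
    "Gutenberg (PG-19)": 2712, "BookCorpus2": 2550, "Books3": 2432,
    "StackExchange": 1753, "PhilPapers": 1560, "YoutubeSubtitles": 1187,
    "OpenSubtitles": 1015, "ArXiv": 610, "NIH ExPorter": 476,
    "Enron Emails": 439, "EuroParl": 419, "Github": 390,
    "HackerNews": 259,
})

total_docs = sum(distribution.values())
top_source, top_count = distribution.most_common(1)[0]
print(total_docs)   # 403027
print(top_source)   # Pile-CC
```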