PatrickHaller committed
Update README.md
README.md CHANGED

@@ -3,52 +3,4 @@ datasets:
 - PatrickHaller/dsir-pile-100M-words
 ---
 
-# Description
-
-This dataset is a sampled subset of the [Pile](https://huggingface.co/datasets/EleutherAI/pile) dataset.
-We used [DSIR](https://github.com/p-lambda/dsir), a data-selection tool based on importance resampling, for the subsampling.
-
-The per-source sample distribution of the subset is:
-
-```json
-{
-  "Pile-CC": 198245,
-  "OpenWebText2": 122382,
-  "FreeLaw": 37517,
-  "USPTO Backgrounds": 10195,
-  "Wikipedia (en)": 8072,
-  "PubMed Central": 5849,
-  "PubMed Abstracts": 4965,
-  "Gutenberg (PG-19)": 2712,
-  "BookCorpus2": 2550,
-  "Books3": 2432,
-  "StackExchange": 1753,
-  "PhilPapers": 1560,
-  "YoutubeSubtitles": 1187,
-  "OpenSubtitles": 1015,
-  "ArXiv": 610,
-  "NIH ExPorter": 476,
-  "Enron Emails": 439,
-  "EuroParl": 419,
-  "Github": 390,
-  "HackerNews": 259
-}
-```
-
-The dataset contains ~100M words of text, which can be verified with:
-
-```python
-from datasets import load_dataset
-
-ds = load_dataset("PatrickHaller/dsir-pile-100M-words")
-
-# Count whitespace-separated words across the training split.
-count = 0
-for row in ds["train"]:
-    count += len(row["text"].split(" "))
-
-print(count)
-# Out: 99999861
-```
+Our model for the 2024 BabyLM challenge 100M words track.
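For context on the selection method named in the removed description: DSIR weights each raw example by the ratio of its likelihood under a target distribution to its likelihood under the raw pool, then resamples according to those weights. Below is a minimal, self-contained sketch of that importance-resampling idea using add-one-smoothed unigram models. This is illustrative only: it is not the DSIR API (DSIR uses hashed n-gram features), and every function name here is invented for the example.

```python
import math
import random
from collections import Counter


def fit_unigram(texts):
    """Token counts and total token count for a simple unigram model."""
    counts = Counter(tok for t in texts for tok in t.split())
    return counts, sum(counts.values())


def logprob(text, counts, total, vocab_size):
    # Add-one smoothed unigram log-likelihood of `text`.
    return sum(math.log((counts[tok] + 1) / (total + vocab_size))
               for tok in text.split())


def importance_weights(raw_texts, target_texts):
    """Estimate p_target(x) / p_raw(x) for every raw example."""
    t_counts, t_total = fit_unigram(target_texts)
    r_counts, r_total = fit_unigram(raw_texts)
    vocab = len(set(t_counts) | set(r_counts))
    return [math.exp(logprob(x, t_counts, t_total, vocab)
                     - logprob(x, r_counts, r_total, vocab))
            for x in raw_texts]


def resample(raw_texts, weights, k, seed=0):
    # Weighted draw (with replacement) favouring target-like examples.
    return random.Random(seed).choices(raw_texts, weights=weights, k=k)


raw = ["the cat sat on the mat", "int main ( void ) { return 0 ; }"]
target = ["a cat and a dog", "the dog sat"]
w = importance_weights(raw, target)
# The natural-language example receives a larger weight than the code example,
# so resampling toward this target pool favours prose over code.
```

The real DSIR pipeline follows the same three steps at scale: fit cheap bag-of-n-gram models on the target and raw pools, compute per-document importance weights, and resample the desired number of documents.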
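As a quick sanity check on the distribution table in the removed description, the per-source document counts can be tallied with `collections.Counter`; they sum to 403,027 documents, with Pile-CC as the largest source:

```python
from collections import Counter

# Per-source document counts copied from the removed distribution table.
distribution = Counter({
    "Pile-CC": 198245, "OpenWebText2": 122382, "FreeLaw": 37517,
    "USPTO Backgrounds": 10195, "Wikipedia (en)": 8072,
    "PubMed Central": 5849, "PubMed Abstracts": 4965,
    "Gutenberg (PG-19)": 2712, "BookCorpus2": 2550, "Books3": 2432,
    "StackExchange": 1753, "PhilPapers": 1560, "YoutubeSubtitles": 1187,
    "OpenSubtitles": 1015, "ArXiv": 610, "NIH ExPorter": 476,
    "Enron Emails": 439, "EuroParl": 419, "Github": 390,
    "HackerNews": 259,
})

total_docs = sum(distribution.values())
top_source, top_count = distribution.most_common(1)[0]
print(total_docs)   # 403027
print(top_source)   # Pile-CC
```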