PatrickHaller committed on
Commit 1968bf4 · verified · 1 Parent(s): f6915d5

Update README.md

Files changed (1)
  1. README.md +1 -49
README.md CHANGED
@@ -3,52 +3,4 @@ datasets:
  - PatrickHaller/dsir-pile-100M-words
 ---
 
-
- # Description
-
- This dataset is a sampled subset of the [Pile](https://huggingface.co/datasets/EleutherAI/pile) dataset.
- We used [DSIR](https://github.com/p-lambda/dsir), a data selection tool with importance resampling, for subsampling.
-
- The subset sample distribution is:
-
- ```json
- {
-   'Pile-CC': 198245,
-   'OpenWebText2': 122382,
-   'FreeLaw': 37517,
-   'USPTO Backgrounds': 10195,
-   'Wikipedia (en)': 8072,
-   'PubMed Central': 5849,
-   'PubMed Abstracts': 4965,
-   'Gutenberg (PG-19)': 2712,
-   'BookCorpus2': 2550,
-   'Books3': 2432,
-   'StackExchange': 1753,
-   'PhilPapers': 1560,
-   'YoutubeSubtitles': 1187,
-   'OpenSubtitles': 1015,
-   'ArXiv': 610,
-   'NIH ExPorter': 476,
-   'Enron Emails': 439,
-   'EuroParl': 419,
-   'Github': 390,
-   'HackerNews': 259
- }
- ```
-
-
- The dataset contains ~100M words of text. This can be checked with:
-
- ```python
- from datasets import load_dataset
-
- ds = load_dataset("PatrickHaller/dsir-pile-100M-words")
-
- count = 0
- for row in ds["train"]:
-     count += len(row["text"].split(" "))
-
- print(count)
-
- # Out: 99999861
- ```
+ Our model for the 2024 BabyLM challenge 100M words track.
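The removed README section credits DSIR's importance resampling for producing the subset. As a rough illustration of that idea only, here is a toy sketch: score each raw example by the log-likelihood ratio between a smoothed unigram model of the target data and one of the raw data, then draw a subset via Gumbel top-k (equivalent to sampling without replacement with probabilities proportional to the importance weights). This is not DSIR's actual implementation (which uses hashed n-gram features); all names and the corpora below are hypothetical.

```python
import math
import random
from collections import Counter

def unigram_logprob(text, counts, total, vocab_size):
    # Add-one-smoothed unigram log-probability of `text` under `counts`.
    return sum(
        math.log((counts[tok] + 1) / (total + vocab_size))
        for tok in text.split()
    )

def importance_resample(raw_texts, target_texts, k, seed=0):
    # Importance weight: log p_target(x) - log p_raw(x).
    # Adding Gumbel noise and keeping the top k samples without
    # replacement in proportion to the (exponentiated) weights.
    rng = random.Random(seed)
    raw_counts = Counter(t for x in raw_texts for t in x.split())
    tgt_counts = Counter(t for x in target_texts for t in x.split())
    raw_total = sum(raw_counts.values())
    tgt_total = sum(tgt_counts.values())
    vocab = len(set(raw_counts) | set(tgt_counts))
    scored = []
    for x in raw_texts:
        w = (unigram_logprob(x, tgt_counts, tgt_total, vocab)
             - unigram_logprob(x, raw_counts, raw_total, vocab))
        gumbel = -math.log(-math.log(rng.random()))
        scored.append((w + gumbel, x))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [x for _, x in scored[:k]]

# Toy corpora: the target distribution favors "cat" sentences.
raw = ["the cat sat", "quantum flux gauge", "a cat slept", "tensor gauge field"]
target = ["my cat purred", "the cat ran"]
subset = importance_resample(raw, target, k=2)
print(subset)
```

With real data, the raw pool would be the full Pile and the target a sample of the desired domain; the selected subset then matches the target's token statistics more closely than uniform sampling would.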