ksterx committed on
Commit 6acd15c · verified · 1 Parent(s): a3f70ad

Upload 2 files

Files changed (2)
  1. README.md +6 -6
  2. logo.png +0 -0
README.md CHANGED
@@ -6,9 +6,9 @@ license: mit
  library_name: transformers
  ---

- ![SpiralAI RetNet-3b-ja-base](logo.jpg)
+ ![SpiralAI Spiral-RetNet-3b-base](logo.png)

- # SpiralAI RetNet-3b-ja-base
+ # SpiralAI Spiral-RetNet-3b-base

  We have conducted pre-training from scratch on the RetNet (https://arxiv.org/abs/2307.08621) architecture model 3b using a mixed dataset of Japanese and English.
  This model is released primarily for the basic research of "retention mechanism".
@@ -16,7 +16,7 @@ This model is released primarily for the basic research of "retention mechanism"
  # Model Description

  - **Developed by:** [SpiralAI](https://go-spiral.ai/)
- - **Model type:** The `SpiralAI RetNet-3b-ja-base` is a language model equipped with a retention mechanism. It uses the `cyberagent/calm2-7b-chat` tokenizer.
+ - **Model type:** The `SpiralAI Spiral-RetNet-3b-base` is a language model equipped with a retention mechanism. It uses the `cyberagent/calm2-7b-chat` tokenizer.
  - **Languages:** Japanese, English.
  - **License:** MIT
  - **Training:** Trained on 80b tokens.
@@ -98,15 +98,15 @@ Here we show the result of the last layer.

  ## Test loss comparison

- We compared the test loss of `Spiral-AI/RetNet-3b-ja-base` and `cyberagent/open-calm-3b` on different length of tokens.
+ We compared the test loss of `Spiral-AI/Spiral-RetNet-3b-base` and `cyberagent/open-calm-3b` on different length of tokens.
  The first 100 examples are extracted from `wikipedia-ja` for the test dataset.

  ![test_loss](loss_comparison.png)

  Key findings are:

- - The test loss of `Spiral-AI/RetNet-3b-ja-base` goes as low as `cyberagent/open-calm-3b`, showing the effectiveness of the retention mechanism.
+ - The test loss of `Spiral-AI/Spiral-RetNet-3b-base` goes as low as `cyberagent/open-calm-3b`, showing the effectiveness of the retention mechanism.
- - The explosion of test loss is suppressed in `Spiral-AI/RetNet-3b-ja-base` when the context length goes longer than 2,048 tokens (the maximum context length of training data; Note that `cyberagent/open-calm-3b` is trained on the same context length.).
+ - The explosion of test loss is suppressed in `Spiral-AI/Spiral-RetNet-3b-base` when the context length goes longer than 2,048 tokens (the maximum context length of training data; Note that `cyberagent/open-calm-3b` is trained on the same context length.).

  # Training Datasets

logo.png ADDED