Commit 2885f1f · jacobfulano
1 Parent(s): c66f045
Update README.md
README.md CHANGED
@@ -7,7 +7,8 @@ language:
 ---

 # MosaicBERT-Base model
-MosaicBERT-Base is a new BERT architecture and training recipe optimized for fast pretraining.

 ### Model Date

@@ -69,7 +70,36 @@ the English [“Colossal, Cleaned, Common Crawl” C4 dataset](https://github.co
 from the internet (equivalent to 156 billion tokens). We used this more modern dataset in place of traditional BERT pretraining
 corpora like English Wikipedia and BooksCorpus.

-##

 ## Evaluation results

 ---

 # MosaicBERT-Base model
+MosaicBERT-Base is a new BERT architecture and training recipe optimized for fast pretraining.
+MosaicBERT-Base achieves higher pretraining and finetuning accuracy than [bert-base-uncased](https://huggingface.co/bert-base-uncased).

 ### Model Date

 from the internet (equivalent to 156 billion tokens). We used this more modern dataset in place of traditional BERT pretraining
 corpora like English Wikipedia and BooksCorpus.

+## Pretraining Optimizations
+
+Many of the pretraining optimizations below were informed by our [BERT results for the MLPerf v2.1 speed benchmark](https://www.mosaicml.com/blog/mlperf-nlp-nov2022).
+
+1. MosaicML Streaming Dataset
+As part of our efficiency pipeline, we converted the C4 dataset to [MosaicML’s StreamingDataset format](https://www.mosaicml.com/blog/mosaicml-streamingdataset) and used this
+for both MosaicBERT-Base and the baseline BERT-Base. For all BERT-Base models, we chose the training duration to be 286,720,000 samples of sequence length 128; this covers 78.6% of C4.
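
The training-duration figures above can be sanity-checked with a little arithmetic. The sketch below is only a back-of-the-envelope check: it assumes that every sample fills all 128 positions (so the token count is an upper bound) and that the 78.6% figure counts samples rather than tokens. The variable names are illustrative and do not come from the MosaicML code.

```python
# Figures quoted in the text above.
TRAIN_SAMPLES = 286_720_000   # training duration in samples
MAX_SEQ_LEN = 128             # tokens per sample (maximum)
C4_FRACTION = 0.786           # share of C4 covered

# Upper bound on tokens seen: assumes no sample is shorter than 128 tokens.
max_tokens_seen = TRAIN_SAMPLES * MAX_SEQ_LEN

# Dataset size implied by the 78.6% figure, ASSUMING it counts samples.
implied_c4_samples = TRAIN_SAMPLES / C4_FRACTION

print(f"at most {max_tokens_seen:,} tokens seen")
print(f"implied C4 size: about {implied_c4_samples / 1e6:.0f}M samples")
```

Under these assumptions the run sees at most roughly 36.7 billion tokens.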
+
+
+3. Higher Masking Ratio for the Masked Language Modeling Objective
+We used the standard Masked Language Modeling (MLM) pretraining objective.
+While the original BERT paper also included a Next Sentence Prediction (NSP) task in the pretraining objective,
+subsequent papers have shown this to be unnecessary [Liu et al. 2019](https://arxiv.org/abs/1907.11692). For Hugging Face BERT-Base, we used the standard 15% masking ratio.
+However, we found that a 30% masking ratio led to slight accuracy improvements in both pretraining MLM and downstream GLUE performance.
+We therefore included this simple change as part of our MosaicBERT training recipe. Recent studies have also found that a higher masking
+ratio can lead to downstream improvements [Wettig et al. 2022](https://arxiv.org/abs/2202.08005).
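
To make the masking ratio concrete, here is a minimal, dependency-free sketch of MLM token masking. It is an illustration and not the MosaicBERT data pipeline: the 80/10/10 replacement split is the one from the original BERT paper, and whether MosaicBERT uses the same split is not stated above. The `[MASK]` id of 103 and the vocabulary size of 30,522 are assumptions that correspond to the `bert-base-uncased` tokenizer, and `mask_tokens` is a hypothetical helper name. A real pipeline would also skip special tokens such as `[CLS]` and `[SEP]`, which this sketch does not.

```python
import random

IGNORE_INDEX = -100  # label value conventionally skipped by the loss


def mask_tokens(token_ids, mask_id=103, vocab_size=30_522,
                mask_ratio=0.30, seed=None):
    """Return (inputs, labels) for masked language modeling.

    Each position is selected with probability `mask_ratio`. A selected
    position is replaced by the mask token 80% of the time, by a random
    token 10% of the time, and left unchanged 10% of the time.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, token in enumerate(token_ids):
        if rng.random() >= mask_ratio:
            continue  # position not selected; it contributes no loss
        labels[i] = token  # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)
        # otherwise keep the original token in place
    return inputs, labels


tokens = [1000 + (i % 500) for i in range(10_000)]  # dummy token ids
inputs, labels = mask_tokens(tokens, mask_ratio=0.30, seed=0)
selected = sum(label != IGNORE_INDEX for label in labels)
print(f"selected {selected / len(tokens):.1%} of positions")
```

Changing `mask_ratio` from 0.15 to 0.30 is the only difference between the two settings compared in the text.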
+
+4. Bfloat16 Precision
+We use [bf16 (bfloat16) mixed precision training](https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus) for all the models, where a matrix multiplication layer uses bf16
+for the multiplication and 32-bit IEEE floating point for gradient accumulation. We found this to be more stable than using float16 mixed precision.
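
The following sketch illustrates why accumulation is kept in higher precision than the multiplication. It emulates bfloat16 in pure Python by rounding a float32 bit pattern to its top 16 bits (1 sign bit, 8 exponent bits, 7 mantissa bits). This is a teaching aid under stated assumptions, not how a training framework implements mixed precision: `to_bfloat16` is a hypothetical helper, NaN and infinity handling is omitted, and a plain Python float stands in for the 32-bit accumulator.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value, returned as a Python float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Round to nearest, ties to even, on the 16 low bits that are dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


# bfloat16 keeps the float32 exponent range, so large values survive
# (float16, by contrast, has a largest finite value of 65504).
big = to_bfloat16(1e30)

# Accumulate 1,000 updates of 0.001 in two different accumulators.
update = 0.001
low_precision_sum = 0.0    # rounded to bfloat16 after every addition
high_precision_sum = 0.0   # kept in higher precision throughout
for _ in range(1000):
    low_precision_sum = to_bfloat16(low_precision_sum + update)
    high_precision_sum += update

print("bf16 accumulator:            ", low_precision_sum)
print("higher-precision accumulator:", high_precision_sum)
```

The low-precision accumulator stalls once the running total is large enough that a single update falls below half of its rounding step, which is the failure mode that higher-precision accumulation avoids.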
+
+5. Vocab Size as a Multiple of 64
+We increased the vocab size to be a multiple of 8 as well as 64 (i.e. from 30,522 to 30,528).
+This small constraint is something of [a magic trick among ML practitioners](https://twitter.com/karpathy/status/1621578354024677377), and leads to a throughput speedup.
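
The padded vocabulary size quoted above follows from rounding up to the next multiple of 64. A minimal sketch, where `pad_to_multiple` is a hypothetical helper name rather than a function from the MosaicML codebase:

```python
def pad_to_multiple(size: int, multiple: int = 64) -> int:
    """Round `size` up to the nearest multiple of `multiple`."""
    return ((size + multiple - 1) // multiple) * multiple


original_vocab = 30_522
padded_vocab = pad_to_multiple(original_vocab, 64)
print(padded_vocab)            # 30528
print(padded_vocab % 8 == 0)   # True: any multiple of 64 is a multiple of 8
```

The six extra embedding rows are never produced by the tokenizer; they only pad the matrix dimensions.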
+
+6. Hyperparameters
+For all models, we use Decoupled AdamW with Beta1=0.9 and Beta2=0.98, and a weight decay value of 1.0e-5. The learning rate schedule begins with a warmup to a maximum learning rate of 5.0e-4 followed by a linear decay to zero. Warmup lasted for 6% of the full training duration. With a global batch size of 4096 and a microbatch size of 128, full pretraining consisted of 70,000 batches. We set the maximum sequence length during pretraining to 128, and we used the standard embedding dimension of 768. These hyperparameters were the same for MosaicBERT-Base and the baseline BERT-Base.
+For the baseline BERT, we applied the standard 0.1 dropout to both the attention and feedforward layers of the transformer block. For MosaicBERT, however, we applied 0.1 dropout to the feedforward layers but no dropout to the FlashAttention module, as this was not possible with the OpenAI Triton implementation.
+Full configuration details for pretraining MosaicBERT-Base can be found in the configuration YAMLs [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/tree/main/bert/yamls/main).
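
The batch count and learning-rate schedule described in the hyperparameters above can be sketched as follows. This is an approximation for illustration only: it assumes the warmup is linear (the text says only "warmup") and that the schedule is stepped once per global batch. The constant and function names are hypothetical; the linked YAMLs remain the authoritative configuration.

```python
# Values quoted in the text above.
TOTAL_SAMPLES = 286_720_000
GLOBAL_BATCH_SIZE = 4096
MICROBATCH_SIZE = 128
MAX_LR = 5.0e-4
WARMUP_FRACTION = 0.06

TOTAL_BATCHES = TOTAL_SAMPLES // GLOBAL_BATCH_SIZE           # 70,000
WARMUP_BATCHES = round(WARMUP_FRACTION * TOTAL_BATCHES)      # 4,200
# How these microbatches are split across devices and gradient
# accumulation steps is not stated in the text.
MICROBATCHES_PER_BATCH = GLOBAL_BATCH_SIZE // MICROBATCH_SIZE  # 32


def learning_rate(step: int) -> float:
    """Linear warmup to MAX_LR (assumed shape), then linear decay to zero."""
    if step < WARMUP_BATCHES:
        return MAX_LR * step / WARMUP_BATCHES
    remaining = max(TOTAL_BATCHES - step, 0)
    return MAX_LR * remaining / (TOTAL_BATCHES - WARMUP_BATCHES)


for step in (0, 2_100, 4_200, 37_100, 70_000):
    print(f"step {step:>6}: lr = {learning_rate(step):.6f}")
```

The peak learning rate is reached at batch 4,200 and the rate returns to zero at batch 70,000.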
+

 ## Evaluation results
