Commit c09153d by muellerzr (HF staff)
Parent(s): e8b88f6
Files changed (2):
  1. index.html +5 -5
  2. llm_conf.qmd +5 -5
index.html CHANGED
@@ -426,7 +426,7 @@
  <li>Backward ~= 2x the model size</li>
  <li>The optimizer step ~= 4x the model size (1x model, 1x gradients, 2x optimizer):</li>
  </ul>
- <div style="font-size: 50%;background-color: rgba(0,0,0,.1);">
+ <div style="font-size: 50%;background-color: rgba(0,0,0,.1);color: #93a1a1;">
  <table>
  <thead>
  <tr class="header">
@@ -465,7 +465,7 @@
  <p>This works fine for small models, we have cards with anywhere from 12-24GB of GPU memory (on the GPU-poor side).</p>
  <p>But what happens as we scale?</p>
  <p>Here’s <code>llama-3-8B</code> (8.03B parameters)</p>
- <div style="font-size: 50%;background-color: rgba(0,0,0,.1);">
+ <div style="font-size: 50%;background-color: rgba(0,0,0,.1);color: #93a1a1;">
  <table>
  <thead>
  <tr class="header">
@@ -698,7 +698,7 @@
  <li>Rely on <code>config.yaml</code> files</li>
  <li>Choose to either running <code>accelerate config</code> or write your own:</li>
  </ul>
- <div class="columns" style="font-size: 50%;padding-left:10%;background-color: rgba(0,0,0,.1);">
+ <div class="columns" style="font-size: 50%;padding-left:10%;background-color: rgba(0,0,0,.1);color: #93a1a1;">
  <div class="column" style="width:40%;">
  <div class="code-with-filename">
  <div class="code-with-filename-file">
@@ -804,7 +804,7 @@
  <ul>
  <li>Let’s tie that back up to the model estimator with neat tools like NVIDIA’s TransformerEngine</li>
  </ul>
- <div style="font-size: 60%;background-color: rgba(0,0,0,.1);">
+ <div style="font-size: 60%;background-color: rgba(0,0,0,.1);color: #93a1a1;">
  <table style="width:100%;">
  <colgroup>
  <col style="width: 14%">
@@ -894,7 +894,7 @@
  <ul>
  <li>Extremely similar, however mostly used different naming conventions for items and slight tweaks in the implementation</li>
  </ul>
- <div style="font-size: 50%;background-color: rgba(0,0,0,.1);">
+ <div style="font-size: 50%;background-color: rgba(0,0,0,.1);color: #93a1a1;">
  <table style="width:100%;">
  <colgroup>
  <col style="width: 16%">
 
llm_conf.qmd CHANGED
@@ -28,7 +28,7 @@ General estimate (`bert-base-cased`, 108M params):
  - Backward ~= 2x the model size
  - The optimizer step ~= 4x the model size (1x model, 1x gradients, 2x optimizer):
 
- ::: {style="font-size: 50%;background-color: rgba(0,0,0,.1);"}
+ ::: {style="font-size: 50%;background-color: rgba(0,0,0,.1);color: #93a1a1;"}
  | dtype | Model | Gradients | Backward pass | Optimizer step | Highest |
  |---------|:-----|:------:|:------:|:------:|:------:|
  | float32 | 413.18 MB | 413.18 MB | 826.36 MB | 1.61 GB | 1.61 GB |
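
The `bert-base-cased` row above follows directly from parameter count times bytes-per-dtype, reported in binary megabytes. A minimal sketch of the arithmetic, assuming an Adam-style optimizer that keeps two extra fp32 states per parameter; `estimate_training_memory` is a hypothetical helper for this note, not the deck's actual estimator (`accelerate estimate-memory`):

```python
import torch

def estimate_training_memory(num_params: int, dtype: torch.dtype = torch.float32) -> dict:
    """Rough per-step training memory, mirroring the table's columns."""
    bytes_per_param = torch.finfo(dtype).bits // 8
    model = num_params * bytes_per_param   # 1x: the weights themselves
    gradients = model                      # 1x: one gradient per weight
    backward = 2 * model                   # ~2x: peak during the backward pass
    optimizer = 4 * model                  # 1x model + 1x grads + 2x Adam states
    mb = 2 ** 20                           # binary MB, matching the table's units
    return {name: round(val / mb, 2) for name, val in
            [("model", model), ("gradients", gradients),
             ("backward", backward), ("optimizer", optimizer)]}

print(estimate_training_memory(108_310_000))
# ~413 MB model and ~1652 MB (1.61 GB) optimizer step,
# matching the fp32 row above up to rounding.
```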
@@ -45,7 +45,7 @@ But what happens as we scale?
 
  Here's `llama-3-8B` (8.03B parameters)
 
- ::: {style="font-size: 50%;background-color: rgba(0,0,0,.1);"}
+ ::: {style="font-size: 50%;background-color: rgba(0,0,0,.1);color: #93a1a1;"}
  | dtype | Model | Gradients | Backward pass | Optimizer step | Highest |
  |---------|:-----|:------:|:------:|:------:|:------:|
  | float32 | 28.21 GB | 28.21 GB | 56.43 GB | 112.84 GB | 112.84 GB |
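
The same arithmetic scales linearly, which is the slide's point: at 8B parameters the fp32 optimizer step lands in the hundred-gigabyte range. Reusing the hypothetical helper from the previous note:

```python
# Values come back in binary MB; divide by 1024 for GB.
est = estimate_training_memory(8_030_000_000)
print({name: round(mb / 1024, 2) for name, mb in est.items()})
# Roughly 30 GB for the fp32 weights alone and ~120 GB for the optimizer
# step; the same order of magnitude as the llama-3-8B row above.
```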
@@ -202,7 +202,7 @@ accelerate launch script.py
  * Rely on `config.yaml` files
  * Choose to either running `accelerate config` or write your own:
 
- :::: {.columns style="font-size: 50%;padding-left:10%;background-color: rgba(0,0,0,.1);"}
+ :::: {.columns style="font-size: 50%;padding-left:10%;background-color: rgba(0,0,0,.1);color: #93a1a1;"}
  ::: {.column width="40%"}
  ```{.yaml filename=ddp_config.yaml}
  compute_environment: LOCAL_MACHINE
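
For the "write your own" path, a non-interactive starting point is `accelerate.utils.write_basic_config`; a sketch (the save location in the comment is the library's usual default, stated here as an assumption):

```python
from accelerate.utils import write_basic_config

# Writes a minimal single-machine config that `accelerate launch` will pick
# up (typically ~/.cache/huggingface/accelerate/default_config.yaml), which
# can then be edited by hand instead of answering `accelerate config` prompts.
write_basic_config()
```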
@@ -302,7 +302,7 @@ for batch in dataloader:
 
  * Let's tie that back up to the model estimator with neat tools like NVIDIA's TransformerEngine
 
- ::: {style="font-size: 60%;background-color: rgba(0,0,0,.1);"}
+ ::: {style="font-size: 60%;background-color: rgba(0,0,0,.1);color: #93a1a1;"}
  | Optimization Level | Computation (GEMM) | Comm | Weight | Master Weight | Weight Gradient | Optimizer States |
  | -- | -- | -- | -- | -- | -- | -- |
  | FP16 AMP | FP16 | FP32 | FP32 | N/A | FP32 | FP32+FP32 |
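
In `accelerate`, choosing between these optimization levels reduces to one constructor argument; a minimal sketch (the fp8 path is assumed to require TransformerEngine to be installed):

```python
from accelerate import Accelerator

# One of "no", "fp16", "bf16", or "fp8"; "fp8" is expected to dispatch to
# TransformerEngine-style kernels when the hardware and library support it.
accelerator = Accelerator(mixed_precision="bf16")
```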
@@ -326,7 +326,7 @@ What is actually happening:
 
  * Extremely similar, however mostly used different naming conventions for items and slight tweaks in the implementation
 
- ::: {style="font-size: 50%;background-color: rgba(0,0,0,.1);"}
+ ::: {style="font-size: 50%;background-color: rgba(0,0,0,.1);color: #93a1a1;"}
  Framework | Model Loading (`torch_dtype`) | Mixed Precision | Preparation (Local) | Training | Optimizer (Local)
  --|--|--|--|--|--
  FSDP | bf16 | default (none) | bf16 | bf16 | bf16
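
Reading the FSDP row concretely: the loading dtype (`torch_dtype`) and training-time mixed precision are independent knobs. A sketch assuming `transformers` is available; the checkpoint name is just an example:

```python
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

# "Model Loading (torch_dtype)" column: weights arrive in bf16.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)

# The "Mixed Precision" column reads "default (none)" for FSDP here,
# so no mixed_precision argument is passed.
accelerator = Accelerator()
model = accelerator.prepare(model)
```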
 