Commit c09153d by muellerzr (HF staff)
Parent(s): e8b88f6
Files changed (2):
  1. index.html +5 -5
  2. llm_conf.qmd +5 -5
index.html CHANGED
@@ -426,7 +426,7 @@
  <li>Backward ~= 2x the model size</li>
  <li>The optimizer step ~= 4x the model size (1x model, 1x gradients, 2x optimizer):</li>
  </ul>
- <div style="font-size: 50%;background-color: rgba(0,0,0,.1);">
+ <div style="font-size: 50%;background-color: rgba(0,0,0,.1);color: #93a1a1;">
  <table>
  <thead>
  <tr class="header">
@@ -465,7 +465,7 @@
  <p>This works fine for small models, we have cards with anywhere from 12-24GB of GPU memory (on the GPU-poor side).</p>
  <p>But what happens as we scale?</p>
  <p>Here’s <code>llama-3-8B</code> (8.03B parameters)</p>
- <div style="font-size: 50%;background-color: rgba(0,0,0,.1);">
+ <div style="font-size: 50%;background-color: rgba(0,0,0,.1);color: #93a1a1;">
  <table>
  <thead>
  <tr class="header">
@@ -698,7 +698,7 @@
  <li>Rely on <code>config.yaml</code> files</li>
  <li>Choose to either running <code>accelerate config</code> or write your own:</li>
  </ul>
- <div class="columns" style="font-size: 50%;padding-left:10%;background-color: rgba(0,0,0,.1);">
+ <div class="columns" style="font-size: 50%;padding-left:10%;background-color: rgba(0,0,0,.1);color: #93a1a1;">
  <div class="column" style="width:40%;">
  <div class="code-with-filename">
  <div class="code-with-filename-file">
@@ -804,7 +804,7 @@
  <ul>
  <li>Let’s tie that back up to the model estimator with neat tools like NVIDIA’s TransformerEngine</li>
  </ul>
- <div style="font-size: 60%;background-color: rgba(0,0,0,.1);">
+ <div style="font-size: 60%;background-color: rgba(0,0,0,.1);color: #93a1a1;">
  <table style="width:100%;">
  <colgroup>
  <col style="width: 14%">
@@ -894,7 +894,7 @@
  <ul>
  <li>Extremely similar, however mostly used different naming conventions for items and slight tweaks in the implementation</li>
  </ul>
- <div style="font-size: 50%;background-color: rgba(0,0,0,.1);">
+ <div style="font-size: 50%;background-color: rgba(0,0,0,.1);color: #93a1a1;">
  <table style="width:100%;">
  <colgroup>
  <col style="width: 16%">
 
llm_conf.qmd CHANGED
@@ -28,7 +28,7 @@ General estimate (`bert-base-cased`, 108M params):
  - Backward ~= 2x the model size
  - The optimizer step ~= 4x the model size (1x model, 1x gradients, 2x optimizer):
 
- ::: {style="font-size: 50%;background-color: rgba(0,0,0,.1);"}
+ ::: {style="font-size: 50%;background-color: rgba(0,0,0,.1);color: #93a1a1;"}
  | dtype | Model | Gradients | Backward pass | Optimizer step | Highest |
  |---------|:-----|:------:|:------:|:------:|:------:|
  | float32 | 413.18 MB | 413.18 MB | 826.36 MB | 1.61 GB | 1.61 GB |
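
The `bert-base-cased` row above follows directly from parameter count times bytes-per-dtype, reported in binary megabytes. A minimal sketch of the arithmetic, assuming an Adam-style optimizer that keeps two extra fp32 states per parameter; `estimate_training_memory` is a hypothetical helper for this note, not the deck's actual estimator (`accelerate estimate-memory`):

```python
import torch

def estimate_training_memory(num_params: int, dtype: torch.dtype = torch.float32) -> dict:
    """Rough per-step training memory, mirroring the table's columns."""
    bytes_per_param = torch.finfo(dtype).bits // 8
    model = num_params * bytes_per_param   # 1x: the weights themselves
    gradients = model                      # 1x: one gradient per weight
    backward = 2 * model                   # ~2x: peak during the backward pass
    optimizer = 4 * model                  # 1x model + 1x grads + 2x Adam states
    mb = 2 ** 20                           # binary MB, matching the table's units
    return {name: round(val / mb, 2) for name, val in
            [("model", model), ("gradients", gradients),
             ("backward", backward), ("optimizer", optimizer)]}

print(estimate_training_memory(108_310_000))
# ~413 MB model and ~1652 MB (1.61 GB) optimizer step,
# matching the fp32 row above up to rounding.
```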
@@ -45,7 +45,7 @@ But what happens as we scale?
 
  Here's `llama-3-8B` (8.03B parameters)
 
- ::: {style="font-size: 50%;background-color: rgba(0,0,0,.1);"}
+ ::: {style="font-size: 50%;background-color: rgba(0,0,0,.1);color: #93a1a1;"}
  | dtype | Model | Gradients | Backward pass | Optimizer step | Highest |
  |---------|:-----|:------:|:------:|:------:|:------:|
  | float32 | 28.21 GB | 28.21 GB | 56.43 GB | 112.84 GB | 112.84 GB |
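
The same arithmetic scales linearly, which is the slide's point: at 8B parameters the fp32 optimizer step lands in the hundred-gigabyte range. Reusing the hypothetical helper from the previous note:

```python
# Values come back in binary MB; divide by 1024 for GB.
est = estimate_training_memory(8_030_000_000)
print({name: round(mb / 1024, 2) for name, mb in est.items()})
# Roughly 30 GB for the fp32 weights alone and ~120 GB for the optimizer
# step; the same order of magnitude as the llama-3-8B row above.
```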
@@ -202,7 +202,7 @@ accelerate launch script.py
  * Rely on `config.yaml` files
  * Choose to either running `accelerate config` or write your own:
 
- :::: {.columns style="font-size: 50%;padding-left:10%;background-color: rgba(0,0,0,.1);"}
+ :::: {.columns style="font-size: 50%;padding-left:10%;background-color: rgba(0,0,0,.1);color: #93a1a1;"}
  ::: {.column width="40%"}
  ```{.yaml filename=ddp_config.yaml}
  compute_environment: LOCAL_MACHINE
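
For the "write your own" path, a non-interactive starting point is `accelerate.utils.write_basic_config`; a sketch (the save location in the comment is the library's usual default, stated here as an assumption):

```python
from accelerate.utils import write_basic_config

# Writes a minimal single-machine config that `accelerate launch` will pick
# up (typically ~/.cache/huggingface/accelerate/default_config.yaml), which
# can then be edited by hand instead of answering `accelerate config` prompts.
write_basic_config()
```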
@@ -302,7 +302,7 @@ for batch in dataloader:
 
  * Let's tie that back up to the model estimator with neat tools like NVIDIA's TransformerEngine
 
- ::: {style="font-size: 60%;background-color: rgba(0,0,0,.1);"}
+ ::: {style="font-size: 60%;background-color: rgba(0,0,0,.1);color: #93a1a1;"}
  | Optimization Level | Computation (GEMM) | Comm | Weight | Master Weight | Weight Gradient | Optimizer States |
  | -- | -- | -- | -- | -- | -- | -- |
  | FP16 AMP | FP16 | FP32 | FP32 | N/A | FP32 | FP32+FP32 |
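
In `accelerate`, choosing between these optimization levels reduces to one constructor argument; a minimal sketch (the fp8 path is assumed to require TransformerEngine to be installed):

```python
from accelerate import Accelerator

# One of "no", "fp16", "bf16", or "fp8"; "fp8" is expected to dispatch to
# TransformerEngine-style kernels when the hardware and library support it.
accelerator = Accelerator(mixed_precision="bf16")
```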
@@ -326,7 +326,7 @@ What is actually happening:
 
  * Extremely similar, however mostly used different naming conventions for items and slight tweaks in the implementation
 
- ::: {style="font-size: 50%;background-color: rgba(0,0,0,.1);"}
+ ::: {style="font-size: 50%;background-color: rgba(0,0,0,.1);color: #93a1a1;"}
  Framework | Model Loading (`torch_dtype`) | Mixed Precision | Preparation (Local) | Training | Optimizer (Local)
  --|--|--|--|--|--
  FSDP | bf16 | default (none) | bf16 | bf16 | bf16
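
Reading the FSDP row concretely: the loading dtype (`torch_dtype`) and training-time mixed precision are independent knobs. A sketch assuming `transformers` is available; the checkpoint name is just an example:

```python
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

# "Model Loading (torch_dtype)" column: weights arrive in bf16.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)

# The "Mixed Precision" column reads "default (none)" for FSDP here,
# so no mixed_precision argument is passed.
accelerator = Accelerator()
model = accelerator.prepare(model)
```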
 