Commit f28a628 · Andrew DalPino
Parent(s): 160e81f

Add FSDP

Files changed:
- README.md +8 -6
- beam_search.py +1 -1
- model.py +1 -2
- pre-train.py +32 -17
README.md CHANGED

@@ -9,19 +9,20 @@ metrics:
 - perplexity
 pipeline_tag: text-generation
 tags:
-- …
+- LightGPT
+- Open-source
 ---
 # LightGPT

-LightGPT is a lightweight generative pre-trained Transformer (GPT) model for the people! Built using pure PyTorch, LightGPT can generate text, answer questions, summarize documents, and more …
+LightGPT is a lightweight generative pre-trained Transformer (GPT) model for the people! Built using pure PyTorch, LightGPT can generate text, answer questions, summarize documents, and more. A unique feature of LightGPT is that it allows you to train larger models on smaller hardware by taking advantage of memory optimizations wherever possible.

-## …
+## Features

-- **Parameter-efficiency**: LightGPT aims to be a more parsimonious model by only training parameters that are absolutely necessary. As such, biases and positional embeddings have been completely removed from the …
+- **Parameter-efficiency**: LightGPT aims to be a more parsimonious model by training only the parameters that are absolutely necessary. As such, biases and positional embeddings have been completely removed from the architecture. In addition, the token embeddings and the output layer share weight matrices, resulting in a buy-one-get-one-free deal on trainable parameters.

-- **Low …
+- **Low Memory Utilization**: LightGPT employs a number of training-time optimizations that conserve precious VRAM. With zero-redundancy distributed pre-training using fully-sharded data parallel (FSDP), activation checkpointing, and automatic mixed precision, you'll be able to train larger models while accepting only a modest amount of communication and computational overhead.

-- **Fully …
+- **Fully Open-source**: Unlike closed-source LLMs, LightGPT provides both the model weights *and* the source code to train, fine-tune, and generate text from the model on your own hardware. With the help of the open-source software community, we aim to democratize AI and continually improve the models.

 ## Install Project Dependencies

@@ -100,6 +101,7 @@ Soon ...
 | --num_hidden_layers | 24 | int | The number of attention/MLP blocks within the hidden layer of the network. |
 | --dropout | 0.1 | float | The proportion of signals to send to zero during training as regularization. |
 | --activation_checkpointing | False | bool | Should we use activation checkpointing? |
+| --ddp_sharding_level | 2 | int | The level of sharding (0, 2, or 3) to use for distributed data-parallel training. |
 | --checkpoint_interval | 20 | int | Save the model parameters to disk every this many epochs. |
 | --checkpoint_path | "./out/checkpoint.pt" | string | The path to the checkpoint file on disk. |
 | --dataset_path | "./dataset" | string | The path to the dataset files on disk. |
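The Parameter-efficiency bullet above mentions that the token embeddings and the output layer share weight matrices. The snippet below is a minimal sketch of that weight-tying idea in plain PyTorch; the class and attribute names are illustrative and are not taken from model.py.

```python
import torch
from torch import nn


class TinyTiedLM(nn.Module):
    """Toy decoder-style LM that ties the input embedding and the output projection."""

    def __init__(self, vocab_size: int, embedding_dim: int) -> None:
        super().__init__()

        self.token_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # With no bias, the output layer is just a (vocab_size, embedding_dim) matrix ...
        self.output_layer = nn.Linear(embedding_dim, vocab_size, bias=False)

        # ... so it can reuse the embedding weights instead of allocating its own copy.
        self.output_layer.weight = self.token_embeddings.weight

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.token_embeddings(tokens)  # (batch, seq_len, embedding_dim)

        return self.output_layer(x)  # (batch, seq_len, vocab_size) logits


model = TinyTiedLM(vocab_size=32, embedding_dim=8)

# The shared matrix is counted only once: 32 * 8 = 256 trainable parameters.
print(sum(parameter.numel() for parameter in model.parameters()))
```

Together with dropping biases and positional embeddings, this sharing is where the "buy-one-get-one-free" parameter savings described in the README come from.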
beam_search.py CHANGED

@@ -20,7 +20,7 @@ def main():

     parser.add_argument("--checkpoint_path", default="./out/checkpoint.pt", type=str)
     parser.add_argument("--lora_path", default=None, type=str)
-    parser.add_argument("--max_tokens", default=…
+    parser.add_argument("--max_tokens", default=500, type=int)
     parser.add_argument("--num_candidates", default=3, type=int)
     parser.add_argument("--beam_width", default=16, type=int)
     parser.add_argument("--device", default="cuda", type=str)
model.py CHANGED

@@ -215,13 +215,12 @@ class GPT(Module):
     log_probability: float
     tokens: Tensor

-    @property
     def priority(self) -> float:
         return self.log_probability

     sort_candidates = partial(
         sorted,
-        key=lambda candidate: candidate.priority,
+        key=lambda candidate: candidate.priority(),
         reverse=True,
     )

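The model.py change removes the @property decorator from priority, so the sort key must now call the method; if the lambda still referenced candidate.priority without the parentheses, sorted would compare bound-method objects and raise a TypeError. Below is a self-contained sketch of the same sorting pattern; the Candidate class here is illustrative rather than the one defined in model.py.

```python
from dataclasses import dataclass, field
from functools import partial


@dataclass
class Candidate:
    """A beam-search hypothesis scored by its cumulative log probability."""

    log_probability: float
    tokens: list[int] = field(default_factory=list)

    def priority(self) -> float:
        # Plain method (no @property), so callers have to invoke it explicitly.
        return self.log_probability


# Highest log probability (most likely sequence) first.
sort_candidates = partial(
    sorted,
    key=lambda candidate: candidate.priority(),
    reverse=True,
)

candidates = [Candidate(-4.2), Candidate(-1.3), Candidate(-2.7)]

for candidate in sort_candidates(candidates):
    print(candidate.log_probability)  # -1.3, then -2.7, then -4.2
```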
pre-train.py CHANGED

@@ -15,8 +15,7 @@ from torch.amp import autocast
 from torch.cuda import set_device, is_available as cuda_is_available, is_bf16_supported
 from torch.nn.utils import clip_grad_norm_
 from torch.distributed import init_process_group, destroy_process_group
-from torch.distributed.…
-from torch.nn.parallel import DistributedDataParallel
+from torch.distributed.fsdp import FullyShardedDataParallel, ShardingStrategy

 from torchmetrics.text import Perplexity

@@ -33,7 +32,7 @@ IS_DDP = WORLD_SIZE > 1

 IS_MASTER = RANK == 0 or not IS_DDP

-DDP_BACKEND = "nccl"
+DDP_BACKEND = "nccl"


 def main():

@@ -51,6 +50,7 @@ def main():
     parser.add_argument("--num_attention_heads", default=16, type=int)
     parser.add_argument("--num_hidden_layers", default=32, type=int)
     parser.add_argument("--activation_checkpointing", action="store_true")
+    parser.add_argument("--ddp_sharding_level", default=2, choices=[0, 2, 3])
     parser.add_argument("--eval_interval", default=10, type=int)
     parser.add_argument("--checkpoint_interval", default=20, type=int)
     parser.add_argument("--checkpoint_path", default="./out/checkpoint.pt", type=str)

@@ -175,19 +175,25 @@ def main():
     model = GPT(**model_args, activation_checkpointing=args.activation_checkpointing)

     if IS_DDP:
-        …
+        match args.ddp_sharding_level:
+            case 0:
+                sharding_strategy = ShardingStrategy.NO_SHARD
+            case 2:
+                sharding_strategy = ShardingStrategy.SHARD_GRAD_OP
+            case 3:
+                sharding_strategy = ShardingStrategy.FULL_SHARD
+
+        model = FullyShardedDataParallel(
+            model,
+            device_id=LOCAL_RANK,
+            sharding_strategy=sharding_strategy,
+            use_orig_params=True,
+        )

     print("Compiling model")
     model = torch.compile(model).to(args.device)

-    if IS_DDP:
-        optimizer = ZeroRedundancyOptimizer(
-            model.parameters(),
-            optimizer_class=Adafactor,
-            lr=args.learning_rate,
-        )
-    else:
-        optimizer = Adafactor(model.parameters(), lr=args.learning_rate)
+    optimizer = Adafactor(model.parameters(), lr=args.learning_rate)

     starting_epoch = 1

@@ -210,7 +216,7 @@ def main():

     perplexity_metric = Perplexity(ignore_index=training.PADDING_INDEX).to(args.device)

-    …
+    register_signal_handlers()

     print("Pre-training ...")

@@ -294,19 +300,28 @@ def main():
     print("Checkpoint saved")

     if IS_DDP:
-        …
+        ddp_cleanup()

     print("Done!")


-def …
-    …
+def register_signal_handlers():
+    signal.signal(signal.SIGINT, shutdown)
+    signal.signal(signal.SIGTERM, shutdown)
+
+
+def shutdown(signum, frame):
+    print("Hold on, attempting to exit gracefully")

     if IS_DDP:
-        …
+        ddp_cleanup()

     sys.exit(0)


+def ddp_cleanup():
+    destroy_process_group()
+
+
 if __name__ == "__main__":
     main()
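For context on the new --ddp_sharding_level option: the three levels map onto PyTorch's FSDP sharding strategies, each trading memory for communication differently. The helper below is a hypothetical sketch that mirrors the level-to-strategy mapping added in pre-train.py, with the trade-offs spelled out in comments; the function name and the error-handling branch are illustrative and not part of the repository.

```python
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel, ShardingStrategy


def wrap_for_fsdp(model: nn.Module, sharding_level: int, local_rank: int) -> FullyShardedDataParallel:
    """Wrap a model in FSDP using the same level-to-strategy mapping as pre-train.py."""

    match sharding_level:
        case 0:
            # No sharding: every rank keeps full parameters, gradients, and optimizer
            # state, much like classic DDP. Fastest, but highest memory per GPU.
            sharding_strategy = ShardingStrategy.NO_SHARD
        case 2:
            # Shard gradients and optimizer state across ranks (ZeRO-2 style); parameters
            # stay unsharded during the forward and backward passes.
            sharding_strategy = ShardingStrategy.SHARD_GRAD_OP
        case 3:
            # Shard parameters as well (ZeRO-3 style): lowest memory per rank, at the
            # cost of extra all-gather communication in every forward and backward pass.
            sharding_strategy = ShardingStrategy.FULL_SHARD
        case _:
            raise ValueError(f"Unsupported sharding level: {sharding_level}")

    return FullyShardedDataParallel(
        model,
        device_id=local_rank,
        sharding_strategy=sharding_strategy,
        use_orig_params=True,  # expose the original parameters, as torch.compile expects
    )
```

The default of 2 (SHARD_GRAD_OP) is a middle ground that already removes most of the gradient and optimizer-state redundancy the old ZeroRedundancyOptimizer path targeted, while 3 (FULL_SHARD) is the option to reach for when the parameters themselves no longer fit on a single GPU.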