LightGPT
LightGPT is a lightweight generative pre-trained Transformer (GPT) model for the people! Built using pure PyTorch, LightGPT can generate text, answer questions, summarize documents, and more. A unique feature of LightGPT is that it allows you to train larger models on smaller hardware by taking advantage of memory optimizations wherever possible.
Features
Parameter-efficiency: LightGPT aims to be a parsimonious model by only training parameters that are absolutely necessary. As such, biases and positional embeddings have been removed from the architecture entirely. In addition, the token embeddings and output layer share a single weight matrix, resulting in a buy-one-get-one-free deal on trainable parameters (see the sketch below).
Low Memory Utilization: LightGPT employs a number of training-time optimizations that conserve precious VRAM. With zero-redundancy distributed pre-training using fully-sharded data parallelism (FSDP), activation checkpointing, and automatic mixed precision, you'll be able to train larger models in exchange for a relatively small amount of communication and computational overhead.
Fully Open-source: Unlike closed-source LLMs, LightGPT provides both the model weights and the source code to train, fine-tune, and generate text from the model using your own hardware. With the help of the open-source software community, we aim to democratize AI and continually improve the models.
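As a rough illustration of the weight-tying trick mentioned above, the sketch below shows a toy model whose bias-free output layer reuses the token embedding matrix. This is not the actual LightGPT source; the class name and dimensions are illustrative only.

```python
import torch
from torch import nn

class TinyTiedLM(nn.Module):
    """Toy model demonstrating weight tying between the embeddings and output layer."""

    def __init__(self, vocabulary_size: int, embedding_dimensions: int):
        super().__init__()

        self.token_embeddings = nn.Embedding(vocabulary_size, embedding_dimensions)

        # Bias-free output projection back onto the vocabulary.
        self.output_layer = nn.Linear(embedding_dimensions, vocabulary_size, bias=False)

        # Tie the two layers so they share a single trainable weight matrix.
        self.output_layer.weight = self.token_embeddings.weight

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        hidden = self.token_embeddings(tokens)  # (batch, sequence, embedding)

        # ... attention/MLP blocks would go here ...

        return self.output_layer(hidden)  # (batch, sequence, vocabulary)
```

Because the output projection and the token embeddings point at the same tensor, the vocabulary-sized weight matrix only has to be stored and trained once.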
Install Project Dependencies
Project dependencies are specified in the `requirements.txt` file. You can install them with pip using the following commands from the project root. We recommend using a virtual environment such as venv to keep the package dependencies on your system tidy.
python -m venv ./.venv
source ./.venv/bin/activate
pip install -r requirements.txt
Pre-training
For the pre-training corpus we use the OpenWebText dataset, which consists of about 9B high-quality tokens gathered from the web. In addition, you can add as much pre-training data as you like with a custom dataloader (sketched below). If you'd just like to start training right away, the default settings should work on most single-GPU systems with 12GB of VRAM or more.
python pre-train.py
Note that it will take a while to download and pre-process the dataset the first time that the training script is run.
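If you do want to mix in your own data, one possible approach is a PyTorch IterableDataset that streams fixed-length blocks of pre-tokenized IDs. The sketch below shows the general idea rather than the project's actual data pipeline; the file path and the uint16 token format are placeholders.

```python
import numpy as np
import torch
from torch.utils.data import IterableDataset, DataLoader

class CustomTokenDataset(IterableDataset):
    """Sketch: stream fixed-length blocks of token IDs from a pre-tokenized file."""

    def __init__(self, path: str, block_size: int = 1024):
        super().__init__()

        self.path = path
        self.block_size = block_size

    def __iter__(self):
        # Assumes the corpus was tokenized ahead of time and stored as uint16 IDs.
        tokens = np.memmap(self.path, dtype=np.uint16, mode="r")

        for start in range(0, len(tokens) - self.block_size - 1, self.block_size):
            x = torch.from_numpy(tokens[start : start + self.block_size].astype(np.int64))
            y = torch.from_numpy(tokens[start + 1 : start + self.block_size + 1].astype(np.int64))

            yield x, y  # input block and next-token targets

loader = DataLoader(CustomTokenDataset("./dataset/my_corpus.bin"), batch_size=1)
```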
To customize the default "lightgpt-small" architecture you can adjust the `block_size`, `embedding_dimensions`, `num_hidden_layers`, and `num_attention_heads` arguments of the pre-training script. Refer to the `model_sizing.ipynb` notebook for an estimate of the memory and compute requirements of your chosen architecture.
python pre-train.py --block_size=2048 --embedding_dimensions=4096 --num_hidden_layers=64 --num_attention_heads=64
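For a quick back-of-the-envelope check before opening the notebook, the rough estimate below approximates the trainable parameter count of a bias-free, weight-tied GPT-style architecture with a 4x MLP expansion. The vocabulary size is an assumption and normalization parameters are ignored.

```python
# Rough parameter-count estimate for the example architecture above.
vocabulary_size = 50_000  # assumption, not a LightGPT default
embedding_dimensions = 4096
num_hidden_layers = 64

attention = 4 * embedding_dimensions**2              # Q, K, V, and output projections
mlp = 8 * embedding_dimensions**2                    # two linear layers with a 4x hidden width
embeddings = vocabulary_size * embedding_dimensions  # shared with the output layer

total = num_hidden_layers * (attention + mlp) + embeddings

print(f"~{total / 1e9:.1f}B trainable parameters")  # ~13.1B
```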
You can also adjust the `batch_size`, `learning_rate`, and `gradient_accumulation_steps` arguments to suit your particular training setup.
python pre-train.py --batch_size=32 --learning_rate=0.01 --gradient_accumulation_steps=128
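One way to reason about those knobs: the number of samples (and tokens) contributing to each weight update is the product of `batch_size` and `gradient_accumulation_steps`, times the world size when training on multiple GPUs. Using the example values above on a single GPU:

```python
batch_size = 32
gradient_accumulation_steps = 128
block_size = 1024  # default context length

samples_per_update = batch_size * gradient_accumulation_steps  # 4,096 samples
tokens_per_update = samples_per_update * block_size            # ~4.2M tokens

print(samples_per_update, tokens_per_update)
```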
For distributed training, use PyTorch's torchrun utility to launch a distributed data-parallel session. The example below runs the training script on a single node with 8 GPUs.
torchrun --standalone --nnodes=1 --nproc-per-node=8 pre-train.py --batch_size=16 --gradient_accumulation_steps=128
Note that when training in data-parallel mode it's important that `gradient_accumulation_steps` is evenly divisible by the world size for maximum performance. For example, on an 8-GPU cluster we could perform 32 gradient accumulation steps in exactly 4 passes over the network (32 / 8 = 4).
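For intuition, here is a simplified sketch of one optimizer step with gradient accumulation and automatic mixed precision. It is not the project's actual training loop; it assumes `model`, `optimizer`, and `loader` already exist and that the model returns the loss directly.

```python
import torch

gradient_accumulation_steps = 128
max_gradient_norm = 1.0

optimizer.zero_grad(set_to_none=True)

for step, (x, y) in enumerate(loader, start=1):
    # Run the forward pass in mixed precision to conserve VRAM.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(x.cuda(), y.cuda())

    # Scale the loss so the accumulated gradient is an average rather than a sum.
    (loss / gradient_accumulation_steps).backward()

    if step % gradient_accumulation_steps == 0:
        # Clip, step, and reset once per accumulated batch.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_gradient_norm)

        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```

In a data-parallel run the accumulation steps are split across ranks, so each of the 8 GPUs in the example above would process 32 / 8 = 4 micro-batches between synchronized weight updates.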
Text Generation
After training, you can generate text from the model by running the `generate.py` script from the command line. This inference script samples tokens from the model one at a time, conditioned on the prompt and any previously generated tokens, together referred to as the context window. In the example below we only sample from the `top_k` most probable candidate tokens, further restricted to those within the top `top_p` portion of the cumulative probability mass when ordered descending by predicted probability.
python generate.py --top_k=500 --top_p=0.9
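To make the two knobs concrete, here is a minimal sketch of sampling a next token after top-k and nucleus (top-p) filtering of a logits vector. It illustrates the general technique and is not the code inside generate.py.

```python
import torch

def sample_next_token(logits: torch.Tensor, top_k: int = 500, top_p: float = 0.9) -> torch.Tensor:
    """Sketch: sample the next token after top-k and top-p (nucleus) filtering."""

    # Keep only the top_k highest-scoring candidates.
    top_values, top_indices = torch.topk(logits, top_k)

    probabilities = torch.softmax(top_values, dim=-1)

    # Drop candidates that fall outside the top_p cumulative probability mass.
    cumulative = torch.cumsum(probabilities, dim=-1)
    probabilities[cumulative - probabilities > top_p] = 0.0

    probabilities /= probabilities.sum()

    # Sample from the filtered, renormalized distribution.
    return top_indices[torch.multinomial(probabilities, num_samples=1)]
```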
We also provide a script called `beam_search.py` that searches for entire high-scoring sequences rather than sampling tokens one at a time. Beam Search maintains a list of the top `beam_width` sequence candidates and outputs the `num_candidates` completed sequences with the highest overall priority. It is a form of greedy search that works well for tasks such as text summarization and translation, but it often produces less natural-sounding responses since natural language follows a more stochastic process.
python beam_search.py --beam_width=16 --num_candidates=3
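For intuition, the sketch below shows the core of the technique: every surviving candidate is extended by its most likely next tokens and only the `beam_width` highest-scoring extensions survive to the next step. It glosses over completed-sequence handling and length normalization, and it assumes a `model` that maps a batch of token IDs to next-token logits.

```python
import torch

@torch.no_grad()
def beam_search(model, context: torch.Tensor, beam_width: int = 16, max_tokens: int = 200):
    """Sketch: keep the beam_width highest log-probability sequences at every step."""

    # Each beam is a (tokens, cumulative log-probability) pair.
    beams = [(context, 0.0)]

    for _ in range(max_tokens):
        candidates = []

        for tokens, score in beams:
            logits = model(tokens.unsqueeze(0))[0, -1]  # logits for the next token

            log_probabilities = torch.log_softmax(logits, dim=-1)

            # Extend this beam with its beam_width most likely next tokens.
            top_log_probs, top_tokens = torch.topk(log_probabilities, beam_width)

            for log_prob, token in zip(top_log_probs, top_tokens):
                candidates.append((torch.cat([tokens, token.unsqueeze(0)]), score + log_prob.item()))

        # Keep only the beam_width highest-scoring candidates overall.
        beams = sorted(candidates, key=lambda beam: beam[1], reverse=True)[:beam_width]

    return beams
```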
Instruction-tuning
Soon ...
Pre-training Arguments
Argument | Default | Type | Description |
---|---|---|---|
--batch_size | 1 | int | The number of samples to pass through the network at a time. |
--gradient_accumulation_steps | 128 | int | The number of batches to pass through the network before updating the weights. |
--samples_per_epoch | 4096 | int | The number of training samples to pass through the network every epoch. |
--learning_rate | 5e-4 | float | The global step size taken after every gradient accumulation step. |
--max_gradient_norm | 1.0 | float | Clip gradients above this threshold before stepping. |
--num_epochs | 2145 | int | The number of epochs to train for. |
--eval_interval | 10 | int | Evaluate the model after this many epochs on the testing set. |
--block_size | 1024 | int | The number of tokens within the context window for every sample. |
--embedding_dimensions | 1024 | int | The dimensionality of the token embeddings. |
--num_attention_heads | 16 | int | The number of attention heads within every block. |
--num_hidden_layers | 24 | int | The number of attention/MLP blocks within the hidden layer of the network. |
--dropout | 0.1 | float | The proportion of signals to send to zero during training as regularization. |
--activation_checkpointing | False | bool | Should we use activation checkpointing? |
--ddp_sharding_level | 2 | int | The ZeRO sharding level to use for distributed training: 0 (no sharding), 2 (shard optimizer state and gradients), or 3 (also shard model parameters). |
--checkpoint_interval | 20 | int | Save the model parameters to disk every this many epochs. |
--checkpoint_path | "./out/checkpoint.pt" | string | The path to the checkpoint file on disk. |
--dataset_path | "./dataset" | string | The path to the dataset files on disk. |
--num_dataset_processes | 8 | int | The number of processes (CPUs) to use to process the dataset. |
--resume | False | bool | Should we resume training from the last checkpoint? |
--device | "cuda" | string | The device to run the computation on. |
--seed | None | int | The seed for the random number generator. |
Instruction-tuning Arguments
Argument | Default | Type | Description |
---|---|---|---|
--base_model_path | "./out/checkpoint.pt" | string | The path to the pre-trained model. |
--batch_size | 1 | int | The number of samples to pass through the network at a time. |
--gradient_accumulation_steps | 128 | int | The number of batches to pass through the network before updating the weights. |
--learning_rate | 5e-4 | float | The global step size taken after every gradient accumulation step. |
--mask_input | False | bool | Should we mask the input part of the sample i.e. only train on the output? |
--rank | 8 | int | The rank of the LoRA decomposition matrices. |
--alpha | 1.0 | float | The strength of the LoRA signal. |
--dropout | 0.05 | float | The proportion of signals to send to zero during training as regularization. |
--num_epochs | 4 | int | The number of epochs to train for. |
--eval_interval | 1 | int | Evaluate the model after this many epochs on the testing set. |
--checkpoint_interval | 1 | int | Save the model parameters to disk every this many epochs. |
--checkpoint_path | "./out/lora_instruction.pt" | string | The path to the checkpoint file on disk. |
--resume | False | bool | Should we resume training from the last checkpoint? |
--device | "cuda" | string | The device to run the computation on. |
--seed | None | int | The seed for the random number generator. |
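To clarify what the `--rank` and `--alpha` arguments control, here is a generic sketch of a LoRA-adapted linear layer, not LightGPT's implementation: the frozen base weights are augmented by a trainable low-rank product whose contribution is scaled by alpha.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Sketch: a frozen linear layer plus a trainable low-rank update."""

    def __init__(self, linear: nn.Linear, rank: int = 8, alpha: float = 1.0):
        super().__init__()

        self.linear = linear

        # Freeze the pre-trained weights; only the LoRA matrices are trained.
        for parameter in self.linear.parameters():
            parameter.requires_grad = False

        # Low-rank decomposition matrices A and B.
        self.lora_a = nn.Parameter(torch.randn(rank, linear.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(linear.out_features, rank))

        self.alpha = alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output is W x plus the scaled low-rank correction (B A) x.
        return self.linear(x) + self.alpha * (x @ self.lora_a.T @ self.lora_b.T)
```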
Generation Arguments
Argument | Default | Type | Description |
---|---|---|---|
--checkpoint_path | "./out/checkpoint.pt" | string | The path to the checkpoint file on disk. |
--lora_path | None | string | The path to the LoRA checkpoint. |
--max_tokens | 500 | int | The maximum number of tokens that the model should generate per sample. |
--temperature | 1.0 | float | The temperature used to scale the candidate token logits before sampling. Higher values produce more diverse output. |
--top_k | 500 | int | Only sample from this many candidate tokens with the highest probabilities. |
--top_p | 0.9 | float | Of the top_k tokens, drop all but the top_p portion of the cumulative probability distribution. |
--device | "cuda" | string | The device to run the computation on. |
--seed | None | int | The seed for the random number generator. |
Beam Search Arguments
Argument | Default | Type | Description |
---|---|---|---|
--checkpoint_path | "./out/checkpoint.pt" | string | The path to the checkpoint file on disk. |
--lora_path | None | string | The path to the LoRA checkpoint. |
--max_tokens | 200 | int | The maximum number of tokens that the model should generate per sample. |
--num_candidates | 3 | int | The number of candidate sequences to output. |
--beam_width | 16 | int | The number of candidate sequences to keep track of during search. |
--device | "cuda" | string | The device to run the computation on. |
--seed | None | int | The seed for the random number generator. |
References:
- A. Radford, et al. Language Models are Unsupervised Multitask Learners, OpenAI, 2019.
- T. Brown, et al. Language Models are Few-Shot Learners. OpenAI, 2020.
- A. Kazemnejad, et al. The Impact of Positional Encoding on Length Generalization in Transformers, 37th Conference on Neural Information Processing Systems (NeurIPS 2023).
- S. Rajbhandari, et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, 2020.
- J. R. Hermans, et al. Accumulated Gradient Normalization, JMLR: Workshop and Conference Proceedings, 2017.
- T. Chen, et al. Training Deep Nets with Sublinear Memory Cost. MIT, 2019.
- B. Zhang, et al. Root Mean Square Layer Normalization. 33rd Conference on Neural Information Processing Systems, NeurIPS 2019.