arxiv:2403.03507

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Published on Mar 6, 2024
· Submitted by akhaliq on Mar 7, 2024
#1 Paper of the day

Abstract

Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.
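As a rough illustration of the idea in the abstract, the sketch below keeps the weight matrix full-rank but stores Adam's moment estimates only for a rank-r projection of the gradient. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names `galore_projector` and `galore_adam_step`, the toy shapes, and the hyperparameters are all hypothetical, the projector is computed once from a single gradient rather than refreshed periodically, and any extra scaling of the update is omitted.

```python
import numpy as np

def galore_projector(grad, rank):
    """Return the top-`rank` left singular vectors of `grad` (shape m x rank)."""
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank]

def galore_adam_step(weight, grad, proj, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One Adam step whose moment estimates live in the rank-r subspace."""
    b1, b2 = betas
    low = proj.T @ grad                          # projected gradient, rank x n
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * low
    state["v"] = b2 * state["v"] + (1 - b2) * low**2
    m_hat = state["m"] / (1 - b1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - b2 ** state["t"])  # bias-corrected second moment
    update = proj @ (m_hat / (np.sqrt(v_hat) + eps))  # project back to m x n
    return weight - lr * update

# Toy example with hypothetical shapes.
rng = np.random.default_rng(0)
m, n, rank = 64, 128, 4
weight = rng.standard_normal((m, n))
grad = rng.standard_normal((m, n))

proj = galore_projector(grad, rank)
state = {"t": 0, "m": np.zeros((rank, n)), "v": np.zeros((rank, n))}
new_weight = galore_adam_step(weight, grad, proj, state)
```

The optimizer state here is two rank x n arrays plus the m x rank projector, instead of two full m x n arrays, which is where the memory saving on optimizer states would come from in this simplified picture.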

Community

We need official GitHub code, please, and HF integration... What a cool project!

I would also like to see as much source code as possible, please. Very much appreciated.


Some thoughts / ideas, I don't know if they make sense or not:

  1. Instead of r being a hyperparameter, could it be set by a threshold on the singular values? Perhaps even using random matrix theory to find the spectral threshold separating signal from noise?

  2. Instead of T being a hyperparameter, could you measure how "diagonal" P_t^T G_t Q_t is? I believe the intuition in the paper is that we'd like to regularly "refresh" the principal directions corresponding to the full-rank supports in case they drift over time. As I understand it, the projected gradient is initially just the diagonal matrix of singular values, and it'll drift away from that structure over time (I'm making a big assumption that this drift is gradual and inversely related to how well P, Q still act as the principal directions). It seems like you could quantify that drift somehow and use it to decide whether P, Q are still good principal directions for the gradient updates.
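The two ideas above could be prototyped roughly as follows. This is only a sketch of the commenter's suggestions, not something from the paper: `rank_from_threshold`, `diagonality`, the relative threshold, and the synthetic "drifted" gradient are all hypothetical choices made for illustration.

```python
import numpy as np

def rank_from_threshold(singular_values, rel_threshold=0.1):
    """Count singular values that are at least rel_threshold times the largest one."""
    cutoff = rel_threshold * singular_values[0]
    return max(1, int(np.sum(singular_values >= cutoff)))

def diagonality(p, g, q):
    """Fraction of the squared Frobenius norm of P^T G Q that lies on its diagonal."""
    core = p.T @ g @ q
    total = np.sum(core**2)
    return float(np.sum(np.diag(core) ** 2) / total) if total > 0 else 1.0

rng = np.random.default_rng(0)
g0 = rng.standard_normal((32, 48))              # gradient at refresh time
u, s, vt = np.linalg.svd(g0, full_matrices=False)

r = rank_from_threshold(s, rel_threshold=0.5)   # idea 1: rank chosen from the spectrum
p, q = u[:, :r], vt[:r, :].T

fresh = diagonality(p, g0, q)                   # P^T G Q is diagonal right after the SVD
g1 = g0 + 0.5 * rng.standard_normal(g0.shape)   # a later, perturbed gradient
drifted = diagonality(p, g1, q)                 # idea 2: lower score suggests stale P, Q
```

A refresh rule could then recompute P, Q whenever the score falls below some chosen level, instead of every T steps. Whether the drift is gradual enough for this to work is exactly the open assumption raised in the comment.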

For Figure 1.

Could you also show how 8-bit Adam + per-layer weight updates, but without the rank reduction on the gradient update, would have affected memory use? Based on the LOMO paper (https://arxiv.org/abs/2306.09782), it seems this would also significantly reduce the light-green part of the memory use, since the gradient is consumed and discarded at each layer immediately?

Paper author · edited Mar 10, 2024

Thanks for your comments! We have a third-party evaluation here: https://github.com/jiaweizzhao/GaLore/issues/6. GaLore alone (without per-layer weight updates) achieves a memory reduction comparable to that of per-layer weight updates. They are orthogonal techniques. By combining them, you can run 7B pre-training within 24 GB of memory (e.g., on an RTX 4090).

Very powerful technology.

Incredible paper! I'm excited to see how this unfolds over time. I've become a fan of LoRA's small update footprint, especially for serving. But for some use cases I can see wanting more performance.

I'd also be curious to see:

  • Downstream task performance across diverse tasks/metrics
  • Memory scenarios for common use cases. How much of a benefit do I get from GaLore vs. LoRA or others, or are they all pretty similar?
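For a rough feel of the optimizer-state side of that comparison, here is a back-of-the-envelope count for a single m x n weight matrix under Adam. It is a sketch under assumptions: m <= n, GaLore stores one m x r projector plus two r x n moment arrays, and LoRA stores two moments for each of its adapters B (m x r) and A (r x n). The layer shape and rank are illustrative only, and weights, gradients, and activations are not counted.

```python
def optimizer_state_elements(m, n, r):
    """Element counts of Adam optimizer state for one m x n weight (assumes m <= n)."""
    full_adam = 2 * m * n              # first and second moments, full size
    galore = m * r + 2 * n * r         # projector (m x r) + two moments (r x n)
    lora = 2 * (m * r + n * r)         # two moments each for B (m x r) and A (r x n)
    return {"full_adam": full_adam, "galore": galore, "lora": lora}

# Illustrative layer shape and rank, not taken from the paper's tables.
counts = optimizer_state_elements(m=4096, n=11008, r=128)
for name, elements in counts.items():
    print(f"{name:>9}: {elements:,} elements")
```

Under these assumptions the two low-rank methods end up in the same ballpark for optimizer state, so any difference in total memory would come mostly from the other components.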

GaLore: Revolutionizing LLM Training with Memory-Efficient Gradient Projections

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix

Looking for clarification on Figure 1: what are the batch size, sequence length, and vocab size here? I would expect activations to take up more space...

  • batch size seems to be 256 based on Fig. 1 caption
  • sequence len seems to be 2048, based on footnote 1
  • vocab size is 32000, based on config from repo
  • bf16 used so 2 bytes per float, based on footnote 2

So the logits of the model alone should take up 256 * 2048 * 32000 * 2 bytes, or 31.25 GiB. Where is this required memory in Figure 1?

Thanks!
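The arithmetic in the question above can be checked with a few lines. The values are the ones the commenter inferred from the figure caption, footnotes, and repo config; treating all 256 sequences as materialized in a single forward pass is the commenter's assumption, not something the paper states.

```python
# Values as inferred in the comment above (assumptions, not confirmed settings).
batch_size = 256
seq_len = 2048
vocab_size = 32000
bytes_per_value = 2  # bf16

logits_bytes = batch_size * seq_len * vocab_size * bytes_per_value
logits_gib = logits_bytes / 2**30   # binary gigabytes (GiB)
logits_gb = logits_bytes / 10**9    # decimal gigabytes (GB)

print(f"logits: {logits_bytes:,} bytes = {logits_gib} GiB = {logits_gb} GB")
```

The 31.25 figure is in GiB; in decimal gigabytes the same tensor is a little over 33.5 GB.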

It seems like it would be possible to combine LoRA and GaLore (and their quantized counterparts, QLoRA and QGaLore) to further reduce the memory footprint by applying GaLore to the gradients of the LoRA matrices A and B. Has anyone tried experimenting with this? I can't find this in the paper, as the authors mostly present their work as a complete alternative to LoRA.
