Abstract
Low-Rank Adaptation (LoRA) is a widely used parameter-efficient finetuning method for large language models. LoRA saves memory by training only low-rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning (≈100K prompt-response pairs) and continued pretraining (≈10B unstructured tokens) data regimes. Our results show that, in most settings, LoRA substantially underperforms full finetuning. Nevertheless, LoRA exhibits a desirable form of regularization: it better maintains the base model's performance on tasks outside the target domain. We show that LoRA provides stronger regularization compared to common techniques such as weight decay and dropout; it also helps maintain more diverse generations. We show that full finetuning learns perturbations with a rank that is 10-100X greater than typical LoRA configurations, possibly explaining some of the reported gaps. We conclude by proposing best practices for finetuning with LoRA.
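To make "training only low-rank perturbations" concrete, here is a minimal plain-Python sketch, not the paper's code. It assumes the usual LoRA form in which a frozen weight matrix W is adapted as W + BA, with B of shape d_out x r and A of shape r x d_in. The function names, the 4096 x 4096 layer shape, and the rank of 16 are all illustrative assumptions rather than settings taken from the paper.

```python
# Illustrative sketch only: shapes, rank, and function names are assumptions,
# not the configuration used in the paper.

def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix W is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in the factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def matmul(x, y):
    """Plain-Python matrix product for small nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_update(w, b, a):
    """Return W + B A; W stays frozen and only B and A would be trained."""
    delta = matmul(b, a)  # the perturbation has rank at most len(a)
    return [[wij + dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Hypothetical 4096 x 4096 projection with rank 16.
print(full_params(4096, 4096))      # 16777216
print(lora_params(4096, 4096, 16))  # 131072

# Tiny rank-1 example: the perturbation B A is [[3, 4], [6, 8]].
w = [[1, 0], [0, 1]]
b = [[1], [2]]
a = [[3, 4]]
print(lora_update(w, b, a))         # [[4, 4], [6, 9]]
```

Under these assumed shapes the adapter trains 128 times fewer parameters for that one matrix, which is where the memory saving comes from. The rows of the rank-1 perturbation are multiples of each other, which illustrates the constraint that full finetuning does not face.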
Community
@wuaoscotty123 just following up here. The final accepted version of the paper includes all the target modules, including the gate_proj: https://openreview.net/forum?id=aloEru2qCG. We've also uploaded the model weights/adapters for all checkpoints here: https://huggingface.co/LoRA-TMLR-2024
Good question. We mention this on page 3, footnote 1. We excluded gate_proj for historical reasons and to allow comparison with other 7B transformer architectures that lack gate_proj. Given the results, I anticipate that including it would increase target-domain performance but also increase forgetting.
We've released model checkpoints and LoRA adapters as research artifacts here: https://huggingface.co/LoRA-TMLR-2024