Question about Gradient Accumulation step in Trainer

ben9004 · September 10, 2021, 8:38am

I can see that gradient accumulation step helps to increase batch size.

and also I can understand if the model has Batch Norm layer, gradient accumulation will not guarantee the exact same performance as the model that we trained in a large batch size (not using accumulation)

but, most of models in Transformers are based on transformer architecture which utilizes layer normalization,

so does that mean Can I guarantee that the trained model would give same metric performance in both ways? (e.g batch size 64 with 4 batch per device * 4 gpus * 4 accumulation step == batch size 64 with 16 batch per device * 4 gpus )

In short,
My question is for transformer models which use layer normalization, will give same model performance between train batch size in once and using gradient accumulation steps

sgugger · September 10, 2021, 1:18pm

Yes, layer normalization does track statistics, so you will get the exact same thing with 4 batch size * 4 gradient accumulation or 16 batch size (and that would not be the case with neural nets using BatchNorm indeed).

ben9004 · September 10, 2021, 2:50pm

Thanks! your reply has been so much help to me

Topic		Replies	Views
Questions about steps with gradient accumulation Beginners	1	993	July 19, 2023
Batch size vs gradient accumulation Beginners	9	26570	November 28, 2024
What is the limit of grad accumulation? Intermediate	2	2765	May 4, 2021
Does "ViT-B/16" use batch or group normalization? Beginners	0	153	January 15, 2024
Selecting batch_size and gradient_accumulation_steps when fine-tuning Models	1	1904	December 31, 2023

Question about Gradient Accumulation step in Trainer

Related topics