Batch size vs gradient accumulation

Hi,

I have a basic theoretical question. Which one is better for the model and GPU usage?

First option:

--per_device_train_batch_size 8
--gradient_accumulation_steps 2

Second option:

--per_device_train_batch_size 16

If the second one does not OOM, you should get better performance (faster training) with it. The first is a way to get around the memory error the second would give you.

Otherwise, both commands are completely equivalent in terms of the training done: each uses an effective batch size of 16 per device.
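To make the equivalence concrete, here is a minimal sketch in plain Python (not the Trainer's actual implementation). It assumes a mean-reduced loss and equal-sized micro-batches; the toy model, the data, and the helper `mse_grad` are made up for illustration. Layers that depend on batch statistics, such as batch norm, would not be exactly equivalent.

```python
# Toy model y_hat = w * x trained with mean squared error.
# All data and names here are hypothetical, for illustration only.
xs = [float(i) for i in range(1, 17)]   # 16 samples
ys = [2.0 * x + 1.0 for x in xs]


def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5

# Second option: a single batch of 16.
grad_full = mse_grad(w, xs, ys)

# First option: two micro-batches of 8, gradients accumulated.
micro_batch_size = 8
accumulation_steps = 2
grad_accum = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    xb = xs[start:start + micro_batch_size]
    yb = ys[start:start + micro_batch_size]
    # Scale each micro-batch gradient so the sum is a mean over all 16 samples.
    grad_accum += mse_grad(w, xb, yb) / accumulation_steps

# Both options produce the same gradient for the optimizer step.
assert abs(grad_full - grad_accum) < 1e-9
```

The division by `accumulation_steps` is what keeps the accumulated gradient a mean rather than a sum over the micro-batches.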

Hi @sgugger

It would be better if you could supply some references on this.

A source is not necessary for this, I think. The goal of gradient accumulation is exactly to overcome memory constraints of the hardware.

So where does the difference in performance between training with gradient accumulation (GA) and without it come from, as @sgugger mentioned in his answer?

I am not sure that it involves only the hardware.

Using gradient accumulation loops over your forward and backward pass (the number of steps in the loop being the number of gradient accumulation steps). A for loop over the model is less efficient than feeding more data to the model, as you’re not taking advantage of the parallelization your hardware can offer.

The only reason to use gradient accumulation is when your whole batch does not fit on one GPU, so you pay a price in terms of speed to overcome a memory issue.
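The loop described above can be sketched as follows. This is an illustrative toy in plain Python, not how any particular library implements it; the function `train`, its parameters, and the data are assumptions. It only counts sequential forward/backward passes per optimizer step and does not measure real GPU time.

```python
def train(xs, ys, micro_batch_size, accumulation_steps, lr=1e-3, w=0.0):
    """Plain SGD on y_hat = w * x with MSE loss (hypothetical toy setup).

    Returns (final_w, forward_backward_passes, optimizer_steps).
    """
    passes = 0
    optimizer_steps = 0
    grad = 0.0
    for start in range(0, len(xs), micro_batch_size):
        xb = xs[start:start + micro_batch_size]
        yb = ys[start:start + micro_batch_size]
        # One forward + backward pass on a micro-batch. These passes run
        # one after another, which is the cost of gradient accumulation.
        micro_grad = sum(2.0 * (w * x - y) * x for x, y in zip(xb, yb)) / len(xb)
        grad += micro_grad / accumulation_steps
        passes += 1
        if passes % accumulation_steps == 0:
            w -= lr * grad       # optimizer step once per accumulation window
            grad = 0.0           # reset the accumulated gradient
            optimizer_steps += 1
    return w, passes, optimizer_steps


xs = [float(i % 7) for i in range(64)]
ys = [3.0 * x for x in xs]

w_accum, passes_accum, steps_accum = train(xs, ys, micro_batch_size=8, accumulation_steps=2)
w_full, passes_full, steps_full = train(xs, ys, micro_batch_size=16, accumulation_steps=1)

print(passes_accum, steps_accum)  # sequential passes vs optimizer steps with accumulation
print(passes_full, steps_full)    # the same without accumulation
```

Both runs take the same number of optimizer steps and end at (numerically) the same weight, but the accumulation run needs twice as many sequential passes to get there. In exchange, only 8 samples' worth of activations has to be held in memory at a time.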

As far as I am aware, the common rule of thumb for selecting the batch size is “as big as your hardware can support”. For example, the most recent leaks concerning GPT-4’s training suggest that a staggering batch size of 60M was used. This makes me wonder how an engineer should balance the batch size and gradient accumulation steps hyperparameters. For example, at what point do the potential drawbacks of increasing gradient accumulation steps outweigh the benefits attained by using large batch sizes? (I guess this question pertains specifically to the clear performance benefits of large batch sizes versus the possible convergence benefits.)

such a beautiful answer bro, it just clicked

Isn’t it the opposite? Using batched input results in higher memory usage, not gradient accumulation. If gradient accumulation is giving an OOM (out-of-memory) error, it is guaranteed that the first one will also give the same error.
