How was the total training time decided?

by Harveenchadha - opened

How did the BigScience team decide the total number of steps and training time?

@Harveenchadha, this might not fully answer your question, but there are some details on this page that you might find informative: /static-proxy?

It does not look converged to me. Both the training and the validation curves suggest that longer training would be beneficial.

BigScience Workshop org
edited Feb 3, 2023

The training time corresponds to one full pass over the training corpus, aka one "epoch".
Training for significantly more than 1 epoch (e.g. 2+ full epochs) would take more compute than was available.
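To make the "one epoch" budget concrete, here is a rough sketch of how a step count follows from corpus size and batch shape. All numbers and the `steps_for_epochs` helper are hypothetical placeholders for illustration, not the actual values used for this model.

```python
import math


def steps_for_epochs(corpus_tokens: int,
                     sequences_per_batch: int,
                     sequence_length: int,
                     epochs: float = 1.0) -> int:
    """Optimizer steps needed to see `corpus_tokens` tokens `epochs` times."""
    # Each step consumes one global batch of fixed-length sequences.
    tokens_per_step = sequences_per_batch * sequence_length
    return math.ceil(epochs * corpus_tokens / tokens_per_step)


# Placeholder values (assumptions, not the real training configuration).
corpus_tokens = 1_000_000_000
sequences_per_batch = 512
sequence_length = 2048

one_epoch = steps_for_epochs(corpus_tokens, sequences_per_batch, sequence_length)
two_epochs = steps_for_epochs(corpus_tokens, sequences_per_batch, sequence_length, epochs=2)
print(one_epoch, two_epochs)
```

With a fixed batch shape and hardware throughput, the step count (and so the wall-clock time and compute) scales linearly with the number of epochs, which is why a second full pass would roughly double the budget.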

In principle, any party that has several servers with A100s can download the model and continue training.
