My question is about the transformers.TrainingArguments class. There are two parameters:
- save_total_limit
- load_best_model_at_end
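For context, this is roughly how I am setting them. A minimal sketch with placeholder values; `metric_for_best_model="eval_loss"` is just an example:

```python
from transformers import TrainingArguments

# minimal sketch of the relevant arguments (other values are placeholders)
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=50,            # keep only the 50 most recent checkpoints
    load_best_model_at_end=True,    # reload the best checkpoint at the end of training
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```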
Q1. Let’s say I have set save_total_limit=50, but the best model according to the metric is not among the last 50 checkpoints; maybe it is somewhere in the last 200 checkpoints. In that case, will load_best_model_at_end select the best model from the last 50 checkpoints only, or from the entire training run?
Q2. The problem with this is that we don’t always have a lot of SSD (or even regular) storage available to train the model, so save_total_limit is a feature constrained by an individual’s disk space. Applying save_total_limit to the best checkpoints instead would be a great feature: that way you could even ensemble multiple checkpoints (which may be good for generation tasks).
So is there any way to save the “best 5 checkpoints” (or best X) from the entire training run? A rough sketch of what I have in mind is below.
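This is not meant as a working solution, just the kind of custom callback I am imagining. The class name SaveBestKCheckpointsCallback is mine, and the sketch assumes evaluation and saving happen at the same steps (so the metric seen in on_evaluate belongs to the checkpoint written in on_save), with save_total_limit left unset:

```python
import os
import shutil
from transformers import TrainerCallback

class SaveBestKCheckpointsCallback(TrainerCallback):
    """Keep only the k checkpoints with the best eval metric seen so far."""

    def __init__(self, k=5, metric_name="eval_loss", greater_is_better=False):
        self.k = k
        self.metric_name = metric_name
        self.greater_is_better = greater_is_better
        self.scores = {}  # maps global_step -> metric value

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        # remember the metric for the step that is about to be saved
        if metrics and self.metric_name in metrics:
            self.scores[state.global_step] = metrics[self.metric_name]

    def on_save(self, args, state, control, **kwargs):
        # rank all checkpoints we have scores for, best first
        ranked = sorted(
            self.scores.items(),
            key=lambda kv: kv[1],
            reverse=self.greater_is_better,
        )
        # delete every checkpoint directory that is not in the top k
        for step, _ in ranked[self.k:]:
            path = os.path.join(args.output_dir, f"checkpoint-{step}")
            if os.path.isdir(path):
                shutil.rmtree(path)
```

It would be passed to the Trainer via `callbacks=[SaveBestKCheckpointsCallback(k=5)]`. I am not sure whether something like this already exists or whether it would conflict with the built-in checkpoint rotation, which is part of my question.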
Note: I tried to read the source code, but there are too many callback functions to deal with. It would save a lot of time if someone could help.